What "training" actually means (and what it doesn't)
When most people say they want to "train a chatbot on their website," they picture something like a machine learning researcher feeding data into a neural network over days of compute time. That's not what happens here — and it's important to understand the distinction, because it changes your expectations about cost, time, and how updates work.
What actually happens is a three-step process: crawl, embed, and index. The crawler reads your web pages. An embedding model converts your content into vector representations of meaning. Those vectors are stored in a database that the bot queries at runtime. The underlying language model — the one that writes human-sounding responses — is never modified. You're not changing the AI; you're building a specialised library that the AI can reference.
This approach has two major advantages over actual fine-tuning. First, it's fast: the whole process takes minutes rather than hours or days. Second, it's updatable: when your content changes, you re-crawl and the knowledge base reflects the new content as soon as the crawl finishes, with no retraining required. The "training" metaphor is intuitive but misleading; "indexing" is more accurate.
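To make the crawl, embed, and index steps concrete, here is a minimal sketch in Python. It is not Sitepilot's implementation: the `PAGES` dictionary stands in for crawled pages, the page text is invented, and the word-count "embedding" is a toy substitute for a real embedding model. The point is only that the language model is never touched; all the work happens in the index.

```python
import math
import re
from collections import Counter

# Hypothetical pages standing in for what a crawler would fetch.
PAGES = {
    "/pricing": "The Pro plan costs 49 dollars per month and includes five seats.",
    "/faq": "You can cancel your subscription at any time from the billing settings.",
    "/features": "The analytics dashboard shows chat volume and unanswered questions.",
}

def embed(text):
    """Toy embedding: a word-count vector. A real system would use a neural model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(pages):
    """The 'index' step: store each page alongside its vector."""
    return [(url, text, embed(text)) for url, text in pages.items()]

def retrieve(index, question):
    """Runtime lookup: return the URL of the page most similar to the question."""
    q = embed(question)
    return max(index, key=lambda entry: cosine(q, entry[2]))[0]

index = build_index(PAGES)
print(retrieve(index, "How do I cancel my subscription?"))  # prints /faq
```

In this sketch, "re-crawling" is nothing more than calling `build_index` again on the updated pages, which is why updates need no retraining.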
What content gets crawled
The crawler visits publicly accessible pages starting from your root domain. In practice, this means:
- Crawled: Homepage, product/feature pages, pricing page, FAQ page, blog posts, documentation site (if it's a public subdomain or subdirectory), about/company pages, use case pages, changelog.
- Not crawled: Pages behind authentication (login required), content rendered exclusively by client-side JavaScript with no server-rendered HTML, pages explicitly blocked by your robots.txt, and URLs not linked from any other page on your site.
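The crawler follows much the same rules as search engine crawlers. As a rule of thumb, if Google can index a page, Sitepilot can crawl it. The main exception is JavaScript-rendered content, which Google can often index but which the crawler may miss (more on that below). If Google can't index a page, Sitepilot can't crawl it either, unless you supplement the crawl with manually uploaded content.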
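If you want to check which of your URLs a robots.txt file blocks, Python's standard library can evaluate the rules for you. This is a rough sketch: the robots.txt content and URLs are invented, and `ExampleBot` is a placeholder user agent, not the name Sitepilot's crawler actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /drafts/
"""

def allowed_urls(robots_txt, urls, user_agent="ExampleBot"):
    """Return the URLs that the given robots.txt rules allow this user agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

urls = [
    "https://example.com/pricing",
    "https://example.com/admin/users",
    "https://example.com/drafts/new-plan",
]
print(allowed_urls(ROBOTS_TXT, urls))  # prints ['https://example.com/pricing']
```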
One important nuance: JavaScript-heavy single-page applications (SPAs) can be tricky. If your documentation site renders all content via React with no SSR and no static HTML, the crawler may see an empty page or a loading spinner. If you're not sure, test by viewing your page's source HTML (right-click → View Page Source in Chrome) — if the text content of your page doesn't appear in the source HTML, it's likely JavaScript-rendered and may not be crawled correctly.
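The View Page Source test can also be automated. The sketch below, which assumes nothing about how Sitepilot's crawler works internally, checks whether a phrase appears in the visible text of raw HTML while ignoring anything inside script and style tags. The two sample pages are invented.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def text_in_source(html, phrase):
    """True if the phrase appears in the raw HTML's visible text (case-insensitive)."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return phrase.lower() in text.lower()

# Hypothetical pages: one server-rendered, one rendered only by JavaScript.
SERVER_RENDERED = "<html><body><h1>Pricing</h1><p>The Pro plan costs $49 per month.</p></body></html>"
CLIENT_RENDERED = '<html><body><div id="root"></div><script>render("The Pro plan costs $49 per month.")</script></body></html>'

print(text_in_source(SERVER_RENDERED, "Pro plan costs $49"))  # prints True
print(text_in_source(CLIENT_RENDERED, "Pro plan costs $49"))  # prints False
```

To run this against a live page, fetch the HTML first (for example with `urllib.request.urlopen`) and pass the decoded response body to `text_in_source`.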
How to structure your website for the best chatbot results
The quality of your chatbot's answers depends heavily on the quality and structure of your website content. A few structural improvements can noticeably improve answer accuracy:
- Create a dedicated FAQ page. Questions and their answers are the ideal format for a chatbot knowledge base. Group common questions by topic and write clear, complete answers. Avoid relying on accordion/collapse UI elements that hide text from crawlers.
- Use specific product feature pages. Instead of a single "Features" page with short bullet points, create a separate page for each major feature with full explanations of what it does, how it works, and who it's for. Each page becomes a richer source of retrievable content.
- Use clear headings. The chunking algorithm uses heading tags (h1, h2, h3) as natural split points. A well-headed page will produce cleaner, more coherent chunks than a wall of text with no headings.
- Keep your pricing on a dedicated page with specific numbers. Pricing questions are among the most common support queries. A dedicated pricing page with clear tier names, specific prices, and feature comparisons gives the bot the exact content it needs to answer accurately.
- Avoid putting critical information only in images. Image alt text is crawled, but the text in a pricing table screenshot or a feature infographic is not. If key information exists only in an image, the bot won't be able to retrieve it. Convert that content to HTML text.
- Keep your changelog or release notes public and well-formatted. When customers ask "does your product support X?", they often mean "did you recently add X?" A well-maintained changelog lets the bot give temporally accurate answers.
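To see why headings matter so much, here is a simplified sketch of heading-based chunking. It illustrates the general idea of splitting at h1, h2, and h3 tags and is not Sitepilot's actual chunking code; the class name, the output shape, and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3"}

class HeadingChunker(HTMLParser):
    """Splits page text into chunks, starting a new chunk at each h1/h2/h3."""

    def __init__(self):
        super().__init__()
        self.chunks = []           # list of {"heading": str, "text": str}
        self._in_heading = False
        self._heading_parts = []
        self._body_parts = []
        self._title = ""

    def flush(self):
        """Close the current chunk, if it has any body text."""
        text = " ".join(" ".join(self._body_parts).split())
        if text:
            self.chunks.append({"heading": self._title, "text": text})
        self._body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.flush()           # a heading ends the previous chunk
            self._in_heading = True
            self._heading_parts = []

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
            self._title = " ".join(" ".join(self._heading_parts).split())

    def handle_data(self, data):
        target = self._heading_parts if self._in_heading else self._body_parts
        target.append(data)

def chunk_by_headings(html):
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush()                 # emit the final chunk
    return parser.chunks

# Hypothetical page content.
PAGE = (
    "<h1>Billing</h1><p>We bill monthly.</p>"
    "<h2>Refunds</h2><p>Refunds take 5 days.</p><p>Contact support to start one.</p>"
)
for chunk in chunk_by_headings(PAGE):
    print(chunk["heading"], "->", chunk["text"])
```

A page with no headings at all would come out of this sketch as one large chunk with an empty heading, which is the "wall of text" case described above.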
What to do for content that can't be crawled
Some of your most valuable support content might not be on your public website. Internal wikis, PDF product specs, onboarding guides, and pricing sheets for enterprise tiers all contain information your bot should know about. Sitepilot provides three ways to add this content manually:
- Raw text paste: In the knowledge base editor, you can paste any text directly. This is the fastest option for short pieces of content like a specific FAQ that doesn't exist on your site yet.
- PDF upload: Upload PDF files and they'll be parsed, chunked, and embedded just like web pages. Useful for product documentation, terms of service, or technical specs that live in PDF form.
- Manual Q&A pairs: You can add specific question-and-answer pairs that bypass retrieval entirely and are always returned for matching questions. Use this for your highest-stakes content — your return policy, your cancellation process, your data security commitments — where you want exact control over the answer.
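The "bypass retrieval" behaviour of manual Q&A pairs can be pictured as a lookup that runs before the normal search. The sketch below is an illustration only: how Sitepilot decides that a question "matches" is not specified here, so the normalised exact match is an assumption, and the function names and sample answer are placeholders.

```python
import re

def normalize(question):
    """Lowercase and strip punctuation so trivial variations still match."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))

def build_manual_qa(pairs):
    """Build a lookup table from (question, answer) pairs."""
    return {normalize(q): a for q, a in pairs}

def answer(question, manual_qa, retrieve):
    """Check manual pairs first; fall back to retrieval only if nothing matches."""
    key = normalize(question)
    if key in manual_qa:
        return manual_qa[key], "manual"
    return retrieve(question), "retrieval"

# Placeholder content; 'fallback' stands in for the normal retrieval path.
qa = build_manual_qa([("What is your return policy?", "Returns: placeholder text.")])
fallback = lambda q: "retrieved answer"

print(answer("what is your RETURN policy??", qa, fallback))
print(answer("Do you ship overseas?", qa, fallback))
```

The design point is the ordering: because the manual table is consulted first, the wording of a high-stakes answer never depends on what the retrieval step happens to find.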
Re-crawling after content updates
The most common support chatbot failure mode isn't a technical bug — it's stale content. A customer asks about your new enterprise tier, which launched two weeks ago, and the bot has no idea what they're talking about because the knowledge base was last crawled three months ago.
The fix is simple but requires discipline: re-crawl whenever you make meaningful content updates. "Meaningful" means changes that affect what customers might ask about — new features, changed pricing, updated policies, new documentation pages. Minor copy edits and style tweaks don't require a re-crawl.
Signs that your knowledge base might be stale: the bot answers questions about old features or discontinued plans; it can't answer questions about something you know is on your website; customers say the bot gave them wrong information that was previously correct. Any of these is a trigger to re-crawl immediately.
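If you'd rather not rely on memory to spot stale content, one option is to fingerprint each page's text at crawl time and compare later. This is a sketch of that idea and not a Sitepilot feature; the function names and page contents are made up. Note that it flags any text change, so deciding whether a change is "meaningful" is still up to you.

```python
import hashlib

def fingerprint(text):
    """Hash the page text, ignoring differences in whitespace only."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def changed_pages(last_crawl, current):
    """Return URLs that are new, or whose text differs from the last crawl.

    last_crawl maps URL -> fingerprint saved at crawl time.
    current maps URL -> page text as it is today.
    """
    return sorted(
        url for url, text in current.items()
        if last_crawl.get(url) != fingerprint(text)
    )

# Hypothetical snapshot from the last crawl and the site as it is now.
old = {"/pricing": fingerprint("Pro: $49"), "/faq": fingerprint("Cancel any time.")}
new = {"/pricing": "Pro: $59", "/faq": "Cancel  any\ntime.", "/enterprise": "Contact sales."}

print(changed_pages(old, new))  # prints ['/enterprise', '/pricing']
```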
Testing your chatbot's knowledge
After every crawl, run a structured knowledge test before you publicise the bot. Here's the method: write down the 10 most common questions you receive via email or your helpdesk. Ask each one to the bot in its exact form, then rephrase each one in two or three different ways. Grade each answer as correct, incomplete, or wrong.
For every failed or incomplete answer, your next task is clear: find where on your website that information should be, improve that content, and re-crawl. Chatbot failures are almost always a content audit in disguise. The questions your bot can't answer are exactly the questions your website doesn't explain clearly enough — and fixing those gaps helps both your bot and your SEO at the same time.
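A spreadsheet is enough to track the grading, but if you prefer a script, the tally can be sketched as follows. The grade labels come from the method above; the function name and sample questions are illustrative assumptions.

```python
from collections import Counter

GRADES = ("correct", "incomplete", "wrong")

def summarize(results):
    """Tally grades and list the questions that point to content gaps.

    results is a list of (question, grade) tuples.
    """
    for _, grade in results:
        if grade not in GRADES:
            raise ValueError(f"unknown grade: {grade!r}")
    counts = Counter(grade for _, grade in results)
    needs_work = [q for q, grade in results if grade != "correct"]
    return {g: counts[g] for g in GRADES}, needs_work

# Hypothetical test run: one original question plus rephrasings.
results = [
    ("How do I cancel?", "correct"),
    ("Can I cancel my plan?", "incomplete"),
    ("What does Pro cost?", "wrong"),
]
counts, todo = summarize(results)
print(counts)  # prints {'correct': 1, 'incomplete': 1, 'wrong': 1}
print(todo)    # prints ['Can I cancel my plan?', 'What does Pro cost?']
```

The `todo` list is your content audit: each entry is a question your website should answer more clearly before the next crawl.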