How to Train a Chatbot on Your Website Content

When people say 'train a chatbot on your website', they usually don't mean fine-tuning a model. Here's what actually happens and how to get the best results.

What "training" actually means (and what it doesn't)

When most people say they want to "train a chatbot on their website," they picture something like a machine learning researcher feeding data into a neural network over days of compute time. That's not what happens here — and it's important to understand the distinction, because it changes your expectations about cost, time, and how updates work.

What actually happens is a three-step process: crawl, embed, and index. The crawler reads your web pages. An embedding model converts your content into vector representations of meaning. Those vectors are stored in a database that the bot queries at runtime. The underlying language model — the one that writes human-sounding responses — is never modified. You're not changing the AI; you're building a specialised library that the AI can reference.
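The embed, index, and query steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Sitepilot's implementation: `embed` is a hypothetical stand-in that counts words, whereas a real embedding model produces dense vectors that capture meaning rather than word overlap, and the "database" here is a plain list. The page URLs and text are made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(pages: dict[str, str]) -> list[tuple[str, Counter]]:
    """Index step: store one vector per crawled page (URL -> text)."""
    return [(url, embed(text)) for url, text in pages.items()]


def search(index: list[tuple[str, Counter]], question: str) -> str:
    """Runtime step: return the URL whose vector is most similar to the question."""
    query = embed(question)
    return max(index, key=lambda item: cosine(query, item[1]))[0]


# Hypothetical crawl output: page URL -> extracted text.
pages = {
    "/pricing": "Pricing plans and monthly cost for each plan",
    "/refunds": "Refund policy: request a refund within 30 days",
}
index = build_index(pages)
best = search(index, "how do I request a refund")
```

Here `best` is `/refunds`, the page the language model would be handed as reference material. Nothing in the sketch touches the language model itself, which is the point of the distinction above.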

This approach has two major advantages over actual fine-tuning. First, it's fast: the whole process takes minutes rather than hours. Second, it's updatable: when your content changes, you re-crawl and the knowledge base reflects the new content within a few minutes, with no retraining required. The "training" metaphor is intuitive but misleading; "indexing" is more accurate.

What content gets crawled

The crawler visits publicly accessible pages starting from your root domain. In practice, this means:

As a rule of thumb, the crawler follows the same rules as search engine crawlers: if Google can index a page, Sitepilot can usually crawl it. If Google can't, Sitepilot can't either, unless you supplement the crawl with manually uploaded content.
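One rule that search engine crawlers follow is robots.txt. Assuming the Sitepilot crawler honours it in the same way (the article implies this but does not spell it out), the sketch below uses Python's standard `urllib.robotparser` to check which URLs a given robots.txt would allow. The rules, the `example.com` URLs, and the helper name `is_crawlable` are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; substitute the rules from your own site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed robots.txt rules allow fetching this URL."""
    return parser.can_fetch(user_agent, url)


docs_ok = is_crawlable("https://example.com/docs/getting-started")
admin_ok = is_crawlable("https://example.com/admin/settings")
```

With these rules the docs page is allowed and the admin page is blocked. If a page you expect the bot to know about turns out to be disallowed, that is one likely reason it is missing from the knowledge base.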

One important nuance: JavaScript-heavy single-page applications (SPAs) can be tricky. If your documentation site renders all content via React with no server-side rendering (SSR) and no static HTML, the crawler may see an empty page or a loading spinner. If you're not sure, view your page's source HTML (right-click → View Page Source in Chrome). If the text content of your page doesn't appear in the source, it's likely JavaScript-rendered and may not be crawled correctly.
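The View Page Source check can also be automated. The sketch below is a rough approximation, not a description of how the Sitepilot crawler works: it strips tags from raw HTML with the standard-library `html.parser` and reports whether a phrase you can see in the browser is present in the source. The function name `phrase_in_source` and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def phrase_in_source(raw_html: str, phrase: str) -> bool:
    """Return True if the phrase appears in the page's static text content."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    # Collapse whitespace and ignore case so formatting differences don't matter.
    text = " ".join(" ".join(extractor.parts).split()).lower()
    return " ".join(phrase.split()).lower() in text


static_page = "<html><body><h1>Refund policy</h1><p>Refunds within 30 days.</p></body></html>"
spa_shell = '<html><body><div id="root"></div><script>render("Refund policy")</script></body></html>'

static_has_text = phrase_in_source(static_page, "Refund policy")
spa_has_text = phrase_in_source(spa_shell, "Refund policy")
```

The static page passes and the SPA shell fails, because its only copy of the phrase lives inside a script. To run this against a live page, pass in the HTML returned by something like `urllib.request.urlopen(url).read().decode()`.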

How to structure your website for the best chatbot results

The quality of your chatbot's answers depends heavily on the quality and structure of your website content. A few structural improvements can dramatically increase answer accuracy:

What to do for content that can't be crawled

Some of your most valuable support content might not be on your public website. Internal wikis, PDF product specs, onboarding guides, and pricing sheets for enterprise tiers all contain information your bot should know about. Sitepilot provides three ways to add this content manually:

Re-crawling after content updates

The most common support chatbot failure mode isn't a technical bug — it's stale content. A customer asks about your new enterprise tier, which launched two weeks ago, and the bot has no idea what they're talking about because the knowledge base was last crawled three months ago.

The fix is simple but requires discipline: re-crawl whenever you make meaningful content updates. "Meaningful" means changes that affect what customers might ask about — new features, changed pricing, updated policies, new documentation pages. Minor copy edits and style tweaks don't require a re-crawl.

Signs that your knowledge base might be stale: the bot answers questions about old features or discontinued plans; it can't answer questions about something you know is on your website; customers say the bot gave them wrong information that was previously correct. Any of these is a trigger to re-crawl immediately.
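One way to support that discipline is to detect content changes automatically rather than rely on memory. The sketch below is a hypothetical helper, not a Sitepilot feature: it fingerprints the text of each page with SHA-256 and lists the pages that were added, removed, or changed since the last crawl. Letter case and whitespace are normalised so that purely cosmetic edits are ignored, although a hash cannot judge whether a wording change is "meaningful".

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash page text after normalising case and whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def pages_needing_recrawl(last_crawl: dict[str, str], current: dict[str, str]) -> list[str]:
    """Compare URL -> text snapshots and return URLs added, removed, or changed."""
    changed = []
    for url in sorted(set(last_crawl) | set(current)):
        old = last_crawl.get(url)
        new = current.get(url)
        if old is None or new is None or fingerprint(old) != fingerprint(new):
            changed.append(url)
    return changed


# Illustrative snapshots: text as of the last crawl vs. text on the site today.
last_crawl = {"/pricing": "Starter and Pro plans", "/about": "Our  story"}
current = {
    "/pricing": "Starter, Pro and Enterprise plans",
    "/about": "our story",
    "/enterprise": "New enterprise tier",
}
stale = pages_needing_recrawl(last_crawl, current)
```

In this example `stale` contains `/enterprise` and `/pricing` but not `/about`, whose only difference is capitalisation and spacing. A non-empty list is a prompt to click Re-crawl.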

Testing your chatbot's knowledge

After every crawl, run a structured knowledge test before you publicise the bot. Here's the method: write down the 10 most common questions you receive via email or your helpdesk. Put each one to the bot exactly as customers word it, then rephrase each one in two or three different ways. Grade each answer as correct, incomplete, or wrong.

For every wrong or incomplete answer, your next task is clear: find where on your website that information should be, improve that content, and re-crawl. Chatbot failures are almost always a content audit in disguise. The questions your bot can't answer are exactly the questions your website doesn't explain clearly enough, so fixing those gaps helps both your bot and your SEO at the same time.
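A spreadsheet is enough for the grading, but a small script makes results easier to compare between crawls. The sketch below assumes you grade each answer by hand as `correct`, `incomplete`, or `wrong`; the `summarise` function and the sample questions are illustrative.

```python
from collections import Counter

GRADES = ("correct", "incomplete", "wrong")


def summarise(results: list[tuple[str, str]]) -> dict:
    """Tally (question, grade) pairs and list the questions that need content work."""
    for question, grade in results:
        if grade not in GRADES:
            raise ValueError(f"unknown grade {grade!r} for {question!r}")
    counts = Counter(grade for _, grade in results)
    return {
        "counts": {g: counts[g] for g in GRADES},
        "accuracy": counts["correct"] / len(results) if results else 0.0,
        "to_fix": [q for q, grade in results if grade != "correct"],
    }


# Hand-graded results from one test run (sample data).
results = [
    ("What does the Pro plan cost?", "correct"),
    ("How much is Pro per month?", "correct"),
    ("Do you offer refunds?", "incomplete"),
    ("Is there an enterprise tier?", "wrong"),
]
report = summarise(results)
```

The `to_fix` list doubles as the content audit backlog: each entry points at a topic the website should explain more clearly before the next re-crawl.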

Frequently asked questions

Will the chatbot access pages behind a login?

No. The crawler can only access publicly available pages — the same pages any visitor or search engine bot could see. Password-protected content, member portals, and API endpoints are not crawled.

How many pages can I crawl?

It depends on your plan. The Starter plan crawls up to 50 pages, the Pro plan up to 500 pages, and the Business plan offers unlimited crawling.

What file formats does Sitepilot support besides web pages?

Besides crawled web pages, Sitepilot accepts raw text pasted directly into the knowledge base editor and uploaded PDF documents. This is useful for internal documentation, product specs, or pricing sheets that aren't published on your website.

How do I update the chatbot after I change my website?

From your bot's settings, click 'Re-crawl'. The system will re-index your pages and update the knowledge base within a few minutes. Old content is replaced automatically.
