What is a crawler?

A crawler (or "web scraper") is an automated tool that navigates and analyzes the pages of your website to extract content and information. The assistant then uses this content to answer user questions.

How a crawler works

The crawling process happens in several phases (see the sketch after this list):

  1. Initial scan: The crawler visits the URL you specified
  2. Page analysis: Extracts the textual content of the page
  3. Link discovery: Identifies other internal links to follow
  4. Recursive navigation: Visits linked pages following the rules you set
  5. Content processing: Cleans and organizes the extracted text
  6. Database update: Saves content to the knowledge base
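
To make these phases concrete, here is a minimal sketch of the same loop in Python. It assumes the third-party requests and beautifulsoup4 libraries, and save_to_knowledge_base is a hypothetical stand-in for the real storage step; the actual crawler is considerably more robust than this.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def save_to_knowledge_base(url: str, text: str) -> None:
        """Hypothetical stand-in for phase 6 (the real storage step)."""
        print(f"saved {len(text)} characters from {url}")

    def crawl(start_url: str, max_pages: int = 50) -> None:
        domain = urlparse(start_url).netloc
        queue, seen = [start_url], {start_url}
        pages = 0
        while queue and pages < max_pages:
            url = queue.pop(0)                        # 1. visit the URL
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue                              # skip unreachable pages
            soup = BeautifulSoup(resp.text, "html.parser")
            text = soup.get_text(" ", strip=True)     # 2. page analysis
            for a in soup.find_all("a", href=True):   # 3. link discovery
                link = urljoin(url, a["href"]).split("#")[0]
                # 4. recursive navigation, restricted to the same site
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
            cleaned = " ".join(text.split())          # 5. content processing
            save_to_knowledge_base(url, cleaned)      # 6. database update
            pages += 1

    crawl("https://example.com")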

Benefits of automatic crawling

  • Automatic updates: Content is updated periodically without manual intervention
  • Completeness: Can analyze hundreds of pages in a short time
  • Consistency: Keeps the assistant's knowledge synchronized with your site's content
  • Efficiency: You don't have to manually upload every single page

Types of content the crawler can extract

  • Main text of web pages
  • Headings and subheadings (H1, H2, H3...)
  • Paragraph and list content
  • Page meta descriptions
  • Article and blog post content
  • FAQ and support sections
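
As a rough illustration, the sketch below pulls these content types out of a single HTML page using the beautifulsoup4 library (an assumption for the example, not necessarily what the product uses internally); the HTML snippet is invented for demonstration.

    from bs4 import BeautifulSoup

    html = """
    <html>
      <head><meta name="description" content="Example meta description."></head>
      <body>
        <h1>Main title</h1>
        <h2>A subtitle</h2>
        <p>A paragraph of body text.</p>
        <ul><li>A list item</li></ul>
      </body>
    </html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Headings and subheadings (H1, H2, H3...)
    headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

    # Paragraph and list content
    body_text = [t.get_text(strip=True) for t in soup.find_all(["p", "li"])]

    # Page meta description
    meta = soup.find("meta", attrs={"name": "description"})
    meta_description = meta["content"] if meta else None

    print(headings)          # ['Main title', 'A subtitle']
    print(body_text)         # ['A paragraph of body text.', 'A list item']
    print(meta_description)  # Example meta description.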

Crawling limitations

  • Public content only: The crawler can only access publicly visible pages
  • Static content only: Cannot extract content generated dynamically by complex JavaScript
  • Respects robots.txt: Follows the crawling rules in your site's robots.txt file (see the check sketched after this list)
  • Text only: Does not process images, videos or other media
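
As an illustration of the robots.txt rule above, here is a minimal check using Python's standard-library parser; the URLs and user agent name are placeholders, not the product's actual identifiers.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # A compliant crawler checks every URL before requesting it.
    url = "https://example.com/some/page"
    if rp.can_fetch("MyAssistantBot", url):
        print("allowed: the page can be crawled")
    else:
        print("disallowed by robots.txt: the page is skipped")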

When to use crawlers

Ideal for:

  • Websites with many informational pages
  • Blogs and news sections
  • Online technical documentation
  • Product catalogs
  • FAQ and support sections

Less suitable for:

  • Sites with predominantly multimedia content
  • Dynamic web applications
  • Login-protected content
  • Sites with a very complex structure