What is a crawler?

A crawler (or "web scraper") is an automated tool that navigates and analyzes the pages of your website to extract content and information. The assistant then uses this content to answer user questions.

How a crawler works

The crawling process happens in several phases (see the sketch after this list):

  1. Initial scan: The crawler visits the URL you specified
  2. Page analysis: Extracts the textual content of the page
  3. Link discovery: Identifies other internal links to follow
  4. Recursive navigation: Visits linked pages following the rules you set
  5. Content processing: Cleans and organizes the extracted text
  6. Database update: Saves content to the knowledge base
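
To make these phases concrete, here is a minimal sketch of the same loop in Python. It assumes the third-party requests and beautifulsoup4 libraries, and save_to_knowledge_base is a hypothetical stand-in for the real storage step; the actual crawler is considerably more robust than this.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def save_to_knowledge_base(url: str, text: str) -> None:
        """Hypothetical stand-in for phase 6 (the real storage step)."""
        print(f"saved {len(text)} characters from {url}")

    def crawl(start_url: str, max_pages: int = 50) -> None:
        domain = urlparse(start_url).netloc
        queue, seen = [start_url], {start_url}
        pages = 0
        while queue and pages < max_pages:
            url = queue.pop(0)                        # 1. visit the URL
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue                              # skip unreachable pages
            soup = BeautifulSoup(resp.text, "html.parser")
            text = soup.get_text(" ", strip=True)     # 2. page analysis
            for a in soup.find_all("a", href=True):   # 3. link discovery
                link = urljoin(url, a["href"]).split("#")[0]
                # 4. recursive navigation, restricted to the same site
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
            cleaned = " ".join(text.split())          # 5. content processing
            save_to_knowledge_base(url, cleaned)      # 6. database update
            pages += 1

    crawl("https://example.com")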

Benefits of automatic crawling

  • Automatic updates: Content is updated periodically without manual intervention
  • Completeness: Can analyze hundreds of pages in a short time
  • Consistency: Keeps the assistant's knowledge synchronized with your site's content
  • Efficiency: You don't have to manually upload every single page

Types of content the crawler can extract

  • Main text of web pages
  • Headings and subheadings (H1, H2, H3...)
  • Paragraph and list content
  • Page meta descriptions
  • Article and blog post content
  • FAQ and support sections
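
As a rough illustration, the sketch below pulls these content types out of a single HTML page using the beautifulsoup4 library (an assumption for the example, not necessarily what the product uses internally); the HTML snippet is invented for demonstration.

    from bs4 import BeautifulSoup

    html = """
    <html>
      <head><meta name="description" content="Example meta description."></head>
      <body>
        <h1>Main title</h1>
        <h2>A subtitle</h2>
        <p>A paragraph of body text.</p>
        <ul><li>A list item</li></ul>
      </body>
    </html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Headings and subheadings (H1, H2, H3...)
    headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

    # Paragraph and list content
    body_text = [t.get_text(strip=True) for t in soup.find_all(["p", "li"])]

    # Page meta description
    meta = soup.find("meta", attrs={"name": "description"})
    meta_description = meta["content"] if meta else None

    print(headings)          # ['Main title', 'A subtitle']
    print(body_text)         # ['A paragraph of body text.', 'A list item']
    print(meta_description)  # Example meta description.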

Crawling limitations

  • Public content only: The crawler can only access publicly visible pages
  • Static content only: Cannot extract content generated dynamically by complex JavaScript
  • Respects robots.txt: Follows the crawling rules in your site's robots.txt file (see the check sketched after this list)
  • Text only: Does not process images, videos or other media
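
As an illustration of the robots.txt rule above, here is a minimal check using Python's standard-library parser; the URLs and user agent name are placeholders, not the product's actual identifiers.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # A compliant crawler checks every URL before requesting it.
    url = "https://example.com/some/page"
    if rp.can_fetch("MyAssistantBot", url):
        print("allowed: the page can be crawled")
    else:
        print("disallowed by robots.txt: the page is skipped")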

When to use crawlers

Ideal for:

  • Websites with many informational pages
  • Blogs and news sections
  • Online technical documentation
  • Product catalogs
  • FAQ and support sections

Less suitable for:

  • Sites with predominantly multimedia content
  • Dynamic web applications
  • Login-protected content
  • Sites with a very complex structure