Crawler management and updates

Once configured, the crawler needs periodic management to keep your assistant's knowledge base current and performing well.

Crawler states

Possible states
  • Ready: Configured and ready to be started
  • Processing: A scan is currently in progress
  • Completed: Last scan completed successfully
  • Error: Problems during last execution
  • Suspended: Temporarily disabled

Status information

For each crawler you can see:

  • Last execution: Date and time of last scan
  • Pages processed: Number of pages scanned in last execution
  • Successful pages: Pages processed correctly
  • Detected errors: Pages with problems
  • Next execution: When the next automatic scan is scheduled
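If you track crawlers in your own scripts or reports, the states and status fields above can be modeled roughly as follows. This is only a sketch: the class names, field names, and the needs_attention rule are hypothetical and are not part of the product.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class CrawlerState(Enum):
    """The five states listed above (the enum itself is illustrative)."""
    READY = "Ready"
    PROCESSING = "Processing"
    COMPLETED = "Completed"
    ERROR = "Error"
    SUSPENDED = "Suspended"


@dataclass
class CrawlerStatus:
    """One crawler's status information; field names are assumptions."""
    state: CrawlerState
    last_execution: Optional[datetime]   # None if the crawler never ran
    pages_processed: int
    successful_pages: int
    detected_errors: int
    next_execution: Optional[datetime]   # None if nothing is scheduled

    def needs_attention(self) -> bool:
        # Assumed rule: flag crawlers in the Error state or with failed pages.
        return self.state is CrawlerState.ERROR or self.detected_errors > 0


status = CrawlerStatus(
    state=CrawlerState.COMPLETED,
    last_execution=datetime(2024, 1, 1, 3, 0),
    pages_processed=120,
    successful_pages=118,
    detected_errors=2,
    next_execution=datetime(2024, 1, 8, 3, 0),
)
print(status.state.value, status.needs_attention())
```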

Update modes

Automatic update

The system can automatically update content according to the set frequency:

  • Daily: Every 24 hours, ideal for sites with dynamic content
  • Weekly: Every 7 days, suitable for most sites
  • Monthly: Every 30 days, for more static content
  • Custom: Set specific intervals (e.g. every 3 days)

Manual update

You can always start a manual scan:

  • Click "Update now" in crawler management
  • Useful after important site updates
  • Doesn't interfere with automatic scheduling
  • Allows testing configuration changes
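To estimate when the next automatic scan is due for a given frequency, the intervals listed under Automatic update can be expressed as a small calculation. This is a minimal sketch, assuming the interval is counted from the last automatic scan; the function name and the frequency keys are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals taken from the list of automatic update frequencies.
INTERVALS = {
    "daily": timedelta(hours=24),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
}


def next_execution(last_execution: datetime,
                   frequency: str,
                   custom_days: Optional[int] = None) -> datetime:
    """Return the expected time of the next automatic scan."""
    if frequency == "custom":
        if custom_days is None or custom_days < 1:
            raise ValueError("custom frequency needs a positive number of days")
        return last_execution + timedelta(days=custom_days)
    if frequency not in INTERVALS:
        raise ValueError(f"unknown frequency: {frequency}")
    return last_execution + INTERVALS[frequency]


last_scan = datetime(2024, 3, 1, 2, 0)
print(next_execution(last_scan, "weekly"))
print(next_execution(last_scan, "custom", custom_days=3))
```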

Performance monitoring

Metrics to check
  • Execution time: How long the complete scan takes
  • Success rate: Percentage of pages processed correctly
  • New pages found: Pages added since last scan
  • Modified pages: Content updated since last time
  • Recurring errors: Problems that repeat frequently

Result interpretation
  • 100% success: Every page was processed correctly
  • 95-99% success: Normal; a few errors are to be expected
  • 80-94% success: Review the most frequent errors
  • Below 80%: Likely a configuration or site problem
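The success rate and the bands above can be worked out from the "Successful pages" and "Pages processed" figures. The sketch below assumes the rate is successful pages divided by pages processed, and treats the bands as continuous thresholds (so 99.5% counts as normal); the function names are hypothetical.

```python
def success_rate(successful_pages: int, pages_processed: int) -> float:
    """Percentage of pages processed correctly in one execution."""
    if pages_processed <= 0:
        raise ValueError("pages_processed must be positive")
    return 100.0 * successful_pages / pages_processed


def interpret(rate: float) -> str:
    """Map a success rate to the interpretation bands listed above."""
    if rate >= 100:
        return "Every page was processed correctly"
    if rate >= 95:
        return "Normal; a few errors are to be expected"
    if rate >= 80:
        return "Review the most frequent errors"
    return "Likely a configuration or site problem"


rate = success_rate(successful_pages=118, pages_processed=120)
print(f"{rate:.1f}%: {interpret(rate)}")
```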

Error management

Common error types
  • 404 Not Found: Page not found (normal if page was removed)
  • 403 Forbidden: Access denied (possible site restriction)
  • 500 Server Error: The site's server failed (usually a temporary problem)
  • Timeout: The page took too long to load
  • Empty content: The page contains no meaningful text

Corrective actions
  • 404 errors: Remove invalid URLs from configuration
  • 403 errors: Check the site's security settings
  • 500 errors: Usually resolve on their own
  • Recurring timeouts: Increase time limit or exclude slow pages
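When reviewing an error report by hand or in a script, the error types and corrective actions above can be paired up as a simple lookup. This is an illustrative sketch only: the function names are hypothetical, any 5xx status is assumed to count as a server error, and "empty content" is assumed to mean a page whose text is blank.

```python
from typing import Optional

# Corrective actions as listed above, keyed by error type.
CORRECTIVE_ACTIONS = {
    "404 Not Found": "Remove invalid URLs from configuration",
    "403 Forbidden": "Check site security settings",
    "500 Server Error": "Usually resolves automatically",
    "Timeout": "Increase time limit or exclude slow pages",
}


def classify_error(status_code: Optional[int],
                   timed_out: bool,
                   text: str) -> Optional[str]:
    """Return the error type for one page, or None if the page is fine."""
    if timed_out:
        return "Timeout"
    if status_code == 404:
        return "404 Not Found"
    if status_code == 403:
        return "403 Forbidden"
    if status_code is not None and 500 <= status_code <= 599:
        return "500 Server Error"  # assumption: all 5xx codes grouped here
    if not text.strip():
        return "Empty content"     # assumption: blank text means empty
    return None


def suggest_action(error_type: str) -> str:
    """Look up the corrective action, if one is listed for this type."""
    return CORRECTIVE_ACTIONS.get(error_type, "No corrective action listed")


error = classify_error(status_code=None, timed_out=True, text="")
print(error, "->", suggest_action(error))
```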

Performance optimization

Reducing server load
  • Request interval: Set pauses between page scans
  • Scan times: Schedule during low traffic hours
  • Simultaneous page limit: Cap how many pages are fetched in parallel so the server isn't overloaded

Quota management
  • Monitor usage: Check how many pages you've processed this month
  • Prioritize content: Scan the most important pages first
  • Remove obsolete content: Drop pages that are no longer relevant
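To see how a request interval, content priorities, and a monthly quota interact, the following sketch plans a scan on paper: it ranks pages by priority, keeps only as many as the remaining quota allows, and spaces their start times by the request interval. The Page class, its priority field, and plan_scan are hypothetical names for illustration, and the URLs are placeholders; the crawler's actual scheduling may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Page:
    url: str
    priority: int  # hypothetical field: higher means more important


def plan_scan(pages: List[Page],
              remaining_quota: int,
              request_interval: float) -> List[Tuple[float, str]]:
    """Return (start offset in seconds, url) pairs for the pages to scan."""
    # Most important pages first; ties keep their original order.
    ranked = sorted(pages, key=lambda p: p.priority, reverse=True)
    # Never plan more pages than the quota still allows.
    selected = ranked[:max(remaining_quota, 0)]
    # One request every `request_interval` seconds.
    return [(i * request_interval, p.url) for i, p in enumerate(selected)]


pages = [
    Page("https://example.com/blog", priority=1),
    Page("https://example.com/pricing", priority=3),
    Page("https://example.com/docs", priority=2),
]
for offset, url in plan_scan(pages, remaining_quota=2, request_interval=1.5):
    print(f"t+{offset:.1f}s  {url}")
```

With a remaining quota of 2, the lowest-priority page is left out of this scan rather than consuming quota.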

Periodic maintenance

Weekly checks
  • Check the status of all active crawlers
  • Check the number of errors in last execution
  • Monitor plan resource usage

Monthly checks
  • Review pages with most recurring errors
  • Evaluate adding new site sections
  • Remove crawlers no longer needed
  • Optimize inclusion/exclusion rules

Quarterly checks
  • Analyze long-term statistics
  • Verify alignment with assistant objectives
  • Consider changes to update frequency
  • Evaluate plan upgrade if necessary

Management best practices

  • Documentation: Keep a record of changes and the reasons for them
  • Test changes: Try changes on a small scale before applying them broadly
  • Configuration backup: Save settings before important changes
  • Proactive monitoring: Don't wait for problems to appear; prevent them