Once configured, the crawler needs periodic management to keep your assistant's knowledge base up to date and performing well.
Crawler states
Possible states
- Ready: Configured and ready to be started
- Processing: Scan currently in progress
- Completed: Last scan completed successfully
- Error: Problems during last execution
- Suspended: Temporarily disabled
Status information
For each crawler you can see:
- Last execution: Date and time of last scan
- Pages processed: Number of pages scanned in last execution
- Successful pages: Pages processed correctly
- Detected errors: Pages with problems
- Next execution: When the next automatic scan is scheduled
Update modes
Automatic update
The system can automatically update content according to the set frequency:
- Daily: Every 24 hours, ideal for sites with dynamic content
- Weekly: Every 7 days, suitable for most sites
- Monthly: Every 30 days, for more static content
- Custom: Set specific intervals (e.g. every 3 days)
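As a rough illustration of how the "Next execution" time follows from the chosen frequency, here is a minimal Python sketch. The names `FREQUENCY_INTERVALS`, `next_execution`, and `custom_days` are hypothetical and not part of the product; the sketch simply assumes the intervals listed above (24 hours, 7 days, 30 days, or a custom number of days).

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical mapping from the frequency names above to fixed intervals.
FREQUENCY_INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
}


def next_execution(last_execution: datetime, frequency: str,
                   custom_days: Optional[int] = None) -> datetime:
    """Return when the next automatic scan would be due (illustrative only)."""
    if frequency == "custom":
        if custom_days is None or custom_days < 1:
            raise ValueError("custom frequency needs custom_days >= 1")
        return last_execution + timedelta(days=custom_days)
    if frequency not in FREQUENCY_INTERVALS:
        raise ValueError(f"unknown frequency: {frequency!r}")
    return last_execution + FREQUENCY_INTERVALS[frequency]
```

For example, under these assumptions a crawler that last ran on January 1 with a custom interval of 3 days would next be due on January 4.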
Manual update
You can always start a manual scan:
- Click "Update now" in crawler management
- Useful after important site updates
- Doesn't interfere with automatic scheduling
- Allows testing configuration changes
Performance monitoring
Metrics to check
- Execution time: How long the complete scan takes
- Success rate: Percentage of pages processed correctly
- New pages found: Pages added since last scan
- Modified pages: Content updated since last time
- Recurring errors: Problems that repeat frequently
Result interpretation
- 100% success: Every page was processed correctly
- 95-99% success: Normal, a small number of errors is expected
- 80-94% success: Check the most frequent errors
- Below 80%: Possible configuration or site problems
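The bands above can be applied mechanically once you know the "Pages processed" and "Successful pages" figures. The following Python sketch is one possible way to do that; the function names are hypothetical, and it assumes fractional rates fall into the band whose lower bound they reach (so 94.5% counts as 80-94%).

```python
def success_rate(successful_pages: int, pages_processed: int) -> float:
    """Percentage of pages processed correctly in one execution."""
    if pages_processed <= 0:
        raise ValueError("pages_processed must be positive")
    return 100.0 * successful_pages / pages_processed


def interpret(rate: float) -> str:
    """Map a success rate (0-100) to the guidance bands listed above."""
    if rate >= 100.0:
        return "all pages processed"
    if rate >= 95.0:
        return "normal"
    if rate >= 80.0:
        return "check the most frequent errors"
    return "possible configuration or site problems"
```

With these definitions, a scan with 188 successful pages out of 200 gives a rate of 94.0, which lands in the "check the most frequent errors" band.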
Error management
Common error types
- 404 Not Found: Page not found (normal if page was removed)
- 403 Forbidden: Access denied (possible site restriction)
- 500 Server Error: Internal server error (usually a temporary site problem)
- Timeout: Page too slow to load
- Empty content: Page without meaningful text
Corrective actions
- 404 errors: Remove invalid URLs from configuration
- Access errors: Check site security settings
- Server errors: Usually resolve automatically
- Recurring timeouts: Increase time limit or exclude slow pages
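If you export or copy the error list from a scan, grouping it by error type makes it easier to pick the right corrective action. The sketch below assumes a hypothetical list of `(url, error_type)` pairs with type labels such as `"404"` or `"timeout"`; the crawler's actual export format may differ. The fallback text for unlisted types is also an assumption.

```python
from collections import Counter
from typing import List, Tuple

# Corrective actions from the list above, keyed by hypothetical error labels.
SUGGESTED_ACTIONS = {
    "404": "Remove invalid URLs from configuration",
    "403": "Check site security settings",
    "500": "Usually resolves automatically",
    "timeout": "Increase time limit or exclude slow pages",
}


def summarize_errors(errors: List[Tuple[str, str]]) -> List[Tuple[str, int, str]]:
    """Count errors per type, most frequent first, with a suggested action."""
    counts = Counter(error_type for _url, error_type in errors)
    return [
        # Types without a listed action fall back to a generic note (assumption).
        (error_type, count, SUGGESTED_ACTIONS.get(error_type, "Review manually"))
        for error_type, count in counts.most_common()
    ]
```

Sorting by frequency puts recurring errors at the top, which matches the advice to look at the most frequent errors first.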
Performance optimization
Reducing server load
- Request interval: Set pauses between page scans
- Scan times: Schedule during low traffic hours
- Simultaneous page limit: Cap how many pages are fetched at once so the server is not overloaded
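To make the request interval concrete, here is a minimal Python sketch of a loop that pauses between page fetches. It is not how the crawler is implemented; `crawl_politely`, `fetch`, and `request_interval` are hypothetical names, and pages are fetched one at a time, which corresponds to a simultaneous page limit of one.

```python
import time
from typing import Callable, Iterable, List


def crawl_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   request_interval: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch pages sequentially, pausing request_interval seconds between them."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(request_interval)
        results.append(fetch(url))
    return results
```

The `sleep` parameter is injectable only so the pause can be replaced in tests; in normal use the default `time.sleep` applies. A longer interval lowers the load on the site at the cost of a longer execution time.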
Quota management
- Monitor usage: Check how many pages you've processed in the month
- Prioritize content: Scan the most important pages first
- Remove obsolete content: Delete pages that are no longer relevant
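The quota advice above amounts to two small calculations: how much of the monthly allowance is used, and which pages to scan with what remains. The Python sketch below illustrates both under stated assumptions: the function names, the numeric priority values, and the idea of a fixed monthly page quota are all hypothetical placeholders for whatever your plan actually defines.

```python
from typing import List, Tuple


def quota_used_percent(pages_used: int, monthly_quota: int) -> float:
    """Share of the monthly page quota already consumed, as a percentage."""
    if monthly_quota <= 0:
        raise ValueError("monthly_quota must be positive")
    return 100.0 * pages_used / monthly_quota


def select_pages(pages: List[Tuple[str, int]], remaining_quota: int) -> List[str]:
    """Pick the highest-priority pages that fit in the remaining quota.

    pages holds (url, priority) pairs; a higher priority is scanned first.
    """
    ranked = sorted(pages, key=lambda page: page[1], reverse=True)
    return [url for url, _priority in ranked[:max(remaining_quota, 0)]]
```

For instance, with two pages of quota left, `select_pages([("/blog", 1), ("/pricing", 3), ("/docs", 2)], 2)` keeps `/pricing` and `/docs` and leaves `/blog` for a later scan.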
Periodic maintenance
Weekly checks
- Check the status of all active crawlers
- Check the number of errors in last execution
- Monitor plan resource usage
Monthly checks
- Review the pages with the most recurring errors
- Evaluate adding new site sections
- Remove crawlers no longer needed
- Optimize inclusion/exclusion rules
Quarterly checks
- Analyze long-term statistics
- Verify alignment with assistant objectives
- Consider changes to update frequency
- Evaluate plan upgrade if necessary
Management best practices
- Documentation: Keep track of changes and reasons
- Test changes: Try changes on a small scale before applying them broadly
- Configuration backup: Save settings before important changes
- Proactive monitoring: Don't wait for problems, prevent them