Detailed monitoring of crawled pages lets you verify crawl quality and identify issues or optimization opportunities.
Accessing monitoring
- Go to the "Crawler" section in the control panel
- Select the crawler you want to monitor
- Click on "View pages" or "Scan details"
- Choose between "Processed pages" or "Pages with errors"
Available information for each page
Successfully processed pages
For each successfully scanned page, you can see the following (a schematic record follows this list):
- Full URL: The page address
- Page title: The extracted <title> tag
- Scan date: When it was last processed
- Content size: Amount of extracted text
- HTTP code: Usually 200 (success)
- Processing time: How long the scan took
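These fields map naturally onto a simple record. The Python sketch below shows one way such a record could look when exported; the field names and sample values are illustrative assumptions, not the product's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProcessedPage:
    """Illustrative record for a successfully scanned page (field names assumed)."""
    url: str               # Full URL of the page
    title: str             # Extracted <title> tag
    scanned_at: datetime   # When the page was last processed
    content_chars: int     # Amount of extracted text, in characters
    http_status: int       # Usually 200 (success)
    processing_ms: int     # How long the scan took, in milliseconds

page = ProcessedPage(
    url="https://example.com/docs/getting-started",
    title="Getting started",
    scanned_at=datetime(2024, 5, 12, 9, 30),
    content_chars=4820,
    http_status=200,
    processing_ms=640,
)
print(page.title, page.http_status)
```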
Pages with errors
For pages that generated errors, you can see:
- Problematic URL: The address that caused the error
- Error type: HTTP code (404, 500, etc.) or problem type
- Error message: Detailed description of the problem
- Last attempt date: When the error occurred
- Number of attempts: How many times the crawler has retried the page (see the retry sketch after this list)
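The error type and attempt count together determine whether a retry is worthwhile. The product's actual retry policy is internal; the sketch below only illustrates the usual distinction between transient and permanent errors, with assumed codes and limits.

```python
# Assumed classification and retry limit, for illustration only.
TRANSIENT_CODES = {429, 500, 502, 503, 504}   # temporary problems: retrying can help
PERMANENT_CODES = {401, 403, 404, 410}        # retrying rarely helps: fix or exclude the URL
MAX_ATTEMPTS = 3

def should_retry(http_code: int, attempts: int) -> bool:
    """Return True if another attempt is likely to succeed."""
    if attempts >= MAX_ATTEMPTS or http_code in PERMANENT_CODES:
        return False
    return http_code in TRANSIENT_CODES

print(should_retry(503, attempts=1))  # True: temporary server error
print(should_retry(404, attempts=1))  # False: page is gone
```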
Filters and search
Available filters
- By status: Success only, errors only, or all
- By date: Pages scanned in a specific period
- By size: Pages with a large or small amount of extracted content
- By HTTP code: Filter by specific response codes
Text search
- By URL: Find pages with specific URLs
- By title: Search pages with particular titles
- By content: Find pages containing certain words (a filtering sketch follows this list)
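Filters combine with AND logic, so narrowing by status, date, and size at the same time is often the quickest way to isolate a problem. The sketch below reproduces that behavior over exported page records; the record layout and field names are assumptions.

```python
from datetime import datetime

# Assumed export format: one dict per page.
pages = [
    {"url": "https://example.com/docs/intro", "status": "success",
     "scanned_at": datetime(2024, 5, 10), "content_chars": 4200},
    {"url": "https://example.com/tag/news", "status": "success",
     "scanned_at": datetime(2024, 5, 11), "content_chars": 120},
]

def filter_pages(records, status=None, since=None, min_chars=None, url_contains=None):
    """Combine criteria the way the panel's filters do (all conditions must match)."""
    out = records
    if status is not None:
        out = [p for p in out if p["status"] == status]
    if since is not None:
        out = [p for p in out if p["scanned_at"] >= since]
    if min_chars is not None:
        out = [p for p in out if p["content_chars"] >= min_chars]
    if url_contains is not None:
        out = [p for p in out if url_contains in p["url"]]
    return out

print(filter_pages(pages, status="success", min_chars=300))  # keeps only the intro page
```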
Quality analysis
Indicators of useful pages
- Substantial content: Pages with at least 200-300 words
- Clear titles: Meaningful and descriptive title tags
- HTML structure: Proper use of H1, H2, paragraphs
- Unique content: Text not duplicated from other pages
Indicators of problematic pages
- Poor content: Fewer than 50 words of extracted text
- Mainly navigation: Mostly menus and links, with little real content
- Duplicate content: Identical to pages already scanned (a quality-check sketch follows this list)
- Recurring errors: Pages that consistently fail when accessed
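The two most common quality problems, thin content and duplicates, are easy to check mechanically. The sketch below applies the word-count thresholds above and detects duplicates by hashing the normalized text; it illustrates the idea and is not the product's actual scoring.

```python
import hashlib

def quality_flags(text: str, seen_hashes: set) -> list:
    """Flag poor or duplicate content using the thresholds described above."""
    flags = []
    words = text.split()
    if len(words) < 50:
        flags.append("poor content (< 50 words)")
    elif len(words) < 200:
        flags.append("borderline content (< 200 words)")
    digest = hashlib.sha256(" ".join(words).lower().encode()).hexdigest()
    if digest in seen_hashes:
        flags.append("duplicate of an already scanned page")
    seen_hashes.add(digest)
    return flags

seen = set()
print(quality_flags("Home About Contact", seen))  # poor content
print(quality_flags("Home About Contact", seen))  # poor content + duplicate
```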
Available actions
Single page management
- Exclusion: Add URLs to exclusion rules
- Forced re-scan: Queue the page for immediate reprocessing
- Content view: See extracted text
- Direct opening: Visit the original page
Bulk actions
- Multiple exclusion: Exclude several pages at once (a pattern-based exclusion sketch follows this list)
- Group re-scan: Reprocess multiple pages
- Export: Download URL lists for external analysis
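Bulk exclusions are usually expressed as URL patterns rather than individual addresses. The sketch below uses simple wildcards to decide which URLs a rule covers; the rule syntax the product accepts may differ, but the idea is the same.

```python
from fnmatch import fnmatch

# Assumed wildcard patterns; adapt to the product's own rule syntax.
EXCLUSION_PATTERNS = [
    "https://example.com/tag/*",     # tag archives: navigation only
    "https://example.com/search*",   # internal search result pages
]

def is_excluded(url: str) -> bool:
    """Return True if any exclusion pattern matches the URL."""
    return any(fnmatch(url, pattern) for pattern in EXCLUSION_PATTERNS)

urls = [
    "https://example.com/docs/setup",
    "https://example.com/tag/news",
]
print([u for u in urls if not is_excluded(u)])  # only the docs page survives
```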
Pattern interpretation
Positive patterns
- Steady growth: Number of processed pages increases over time
- Stable or decreasing errors: Indicates optimal configuration
- Quality content: Most pages have substantial content
- Complete coverage: All important sections are represented
Problematic patterns
- Many 404 errors: The site has removed or moved many pages
- Recurring 403/500 errors: Possible server configuration or access-blocking issues
- Widespread poor content: The crawler is picking up pages with little useful text
- Stagnation: No new pages found for a long time (a detection sketch follows this list)
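A quick way to spot these patterns is to compare a few summary numbers across crawl runs. The run records and thresholds below are illustrative assumptions about what an export might contain.

```python
# Assumed per-run summaries, e.g. from an exported statistics report.
runs = [
    {"date": "2024-05-01", "pages_ok": 480, "errors": 12, "new_pages": 35},
    {"date": "2024-05-08", "pages_ok": 490, "errors": 55, "new_pages": 0},
]

def flag_run(run, max_error_rate=0.05):
    """Flag runs that show the problematic patterns described above."""
    total = run["pages_ok"] + run["errors"]
    flags = []
    if total and run["errors"] / total > max_error_rate:
        flags.append("error rate above threshold: check the 404/403/500 breakdown")
    if run["new_pages"] == 0:
        flags.append("stagnation: no new pages found in this run")
    return flags

for run in runs:
    print(run["date"], flag_run(run))
```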
Data-based optimization
Rule improvement
- Targeted exclusions: Exclude URL patterns that produce only poor content (see the sketch after this list)
- Specific inclusions: Add rules to capture quality content
- Depth balancing: Adjust depth based on results
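Targeted exclusions are easiest to derive from the monitoring data itself, for example by looking for URL path prefixes whose pages are mostly thin. The records and threshold below are illustrative.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed export: URL plus amount of extracted text per page.
records = [
    {"url": "https://example.com/tag/a", "content_chars": 90},
    {"url": "https://example.com/tag/b", "content_chars": 110},
    {"url": "https://example.com/docs/setup", "content_chars": 5300},
]

# Count thin pages by their first path segment to find exclusion candidates.
thin_prefixes = Counter(
    "/" + urlparse(r["url"]).path.strip("/").split("/")[0]
    for r in records
    if r["content_chars"] < 300
)
print(thin_prefixes.most_common(3))  # e.g. [('/tag', 2)] -> candidate for a targeted exclusion
```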
Priority management
- Key pages: Ensure most important pages are always scanned
- Differentiated frequency: Scan more often sections that change frequently
- Smart limits: Focus crawl resources on the most useful pages
Reports and export
Automatic reports
- Weekly summary: Statistics of recent scans
- Error alerts: Notifications when errors exceed thresholds
- New content: List of newly found pages
Data export
- Page CSV: Complete list of scanned pages for external analysis
- Error reports: Failing URLs and error details for troubleshooting (a parsing sketch follows this list)
- Temporal statistics: Performance trends over time
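Exported data lends itself to quick offline analysis. The sketch below tallies error codes from an exported error report; the file name and column headers are assumptions about the CSV layout.

```python
import csv
from collections import Counter

def summarize_errors(path: str) -> Counter:
    """Count error occurrences by HTTP code in an exported error CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return Counter(row["http_code"] for row in csv.DictReader(f))

# Example usage (assumed file name and columns):
# print(summarize_errors("crawler_errors.csv").most_common(5))
```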