Monitoring Crawled Pages

Detailed monitoring of crawled pages allows you to verify crawling quality and identify potential issues or optimization opportunities.

Accessing monitoring

  1. Go to the "Crawler" section in the control panel
  2. Select the crawler you want to monitor
  3. Click on "View pages" or "Scan details"
  4. Choose between "Processed pages" and "Pages with errors"

Available information for each page

Successfully processed pages

For each successfully scanned page you can see the following (a short record sketch appears after the list):

  • Full URL: The page address
  • Page title: The extracted <title> tag
  • Scan date: When it was last processed
  • Content size: Amount of extracted text
  • HTTP code: Usually 200 (success)
  • Processing time: How long the scan took
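
If you mirror these fields in your own tooling, they map naturally onto a simple record. A minimal sketch in Python; the field names are illustrative and not the product's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CrawledPage:
    """One successfully processed page, mirroring the dashboard fields."""
    url: str                  # full URL of the page
    title: str                # extracted <title> tag
    scanned_at: datetime      # date of the last scan
    content_length: int       # amount of extracted text, in characters
    http_status: int          # usually 200
    processing_ms: int        # how long the scan took, in milliseconds

page = CrawledPage(
    url="https://example.com/docs/getting-started",
    title="Getting Started",
    scanned_at=datetime(2024, 5, 14, 9, 30),
    content_length=4812,
    http_status=200,
    processing_ms=640,
)
print(f"{page.title}: {page.content_length} chars, HTTP {page.http_status}")
```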

Pages with errors

For pages that generated errors, you can see the following (a retry-handling sketch appears after the list):

  • Problematic URL: The address that caused the error
  • Error type: HTTP status code (404, 500, etc.) or failure category
  • Error message: Detailed description of the problem
  • Last attempt date: When the error occurred
  • Number of attempts: How many times the system tried
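
The attempt count is what lets you decide when a failing URL is still worth retrying. A small sketch of one possible retry policy built on these fields; the record type and thresholds are illustrative, not part of the product:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CrawlError:
    """One failed page, mirroring the error-view fields."""
    url: str
    error_type: str        # e.g. "404", "500", "timeout"
    message: str
    last_attempt: datetime
    attempts: int

def should_retry(err: CrawlError, max_attempts: int = 3) -> bool:
    """Retry transient failures; give up on permanent ones or after too many tries."""
    permanent = {"404", "410"}          # the page is gone, retrying will not help
    if err.error_type in permanent:
        return False
    return err.attempts < max_attempts

errors = [
    CrawlError("https://example.com/old-page", "404", "Not Found", datetime(2024, 5, 14), 2),
    CrawlError("https://example.com/api/docs", "timeout", "Read timed out", datetime(2024, 5, 14), 1),
]
for err in errors:
    print(err.url, "->", "retry" if should_retry(err) else "give up")
```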

Filters and search

Available filters
  • By status: Success only, errors only, or all
  • By date: Pages scanned in a specific period
  • By size: Pages with a large or small amount of extracted content
  • By HTTP code: Filter by specific response codes

Text search
  • By URL: Find pages with specific URLs
  • By title: Search pages with particular titles
  • By content: Find pages containing certain words
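
If you export the page list as CSV (see "Reports and export" below), the same filters are easy to reproduce offline. A sketch assuming hypothetical column names (url, title, http_status, content_length, scanned_at):

```python
import csv
from datetime import datetime

def load_pages(path: str) -> list[dict]:
    """Read the exported page list into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def filter_pages(pages, status=None, min_chars=None, since=None, url_contains=None):
    """Reproduce the dashboard filters: by HTTP code, size, date and URL text."""
    out = []
    for p in pages:
        if status is not None and int(p["http_status"]) != status:
            continue
        if min_chars is not None and int(p["content_length"]) < min_chars:
            continue
        if since is not None and datetime.fromisoformat(p["scanned_at"]) < since:
            continue
        if url_contains is not None and url_contains not in p["url"]:
            continue
        out.append(p)
    return out

pages = load_pages("pages.csv")
recent_ok = filter_pages(pages, status=200, min_chars=500, since=datetime(2024, 5, 1))
print(len(recent_ok), "substantial pages scanned since May 1")
```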

Quality analysis

Indicators of useful pages
  • Substantial content: Pages with at least 200-300 words
  • Clear titles: Meaningful and descriptive title tags
  • HTML structure: Proper use of H1, H2, paragraphs
  • Unique content: Text not duplicated from other pages

Indicators of problematic pages
  • Poor content: Fewer than 50 words of text
  • Mostly navigation: Menus and links with little actual content
  • Duplicate content: Identical to pages already scanned
  • Recurring errors: The page fails on every access attempt
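
Most of these indicators can be checked mechanically once you have the extracted text of each page. A minimal sketch using the word-count thresholds above and a simple hash to catch exact duplicates; the classification labels are illustrative:

```python
import hashlib

def classify(text: str, seen_hashes: set[str]) -> str:
    """Rough quality label based on word count and exact-duplicate detection."""
    words = len(text.split())
    digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

    if digest in seen_hashes:
        return "duplicate"          # identical to an already scanned page
    seen_hashes.add(digest)

    if words < 50:
        return "poor"               # mostly navigation or boilerplate
    if words >= 200:
        return "substantial"        # enough content to be useful
    return "borderline"

seen: set[str] = set()
print(classify("word " * 300, seen))        # substantial
print(classify("word " * 300, seen))        # duplicate
print(classify("menu home login", seen))    # poor
```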

Available actions

Single page management
  • Exclusion: Add URLs to exclusion rules
  • Forced re-scan: Trigger fresh processing of the page
  • Content view: See extracted text
  • Direct opening: Visit the original page

Bulk actions
  • Multiple exclusion: Exclude multiple pages simultaneously
  • Group re-scan: Reprocess multiple pages
  • Export: Download URL lists for external analysis
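
Bulk exclusion pairs well with the export: you can build an exclusion list offline and feed it back into the crawler's rules. A sketch, reusing the same hypothetical CSV columns as the filtering example above:

```python
import csv

def pages_to_exclude(path: str, max_chars: int = 200) -> list[str]:
    """Collect URLs whose extracted content is too small to be useful."""
    urls = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if int(row["content_length"]) < max_chars:
                urls.append(row["url"])
    return urls

# Write one URL per line, ready to paste into the exclusion rules
with open("exclude.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(pages_to_exclude("pages.csv")))
```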

Pattern interpretation

Positive patterns
  • Steady growth: Number of processed pages increases over time
  • Stable or decreasing errors: Indicates optimal configuration
  • Quality content: Most pages have substantial content
  • Complete coverage: All important sections are represented

Problematic patterns
  • Many 404 errors: The site has removed or moved many pages
  • Recurring 403/500 errors: Possible site configuration issues
  • Widespread poor content: The crawler is collecting pages with little useful text
  • Stagnation: No new pages found for a long time
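
Several of these patterns can be detected automatically if you record a snapshot of the page and error counts after each crawl. A sketch with illustrative thresholds; the snapshot format is an assumption, not something the product exports in this shape:

```python
from datetime import date

# One snapshot per crawl: (day, pages processed, pages with errors)
history = [
    (date(2024, 4, 1), 820, 12),
    (date(2024, 4, 8), 915, 14),
    (date(2024, 4, 15), 910, 55),
    (date(2024, 4, 22), 912, 61),
]

def detect_patterns(history, stagnation_runs: int = 3, error_ratio: float = 0.05):
    """Flag stagnation and error spikes from a list of crawl snapshots."""
    alerts = []
    processed = [p for _, p, _ in history]
    # Stagnation: processed-page count unchanged over the last N snapshots
    if len(processed) >= stagnation_runs and len(set(processed[-stagnation_runs:])) == 1:
        alerts.append("stagnation: no new pages found recently")
    # Error spike: error share above threshold in the latest snapshot
    _, latest_pages, latest_errors = history[-1]
    if latest_pages and latest_errors / latest_pages > error_ratio:
        alerts.append(f"error rate {latest_errors / latest_pages:.1%} exceeds {error_ratio:.0%}")
    return alerts

print(detect_patterns(history))
```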

Data-based optimization

Rule improvement
  • Targeted exclusions: Exclude URL patterns that generate only poor content
  • Specific inclusions: Add rules to capture quality content
  • Depth balancing: Adjust depth based on results

Priority management
  • Key pages: Ensure most important pages are always scanned
  • Differentiated frequency: Scan more often sections that change frequently
  • Smart limits: Focus crawl resources on the most useful pages
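
Targeted exclusions and priorities usually come down to URL patterns. A sketch of how such rules might be evaluated against candidate URLs; the regex patterns and rule syntax are examples, not the crawler's own configuration format:

```python
import re

# Illustrative exclusion rules: skip tag pages, paginated archives and print views
EXCLUDE_PATTERNS = [
    r"/tag/",
    r"/page/\d+/?$",
    r"\?print=1$",
]

# Illustrative priority rules: keep documentation and pricing pages fresh
PRIORITY_PATTERNS = [
    r"^/docs/",
    r"^/pricing",
]

def classify_url(path: str) -> str:
    """Decide whether a URL path should be excluded, prioritized or crawled normally."""
    if any(re.search(p, path) for p in EXCLUDE_PATTERNS):
        return "exclude"
    if any(re.search(p, path) for p in PRIORITY_PATTERNS):
        return "priority"
    return "normal"

for path in ["/docs/setup", "/blog/tag/news", "/blog/page/7", "/about"]:
    print(path, "->", classify_url(path))
```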

Reports and export

Automatic reports
  • Weekly summary: Statistics of recent scans
  • Error alerts: Notifications when errors exceed thresholds
  • New content: List of newly found pages

Data export
  • Page CSV: Complete list for external analysis
  • Error reports: For technical problem resolution
  • Temporal statistics: Performance trends over time
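
The exported CSVs are enough to rebuild the automatic reports yourself, for example a summary with an error-threshold alert. A sketch, again assuming hypothetical column names:

```python
import csv
from collections import Counter

def summarize(pages_csv: str, errors_csv: str, alert_ratio: float = 0.05) -> None:
    """Print a short summary from the exported page and error lists."""
    with open(pages_csv, newline="", encoding="utf-8") as f:
        pages = list(csv.DictReader(f))
    with open(errors_csv, newline="", encoding="utf-8") as f:
        errors = list(csv.DictReader(f))

    print(f"Processed pages: {len(pages)}")
    print(f"Pages with errors: {len(errors)}")
    print("Errors by type:", Counter(e["error_type"] for e in errors).most_common(5))

    total = len(pages) + len(errors)
    if total and len(errors) / total > alert_ratio:
        print(f"ALERT: error share {len(errors) / total:.1%} exceeds {alert_ratio:.0%}")

summarize("pages.csv", "errors.csv")
```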