Overview
Ethicrawl is organized into several key modules:

- Ethicrawl: Provides the main interface for working with the library.
- Core: Fundamental components including URL handling, resource management, and standardized data structures. Provides the `Resource`, `ResourceList`, and `Url` classes that are the building blocks of crawling operations.
- Client: HTTP clients for various access methods:
  - `HttpClient`: Standard requests-based client for most web resources
  - `ChromeClient`: Browser automation for JavaScript-heavy sites using Selenium
- Context: Manages the crawling environment, including domain boundaries, allowed paths, and session state. Enforces ethical boundaries during execution.
- Sitemaps: Tools for parsing and using XML sitemaps. Provides classes such as `IndexEntry` and `UrlsetEntry` for the standard sitemap formats.
- Robots: Robots.txt parsing and rule enforcement. Ensures crawlers respect website policies regarding crawlable content.
- Config: Configuration management for controlling crawler behavior. Provides a singleton `Config` class for global settings and per-instance overrides.
- Logger: Customizable logging functionality with sensible defaults for tracking crawler operations.
- Error: Specialized exception classes for error handling, including `DomainWhitelistError`, `RobotDisallowedError`, and other ethical boundary violations (see the sketch after this list).
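As a concrete illustration of how the Context and Error modules interact, the sketch below catches both named exception classes. It assumes, based on the class names and the descriptions above, that `DomainWhitelistError` is raised when a request targets a domain outside the bound scope and `RobotDisallowedError` when robots.txt disallows the path; the URLs are placeholders, not part of the library.

```python
from ethicrawl import Ethicrawl
from ethicrawl.error import DomainWhitelistError, RobotDisallowedError

ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")

try:
    # Assumed behaviour: a URL outside the bound domain raises DomainWhitelistError
    ethicrawl.get("https://other-site.example/page.html")
except DomainWhitelistError:
    print("Blocked: the target domain is outside the crawl boundary")
except RobotDisallowedError:
    print("Blocked: robots.txt disallows this path")

ethicrawl.unbind()
```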
Getting Started
The typical usage flow follows this pattern:
- Create an `Ethicrawl` instance
- Bind to a primary domain
- Access content with the built-in HTTP methods
```python
from ethicrawl import Ethicrawl
from ethicrawl.error import RobotDisallowedError

# Create and bind to a domain
ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")

# Get a page - robots.txt rules automatically respected
try:
    response = ethicrawl.get("https://example.com/page.html")
except RobotDisallowedError:
    print("The site prohibits fetching the page")

# Release resources when done
ethicrawl.unbind()
```
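If you want `unbind()` to run even when a request is refused, the same calls can be wrapped in `try`/`finally`. This is an optional pattern sketch built only from the calls shown above, not something the library requires:

```python
from ethicrawl import Ethicrawl
from ethicrawl.error import RobotDisallowedError

ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")

try:
    response = ethicrawl.get("https://example.com/page.html")
except RobotDisallowedError:
    print("The site prohibits fetching the page")
finally:
    # Release resources whether or not the request succeeded
    ethicrawl.unbind()
```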
For detailed usage examples, see the examples directory in the repository.