Config Module

Configuration system for Ethicrawl.

BaseConfig

Bases: ABC

Abstract base class for configuration components.

All configuration classes inherit from this class to ensure a consistent interface and behavior across the configuration system. Configuration objects can be converted to dictionaries, serialized, and represented as strings with consistent formatting.

Example

>>> from abc import ABC
>>> from ethicrawl.config import BaseConfig
>>>
>>> class MyConfig(BaseConfig):
...     def __init__(self, name="default", value=42):
...         self.name = name
...         self.value = value
...
...     def to_dict(self) -> dict:
...         return {"name": self.name, "value": self.value}
...
>>> config = MyConfig("test", 100)
>>> config.to_dict()
{'name': 'test', 'value': 100}
>>> print(config)
{
  "name": "test",
  "value": 100
}

__repr__()

Default string representation showing config values.

Returns:

  • str: String in the format ClassName({config values})

__str__()

Human-readable string representation.

Returns:

  • str: Pretty-printed JSON representation of the configuration

to_dict() abstractmethod

Convert configuration to a dictionary representation.

Implementations must produce a JSON-serializable dictionary that fully represents the configuration state.

Returns:

  • dict[str, Any]: Dictionary representation of the configuration
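The relationship between to_dict(), __repr__(), and __str__() can be sketched as below. This is a minimal approximation for illustration, not Ethicrawl's source: the class name SketchBaseConfig is hypothetical, and the JSON indent width is an assumption.

```python
import json
from abc import ABC, abstractmethod
from typing import Any


class SketchBaseConfig(ABC):
    """Hypothetical stand-in for BaseConfig (not the library's source)."""

    @abstractmethod
    def to_dict(self) -> dict[str, Any]:
        """Return a JSON-serializable dict of the configuration state."""

    def __repr__(self) -> str:
        # ClassName({config values}), as described for __repr__()
        return f"{self.__class__.__name__}({self.to_dict()})"

    def __str__(self) -> str:
        # Pretty-printed JSON; the indent of 2 is an assumption
        return json.dumps(self.to_dict(), indent=2)


class MyConfig(SketchBaseConfig):
    def __init__(self, name: str = "default", value: int = 42) -> None:
        self.name = name
        self.value = value

    def to_dict(self) -> dict[str, Any]:
        return {"name": self.name, "value": self.value}


config = MyConfig("test", 100)
```

Because both string methods are derived from to_dict(), a subclass in this sketch only has to implement that one method to get consistent formatting.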

Config dataclass

Global configuration singleton for Ethicrawl.

This class provides a centralized, thread-safe configuration system for all components of Ethicrawl. It implements the Singleton pattern to ensure consistent settings throughout the application.

The configuration is organized into sections (http, logger, sitemap) with each section containing component-specific settings.

Thread Safety

All configuration updates are protected by a reentrant lock, ensuring thread-safe operation in multi-threaded crawling scenarios.
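One common way to combine the Singleton pattern with a reentrant lock is sketched below. It is an assumption about the general technique, not a copy of Ethicrawl's implementation; SketchConfig, set_timeout, and the timeout attribute are hypothetical names.

```python
import threading


class SketchConfig:
    """Hypothetical singleton guarded by a reentrant lock."""

    _instance = None
    _lock = threading.RLock()

    def __new__(cls):
        # The lock makes first-time creation safe when threads race
        with cls._lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance.timeout = 30.0  # illustrative setting
                cls._instance = instance
            return cls._instance

    def set_timeout(self, value: float) -> None:
        # An RLock lets the same thread re-acquire the lock in nested calls
        with self._lock:
            self.timeout = value

    @classmethod
    def reset(cls) -> None:
        # Dropping the instance causes a fresh one to be built on next access
        with cls._lock:
            cls._instance = None


a = SketchConfig()
b = SketchConfig()
a.set_timeout(60.0)
```

In this sketch a and b are the same object, so the change made through a is visible through b.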

Integration Features
  • Convert to/from dictionaries for integration with external config systems
  • JSON serialization for storage or transmission
  • Hierarchical structure matches common config formats

Attributes:

  • http (HttpConfig): HTTP-specific configuration (user agent, headers, timeout)
  • logger (LoggerConfig): Logging configuration (levels, format, output)
  • sitemap (SitemapConfig): Sitemap parsing configuration (limits, defaults)

Example

>>> from ethicrawl.config import Config
>>> config = Config()  # Get the global instance
>>> config.http.user_agent = "MyCustomBot/1.0"
>>> config.logger.level = "DEBUG"
>>>
>>> # Thread-safe update of multiple settings at once
>>> config.update({
...     "http": {"timeout": 30},
...     "logger": {"component_levels": {"robots": "DEBUG"}}
... })
>>>
>>> # Get a snapshot for thread-safe reading
>>> snapshot = config.get_snapshot()
>>> print(snapshot.http.timeout)
30
>>>
>>> # Export config for integration with external systems
>>> config_dict = config.to_dict()
>>> config_json = str(config)

__str__()

Format the configuration as a JSON string.

Returns:

  • str: Formatted JSON representation of the configuration

get_snapshot()

Create a thread-safe deep copy of the current configuration.

Returns:

  • Config: A deep copy of the current Config object
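The value of a deep-copied snapshot is that later changes to the live configuration do not leak into it. A minimal sketch of that idea, assuming the copy is taken while holding the lock (SketchSettings and its http dictionary are hypothetical stand-ins, not Ethicrawl classes):

```python
import copy
import threading


class SketchSettings:
    """Hypothetical config object; not Ethicrawl's Config class."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self.http = {"timeout": 30.0, "headers": {"Accept": "*/*"}}

    def get_snapshot(self) -> dict:
        # Copy while holding the lock so no writer can change state mid-copy
        with self._lock:
            return copy.deepcopy(self.http)


settings = SketchSettings()
snapshot = settings.get_snapshot()

# A later change to the live settings, including nested values
settings.http["headers"]["Accept"] = "text/html"
```

A shallow copy would share the nested headers dictionary with the live object; the deep copy is what keeps the snapshot stable.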

reset() classmethod

Reset the singleton instance to default values.

Removes the existing instance from the singleton registry, causing a new instance to be created on next access.

Example

>>> Config.reset()     # Reset to defaults
>>> config = Config()  # Get fresh instance

to_dict()

Convert the configuration to a dictionary.

Returns:

  • dict: A nested dictionary representing all configuration sections

update(config_dict)

Update configuration from a dictionary.

Updates configuration sections based on a nested dictionary structure. The dictionary should have section names as top-level keys and property-value pairs as nested dictionaries.

Parameters:

  • config_dict (dict[str, Any], required): Dictionary with configuration settings

Raises:

  • AttributeError: If trying to set a property that doesn't exist

Example

>>> config.update({
...     "http": {
...         "user_agent": "CustomBot/1.0",
...         "timeout": 30
...     },
...     "logger": {
...         "level": "DEBUG"
...     }
... })
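The nested-dictionary update described here can be approximated as follows. This is a sketch under assumptions, not the library's code: SketchRoot and SketchHttp are hypothetical classes, and raising AttributeError for an unknown section (as opposed to an unknown property) is a guess about behavior the reference does not spell out.

```python
from typing import Any


class SketchHttp:
    """Hypothetical section object with two settings."""

    def __init__(self) -> None:
        self.timeout = 30.0
        self.user_agent = "Ethicrawl/1.0"


class SketchRoot:
    """Hypothetical root config holding one section."""

    def __init__(self) -> None:
        self.http = SketchHttp()

    def update(self, config_dict: dict[str, Any]) -> None:
        for section_name, values in config_dict.items():
            # Assumption: unknown sections are rejected like unknown properties
            if not hasattr(self, section_name):
                raise AttributeError(f"Unknown section: {section_name}")
            section = getattr(self, section_name)
            for key, value in values.items():
                # Refuse to create new attributes silently
                if not hasattr(section, key):
                    raise AttributeError(f"Unknown property: {section_name}.{key}")
                setattr(section, key, value)


root = SketchRoot()
root.update({"http": {"user_agent": "CustomBot/1.0", "timeout": 30}})
```

Checking hasattr before setattr is what turns a misspelled key into an error instead of a silently ignored setting.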

HttpConfig dataclass

Bases: BaseConfig

HTTP client configuration settings for Ethicrawl.

This class manages all HTTP-specific configuration options including timeouts, rate limiting, retries, user agent settings, headers, and proxy configuration. It provides validation for all values to ensure they're within safe and reasonable ranges.

All setters perform type checking and value validation to prevent invalid configurations. The class integrates with the global Config singleton for system-wide settings.

Attributes:

  • timeout (float): Request timeout in seconds (default: 30.0)
  • max_retries (int): Maximum retry attempts for failed requests (default: 3)
  • retry_delay (float): Base delay between retries in seconds (default: 1.0)
  • rate_limit (float | None): Maximum requests per second (default: 0.5)
  • jitter (float): Random variation factor for rate limiting (default: 0.2)
  • user_agent (str): User agent string for requests (default: "Ethicrawl/1.0")
  • headers (Headers): Default headers to include with requests
  • proxies (HttpProxyConfig): Proxy server configuration

Example

>>> from ethicrawl.config import Config
>>>
>>> # Get the global configuration
>>> config = Config()
>>>
>>> # Update HTTP settings
>>> config.http.timeout = 60.0
>>> config.http.user_agent = "MyCustomCrawler/2.0"
>>> config.http.rate_limit = 1.0  # 1 request per second
>>>
>>> # Configure proxy
>>> config.http.proxies = {"http": "http://proxy:8080", "https": "https://proxy:8443"}

headers property writable

Get request headers.

jitter property writable

Random variation factor for rate limiting.

Adds randomness to the timing between requests to make crawling patterns less predictable. The delay actually applied is calculated as: delay * (1 + random() * jitter), so a jitter of 0.2 lengthens each delay by up to 20%.

Valid range: 0.0-1.0
Default: 0.2

Raises:

  • TypeError: If value is not a number
  • ValueError: If value is outside the allowed range
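The jitter formula can be tried out with a short sketch. It assumes the base delay is the reciprocal of rate_limit (consistent with 0.5 requests per second meaning one request every 2 seconds); the function name jittered_delay is hypothetical and is not part of Ethicrawl.

```python
import random


def jittered_delay(rate_limit: float, jitter: float, rng: random.Random) -> float:
    """Return the wait before the next request, in seconds."""
    # Assumption: base delay is 1 / rate_limit (0.5 req/s -> 2.0 s)
    base_delay = 1.0 / rate_limit
    # Documented formula: delay * (1 + random() * jitter)
    return base_delay * (1 + rng.random() * jitter)


rng = random.Random(0)  # seeded only to make the sketch repeatable
delays = [jittered_delay(0.5, 0.2, rng) for _ in range(1000)]
```

With these inputs every delay falls between 2.0 and 2.4 seconds, since the formula only ever lengthens the base delay.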

max_retries property writable

Maximum number of retry attempts for failed requests.

Controls how many times a failed request should be retried before giving up. Uses exponential backoff between attempts.

Valid range: 0-10 (0 means no retries)
Default: 3

Raises:

  • TypeError: If value is not an integer
  • ValueError: If value is negative or > 10
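The reference says retries use exponential backoff but does not give the exact schedule. The sketch below assumes the delay doubles on each attempt, starting from retry_delay; both that doubling factor and the helper name backoff_delays are assumptions for illustration.

```python
def backoff_delays(max_retries: int, retry_delay: float) -> list[float]:
    """Delay before each retry, doubling from the base delay."""
    # Assumed schedule: retry_delay * 2**attempt, with attempt starting at 0
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries)]


schedule = backoff_delays(max_retries=3, retry_delay=1.0)
```

Under that assumption the defaults (3 retries, 1.0 second base delay) would wait 1, 2, and then 4 seconds, and max_retries of 0 produces no waits at all.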

proxies property writable

Proxy server configuration for HTTP requests.

Configures HTTP and HTTPS proxy servers for requests.

Example

>>> config.http.proxies = {
...     "http": "http://proxy:8080",
...     "https": "https://proxy:8443"
... }

Returns:

  • HttpProxyConfig: HttpProxyConfig object with http and https properties

Raises:

  • TypeError: If value is not HttpProxyConfig or dict

rate_limit property writable

Maximum requests per second allowed.

Controls request frequency to avoid overwhelming servers. Set to None to disable rate limiting (not recommended).

Example: 0.5 means maximum of one request every 2 seconds

Valid range: > 0, or None to disable rate limiting
Default: 0.5

Raises:

  • TypeError: If value is not a number
  • ValueError: If value is <= 0

retry_delay property writable

Base delay between retries, in seconds. Default: 1.0

timeout property writable

Request timeout in seconds.

Controls how long to wait for a response before abandoning the request.

Valid range: 0 < timeout <= 300
Default: 30.0

Raises:

  • TypeError: If value is not a number
  • ValueError: If value is <= 0 or > 300
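A validated, writable property of this kind is typically built from a getter and a setter that checks type first and range second. The sketch below illustrates the pattern with the documented timeout limits; SketchHttpSettings is a hypothetical class, and rejecting bool values is an assumption the reference does not confirm.

```python
class SketchHttpSettings:
    """Hypothetical class showing a validated, writable property."""

    def __init__(self) -> None:
        self._timeout = 30.0

    @property
    def timeout(self) -> float:
        return self._timeout

    @timeout.setter
    def timeout(self, value: float) -> None:
        # bool is a subclass of int, so it is rejected explicitly (assumption)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("timeout must be a number")
        # Documented range: 0 < timeout <= 300
        if value <= 0 or value > 300:
            raise ValueError("timeout must be > 0 and <= 300")
        self._timeout = float(value)


http_settings = SketchHttpSettings()
http_settings.timeout = 60
```

Because validation lives in the setter, an invalid assignment fails immediately and the previous value stays in place.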

user_agent property writable

User agent string sent with requests. Default: "Ethicrawl/1.0"

to_dict()

Convert config to dictionary.

HttpProxyConfig dataclass

Bases: BaseConfig

HTTP proxy configuration settings.

Manages proxy server URLs for HTTP and HTTPS connections. Both proxy types can be configured independently and are validated to ensure they contain valid URLs.

Attributes:

  • http (Url | None): HTTP proxy server URL
  • https (Url | None): HTTPS proxy server URL

Example

>>> from ethicrawl.config import Config
>>> config = Config()
>>>
>>> # Configure HTTP proxy
>>> config.http.proxies.http = "http://proxy.example.com:8080"
>>>
>>> # Configure HTTPS proxy
>>> config.http.proxies.https = "http://secure-proxy.example.com:8443"
>>>
>>> # Clear HTTP proxy
>>> config.http.proxies.http = None

http property writable

HTTP proxy server URL.

Returns:

  • Url | None: Url object, or None if not configured

https property writable

HTTPS proxy server URL.

Returns:

  • Url | None: Url object, or None if not configured

to_dict()

Convert proxy configuration to dictionary format.

Returns:

  • dict: Dictionary with 'http' and 'https' keys mapping to URL strings or None

LoggerConfig dataclass

Bases: BaseConfig

Logging configuration for Ethicrawl.

Controls all aspects of Ethicrawl's logging system including output destinations, format, levels, and component-specific settings.

The class provides validation for all settings and supports both numeric and string-based log levels (e.g., "DEBUG" or logging.DEBUG).

Attributes:

  • level (int): Default log level for all components (default: INFO)
  • console_enabled (bool): Whether to log to console (default: True)
  • file_enabled (bool): Whether to write logs to a file (default: False)
  • file_path (str | None): Path where log file should be written (default: None)
  • use_colors (bool): Whether to use colored console output (default: True)
  • format (str): Log message format string
  • component_levels (dict[str, int]): Dictionary of component-specific log levels

Example

>>> from ethicrawl.config import Config
>>> config = Config()
>>>
>>> # Set global log level
>>> config.logger.level = "DEBUG"
>>>
>>> # Enable file logging
>>> config.logger.file_enabled = True
>>> config.logger.file_path = "ethicrawl.log"
>>>
>>> # Set component-specific level
>>> config.logger.set_component_level("robots", "DEBUG")
>>> config.logger.set_component_level("http", "WARNING")

component_levels property

Special log levels for specific components.

Returns a dictionary mapping component names to log levels. Note: Returns a copy to prevent direct mutation.

Example

>>> config.logger.component_levels
{'robots': 10, 'http': 30}
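Returning a copy means callers cannot change the stored levels by editing the dictionary they receive. A minimal sketch of that behavior, using a hypothetical SketchLogger class and assuming a shallow copy is sufficient because the values are plain integers:

```python
class SketchLogger:
    """Hypothetical class exposing a read-only view of its levels."""

    def __init__(self) -> None:
        self._component_levels = {"robots": 10, "http": 30}

    @property
    def component_levels(self) -> dict:
        # Hand out a copy so the internal dict stays private
        return self._component_levels.copy()


logger_cfg = SketchLogger()
levels = logger_cfg.component_levels
levels["robots"] = 50  # edits only the caller's copy
```

After this runs, the stored level for "robots" is still 10; changing it would have to go through a dedicated method such as set_component_level().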

console_enabled property writable

Whether to log to console/stdout.

When True, log messages will be printed to the console. Default: True

Raises:

  • TypeError: If value is not a boolean

file_enabled property writable

Whether to log to a file.

When True, log messages will be written to the file specified by file_path (which must also be set). Default: False

Raises:

  • TypeError: If value is not a boolean

file_path property writable

Path to log file.

Only used when file_enabled is True. Default: None

Raises:

  • TypeError: If value is not a string or None

format property writable

Log message format string.

Uses Python's logging format string syntax. Default: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

Raises:

  • TypeError: If value is not a string
  • ValueError: If value is empty

level property writable

Default log level for all loggers.

Can be set using either integer constants from the logging module or string names like "DEBUG", "INFO", etc.

Default: logging.INFO (20)

Raises:

  • TypeError: If value is not an integer or string
  • ValueError: If value is an invalid level
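Accepting both "DEBUG" and logging.DEBUG usually comes down to normalizing every input to the numeric level. The sketch below shows one way to do that with the standard logging module; normalize_level is a hypothetical helper, and the choices to upper-case names and to pass integers through unchecked are assumptions, not documented Ethicrawl behavior.

```python
import logging
from typing import Union


def normalize_level(level: Union[int, str]) -> int:
    """Convert a level name or number to the numeric logging level."""
    if isinstance(level, bool) or not isinstance(level, (int, str)):
        raise TypeError("level must be an integer or string")
    if isinstance(level, int):
        # Assumption: integers are accepted as-is
        return level
    # logging.getLevelName returns an int for a registered level name
    # and a string such as "Level FOO" for an unknown one
    resolved = logging.getLevelName(level.upper())
    if not isinstance(resolved, int):
        raise ValueError(f"Invalid log level: {level}")
    return resolved


debug_level = normalize_level("DEBUG")
warning_level = normalize_level(logging.WARNING)
```

Storing only the numeric form keeps comparisons between levels simple, which matches the int type listed for the level attribute.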

use_colors property writable

Whether to use colorized output for console logging.

When True, different log levels will be displayed in different colors in terminal output for better readability. Default: True

Raises:

  • TypeError: If value is not a boolean

set_component_level(component_name, level)

Set a specific log level for a component.

Parameters:

  • component_name (str, required): The component name (e.g., "robots", "sitemaps")
  • level (int | str, required): The log level (can be int or level name string)

Raises:

  • TypeError: If level is not an integer or string
  • ValueError: If string level name is not valid

Example

>>> import logging
>>> config.logger.set_component_level("robots", "DEBUG")
>>> config.logger.set_component_level("http", logging.WARNING)

to_dict()

Convert logger configuration to a dictionary.

SitemapConfig dataclass

Bases: BaseConfig

Configuration for sitemap parsing and traversal.

Controls behavior of sitemap parsing including recursion limits, error handling, and filtering options. This configuration affects how sitemaps are discovered, parsed, and which URLs are included in the final results.

Attributes:

  • max_depth (int): Maximum recursion depth for nested sitemaps (default: 5)
  • follow_external (bool): Whether to follow sitemap links to external domains (default: False)
  • validate_urls (bool): Whether to validate URLs before adding them to results (default: True)

Example

>>> from ethicrawl.config import Config
>>> config = Config()
>>>
>>> # Increase recursion depth for complex sites
>>> config.sitemap.max_depth = 10
>>>
>>> # Allow following external domains
>>> config.sitemap.follow_external = True

follow_external property writable

Whether to follow sitemap links to external domains.

When True, sitemap references to other domains will be followed. When False, only sitemaps on the same domain will be processed.

Default: False

Raises:

  • TypeError: If value is not a boolean
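The same-domain decision can be illustrated with the standard library's URL parser. This is a sketch only: should_follow is a hypothetical helper, and treating "same domain" as an exact hostname match is an assumption; the library may compare domains differently.

```python
from urllib.parse import urlsplit


def should_follow(sitemap_url: str, base_url: str, follow_external: bool) -> bool:
    """Decide whether a referenced sitemap should be fetched."""
    if follow_external:
        return True
    # Assumption: "same domain" means an exact hostname match
    return urlsplit(sitemap_url).hostname == urlsplit(base_url).hostname


same = should_follow("https://example.com/s.xml", "https://example.com/", False)
other = should_follow("https://cdn.example.org/s.xml", "https://example.com/", False)
```

With follow_external left at False, only the first reference would be processed in this sketch.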

max_depth property writable

Maximum recursion depth for nested sitemaps.

Controls how many levels of sitemap references will be followed. Many sites use sitemap indexes that point to other sitemaps, and this setting limits how deep that recursion can go.

Valid range: >= 1
Default: 5

Raises:

  • TypeError: If value is not an integer
  • ValueError: If value is less than 1
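How a depth limit bounds traversal of sitemap indexes can be shown with an in-memory example. Everything here is hypothetical: collect_urls, the SITE mapping, and the convention that the root sitemap counts as depth 1 are assumptions for illustration, not Ethicrawl's parser.

```python
def collect_urls(sitemap: str, index: dict, max_depth: int, depth: int = 1) -> list:
    """Walk nested sitemaps, stopping once max_depth levels are reached."""
    urls = []
    for entry in index.get(sitemap, []):
        if entry.endswith(".xml"):
            # A nested sitemap: descend only while under the limit
            if depth < max_depth:
                urls.extend(collect_urls(entry, index, max_depth, depth + 1))
        else:
            urls.append(entry)
    return urls


# Hypothetical in-memory site: an index pointing at two further levels
SITE = {
    "index.xml": ["a.xml", "/home"],
    "a.xml": ["b.xml", "/about"],
    "b.xml": ["/deep"],
}

shallow = collect_urls("index.xml", SITE, max_depth=1)
full = collect_urls("index.xml", SITE, max_depth=3)
```

With max_depth=1 only the root's own pages are returned; raising the limit to 3 reaches the page listed in the innermost sitemap.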

validate_urls property writable

Whether to validate URLs before adding them to results.

When True, each URL found in sitemaps will be validated for proper format before being included in results. This helps filter out malformed URLs but adds some processing overhead.

Default: True

Raises:

  • TypeError: If value is not a boolean

to_dict()

Convert configuration to a dictionary.

Returns:

  • dict: Dictionary with all sitemap configuration values