Config Module
Configuration system for Ethicrawl.
BaseConfig
Bases: ABC
Abstract base class for configuration components.
All configuration classes inherit from this class to ensure a consistent interface and behavior across the configuration system. Configuration objects can be converted to dictionaries, serialized, and represented as strings with consistent formatting.
Example

    >>> from abc import ABC
    >>> from ethicrawl.config import BaseConfig
    >>> class MyConfig(BaseConfig):
    ...     def __init__(self, name="default", value=42):
    ...         self.name = name
    ...         self.value = value
    ...
    ...     def to_dict(self) -> dict:
    ...         return {"name": self.name, "value": self.value}
    >>> config = MyConfig("test", 100)
    >>> config.to_dict()
    {'name': 'test', 'value': 100}
    >>> print(config)
    {
      "name": "test",
      "value": 100
    }
__repr__()
Default string representation showing config values.
Returns:
| Type | Description |
|---|---|
| str | String in format ClassName({config values}) |
__str__()
Human-readable string representation.
Returns:
| Type | Description |
|---|---|
| str | Pretty-printed JSON representation of the configuration |
to_dict()
abstractmethod
Convert configuration to a dictionary representation.
Implementations must produce a JSON-serializable dictionary that fully represents the configuration state.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary representation of the configuration |
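As a rough illustration of the interface described above, the following self-contained sketch shows one way an abstract base could derive `__repr__` and `__str__` from `to_dict()`. `SketchBaseConfig` and `DemoConfig` are hypothetical stand-ins, not Ethicrawl's actual implementation, and the two-space JSON indent is an assumption.

```python
import json
from abc import ABC, abstractmethod
from typing import Any


class SketchBaseConfig(ABC):
    """Hypothetical stand-in for BaseConfig; not the library's code."""

    @abstractmethod
    def to_dict(self) -> dict[str, Any]:
        """Return a JSON-serializable dictionary of the config state."""

    def __repr__(self) -> str:
        # Format: ClassName({config values})
        return f"{self.__class__.__name__}({self.to_dict()})"

    def __str__(self) -> str:
        # Pretty-printed JSON; the indent width is an assumption.
        return json.dumps(self.to_dict(), indent=2)


class DemoConfig(SketchBaseConfig):
    """Hypothetical concrete subclass used only for this sketch."""

    def __init__(self, name: str = "default", value: int = 42) -> None:
        self.name = name
        self.value = value

    def to_dict(self) -> dict[str, Any]:
        return {"name": self.name, "value": self.value}


demo = DemoConfig("test", 100)
```

Because both string methods delegate to `to_dict()`, a subclass only has to implement that one method to get consistent formatting.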
Config
dataclass
Global configuration singleton for Ethicrawl.
This class provides a centralized, thread-safe configuration system for all components of Ethicrawl. It implements the Singleton pattern to ensure consistent settings throughout the application.
The configuration is organized into sections (http, logger, sitemap) with each section containing component-specific settings.
Thread Safety
All configuration updates are protected by a reentrant lock, ensuring thread-safe operation in multi-threaded crawling scenarios.
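A minimal sketch of how a lock-protected singleton can be built in Python, assuming a class-level `threading.RLock` and a `__new__` override. `SketchConfig`, its `timeout` field, and the `set` helper are hypothetical; Ethicrawl's real implementation may be structured differently.

```python
import threading


class SketchConfig:
    """Hypothetical singleton; illustrates the pattern only."""

    _instance = None
    _lock = threading.RLock()  # reentrant: the same thread may re-acquire it

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance.timeout = 30.0  # illustrative default
                cls._instance = instance
            return cls._instance

    def update(self, **settings) -> None:
        # All writes happen while holding the lock.
        with self._lock:
            for key, value in settings.items():
                self.set(key, value)

    def set(self, key: str, value) -> None:
        # Re-acquiring here is safe because the lock is reentrant.
        with self._lock:
            setattr(self, key, value)


first = SketchConfig()
second = SketchConfig()
first.update(timeout=60.0)
```

A reentrant lock matters in this sketch because `update` calls `set` while already holding the lock; a plain `threading.Lock` would deadlock at that point.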
Integration Features
- Convert to/from dictionaries for integration with external config systems
- JSON serialization for storage or transmission
- Hierarchical structure matches common config formats
Attributes:
| Name | Type | Description |
|---|---|---|
| http | HttpConfig | HTTP-specific configuration (user agent, headers, timeout) |
| logger | LoggerConfig | Logging configuration (levels, format, output) |
| sitemap | SitemapConfig | Sitemap parsing configuration (limits, defaults) |
Example

    >>> from ethicrawl.config import Config
    >>> config = Config()  # Get the global instance
    >>> config.http.user_agent = "MyCustomBot/1.0"
    >>> config.logger.level = "DEBUG"

    >>> # Thread-safe update of multiple settings at once
    >>> config.update({
    ...     "http": {"timeout": 30},
    ...     "logger": {"component_levels": {"robots": "DEBUG"}}
    ... })

    >>> # Get a snapshot for thread-safe reading
    >>> snapshot = config.get_snapshot()
    >>> print(snapshot.http.timeout)
    30

    >>> # Export config for integration with external systems
    >>> config_dict = config.to_dict()
    >>> config_json = str(config)
__str__()
Format the configuration as a JSON string.
Returns:
| Type | Description |
|---|---|
| str | Formatted JSON representation of the configuration |
get_snapshot()
Create a thread-safe deep copy of the current configuration.
Returns:
| Type | Description |
|---|---|
| Config | A deep copy of the current Config object |
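The snapshot idea can be sketched with `copy.deepcopy` taken while a lock is held. `SketchSettings` and the module-level `_lock` are hypothetical names for illustration; the lock is kept outside the instance here because lock objects themselves cannot be deep-copied.

```python
import copy
import threading

_lock = threading.RLock()  # module-level so it is not part of the copied state


class SketchSettings:
    """Hypothetical nested settings object."""

    def __init__(self) -> None:
        self.http = {"timeout": 30.0, "headers": {"Accept": "*/*"}}

    def get_snapshot(self) -> "SketchSettings":
        # Copy under the lock so no writer can change state mid-copy.
        with _lock:
            return copy.deepcopy(self)


live = SketchSettings()
snapshot = live.get_snapshot()
live.http["headers"]["Accept"] = "text/html"  # later change to the live object
```

Since the copy is deep, the nested `headers` dictionary in `snapshot` is unaffected by the later change to `live`.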
reset()
classmethod
Reset the singleton instance to default values.
Removes the existing instance from the singleton registry, causing a new instance to be created on next access.
Example

    >>> Config.reset()  # Reset to defaults
    >>> config = Config()  # Get fresh instance
to_dict()
Convert the configuration to a dictionary.
Returns:
| Type | Description |
|---|---|
| dict | A nested dictionary representing all configuration sections |
update(config_dict)
Update configuration from a dictionary.
Updates configuration sections based on a nested dictionary structure. The dictionary should have section names as top-level keys and property-value pairs as nested dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_dict | dict[str, Any] | Dictionary with configuration settings | required |
Raises:
| Type | Description |
|---|---|
| AttributeError | If trying to set a property that doesn't exist |
Example

    >>> config.update({
    ...     "http": {
    ...         "user_agent": "CustomBot/1.0",
    ...         "timeout": 30
    ...     },
    ...     "logger": {
    ...         "level": "DEBUG"
    ...     }
    ... })
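The nested-dictionary update described above could work roughly as follows. This is a sketch, not the library's code: `apply_update` is a hypothetical helper, `SimpleNamespace` objects stand in for the real sections, and raising `AttributeError` for an unknown section (rather than only for an unknown property) is an assumption.

```python
from types import SimpleNamespace
from typing import Any


def apply_update(config: Any, config_dict: dict[str, dict[str, Any]]) -> None:
    """Hypothetical helper: copy nested values onto matching attributes."""
    for section_name, values in config_dict.items():
        if not hasattr(config, section_name):
            # Assumption: unknown sections are rejected the same way.
            raise AttributeError(f"No such section: {section_name}")
        section = getattr(config, section_name)
        for key, value in values.items():
            if not hasattr(section, key):
                raise AttributeError(f"No such property: {section_name}.{key}")
            setattr(section, key, value)


sketch_config = SimpleNamespace(
    http=SimpleNamespace(user_agent="Ethicrawl/1.0", timeout=30.0),
    logger=SimpleNamespace(level="INFO"),
)
apply_update(sketch_config, {"http": {"timeout": 60.0}, "logger": {"level": "DEBUG"}})
```

Checking `hasattr` before `setattr` is what turns a typo in a key into an immediate error instead of a silently created attribute.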
HttpConfig
dataclass
Bases: BaseConfig
HTTP client configuration settings for Ethicrawl.
This class manages all HTTP-specific configuration options including timeouts, rate limiting, retries, user agent settings, headers, and proxy configuration. It provides validation for all values to ensure they're within safe and reasonable ranges.
All setters perform type checking and value validation to prevent invalid configurations. The class integrates with the global Config singleton for system-wide settings.
Attributes:
| Name | Type | Description |
|---|---|---|
| timeout | float | Request timeout in seconds (default: 30.0) |
| max_retries | int | Maximum retry attempts for failed requests (default: 3) |
| retry_delay | float | Base delay between retries in seconds (default: 1.0) |
| rate_limit | float \| None | Maximum requests per second (default: 0.5) |
| jitter | float | Random variation factor for rate limiting (default: 0.2) |
| user_agent | str | User agent string for requests (default: "Ethicrawl/1.0") |
| headers | Headers | Default headers to include with requests |
| proxies | HttpProxyConfig | Proxy server configuration |
Example

    >>> from ethicrawl.config import Config
    >>> # Get the global configuration
    >>> config = Config()
    >>> # Update HTTP settings
    >>> config.http.timeout = 60.0
    >>> config.http.user_agent = "MyCustomCrawler/2.0"
    >>> config.http.rate_limit = 1.0  # 1 request per second
    >>> # Configure proxy
    >>> config.http.proxies = {"http": "http://proxy:8080", "https": "https://proxy:8443"}
headers
property
writable
Get request headers.
jitter
property
writable
Random variation factor for rate limiting.
Adds randomness to the timing between requests to make crawling patterns less predictable. The jittered delay is calculated as: delay * (1 + random() * jitter)
Valid range: 0.0-1.0. Default: 0.2
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a number |
| ValueError | If value outside allowed range |
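To make the formula above concrete, here is a small sketch that applies it to the default settings. `jittered_delay` is a hypothetical helper, and deriving the base delay as `1 / rate_limit` is an assumption based on the `rate_limit` description (0.5 requests per second means one request every 2 seconds).

```python
import random


def jittered_delay(rate_limit: float, jitter: float, rng: random.Random) -> float:
    """Hypothetical helper applying delay * (1 + random() * jitter)."""
    base_delay = 1.0 / rate_limit  # 0.5 requests/second -> 2.0 seconds
    return base_delay * (1 + rng.random() * jitter)


rng = random.Random(0)  # seeded only to make the sketch repeatable
delays = [jittered_delay(rate_limit=0.5, jitter=0.2, rng=rng) for _ in range(5)]
```

With a jitter of 0.2 and a base delay of 2 seconds, every value in `delays` falls between 2.0 and 2.4 seconds, so jitter only ever lengthens the wait.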
max_retries
property
writable
Maximum number of retry attempts for failed requests.
Controls how many times a failed request should be retried before giving up. Uses exponential backoff between attempts.
Valid range: 0-10 (0 means no retries). Default: 3
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not an integer |
| ValueError | If value is negative or > 10 |
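The exact backoff formula is not stated here, so the following is only a sketch of a common scheme: each retry waits `retry_delay * factor ** attempt`. The helper name `backoff_schedule` and the doubling factor of 2.0 are assumptions, not documented Ethicrawl behavior.

```python
def backoff_schedule(max_retries: int, retry_delay: float, factor: float = 2.0) -> list[float]:
    """Hypothetical helper: wait time in seconds before each retry attempt."""
    return [retry_delay * factor**attempt for attempt in range(max_retries)]


# With the documented defaults (max_retries=3, retry_delay=1.0):
schedule = backoff_schedule(max_retries=3, retry_delay=1.0)
# schedule == [1.0, 2.0, 4.0]
```

Setting `max_retries` to 0 yields an empty schedule, which matches the "0 means no retries" rule above.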
proxies
property
writable
Proxy server configuration for HTTP requests.
Configures HTTP and HTTPS proxy servers for requests.
Example

    >>> config.http.proxies = {
    ...     "http": "http://proxy:8080",
    ...     "https": "https://proxy:8443"
    ... }
Returns:
| Type | Description |
|---|---|
| HttpProxyConfig | HttpProxyConfig object with http and https properties |
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not HttpProxyConfig or dict |
rate_limit
property
writable
Maximum requests per second allowed.
Controls request frequency to avoid overwhelming servers. Set to None to disable rate limiting (not recommended).
Example: 0.5 means a maximum of one request every 2 seconds.
Valid range: > 0. Default: 0.5
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a number |
| ValueError | If value is <= 0 |
retry_delay
property
writable
Base delay between retries, in seconds.
timeout
property
writable
Request timeout in seconds.
Controls how long to wait for a response before abandoning the request.
Valid range: 0 < timeout <= 300. Default: 30.0
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a number |
| ValueError | If value is <= 0 or > 300 |
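The type and range checks listed above follow a pattern that can be sketched with a Python property. `SketchHttpSettings` is a hypothetical class, not `HttpConfig` itself, and rejecting `bool` values is an assumption made because `bool` is a subclass of `int`.

```python
class SketchHttpSettings:
    """Hypothetical class showing a validated property."""

    def __init__(self) -> None:
        self._timeout = 30.0

    @property
    def timeout(self) -> float:
        return self._timeout

    @timeout.setter
    def timeout(self, value: float) -> None:
        # bool is a subclass of int, so reject it explicitly (an assumption).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("timeout must be a number")
        if value <= 0 or value > 300:
            raise ValueError("timeout must satisfy 0 < timeout <= 300")
        self._timeout = float(value)


settings = SketchHttpSettings()
settings.timeout = 60  # accepted and stored as 60.0
```

The type check runs before the range check, so a string such as `"fast"` produces `TypeError` rather than a confusing comparison error.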
user_agent
property
writable
User agent string sent with requests.
to_dict()
Convert config to dictionary.
HttpProxyConfig
dataclass
Bases: BaseConfig
HTTP proxy configuration settings.
Manages proxy server URLs for HTTP and HTTPS connections. Both proxy types can be configured independently and are validated to ensure they contain valid URLs.
Attributes:
| Name | Type | Description |
|---|---|---|
| http | Url \| None | HTTP proxy server URL |
| https | Url \| None | HTTPS proxy server URL |
Example

    >>> from ethicrawl.config import Config
    >>> config = Config()
    >>> # Configure HTTP proxy
    >>> config.http.proxies.http = "http://proxy.example.com:8080"
    >>> # Configure HTTPS proxy
    >>> config.http.proxies.https = "http://secure-proxy.example.com:8443"
    >>> # Clear HTTP proxy
    >>> config.http.proxies.http = None
http
property
writable
https
property
writable
to_dict()
Convert proxy configuration to dictionary format.
Returns:
| Type | Description |
|---|---|
| dict | Dict with 'http' and 'https' keys mapping to URL strings or None |
LoggerConfig
dataclass
Bases: BaseConfig
Logging configuration for Ethicrawl.
Controls all aspects of Ethicrawl's logging system including output destinations, format, levels, and component-specific settings.
The class provides validation for all settings and supports both numeric and string-based log levels (e.g., "DEBUG" or logging.DEBUG).
Attributes:
| Name | Type | Description |
|---|---|---|
| level | int | Default log level for all components (default: INFO) |
| console_enabled | bool | Whether to log to console (default: True) |
| file_enabled | bool | Whether to write logs to a file (default: False) |
| file_path | str \| None | Path where log file should be written (default: None) |
| use_colors | bool | Whether to use colored console output (default: True) |
| format | str | Log message format string |
| component_levels | dict[str, int] | Dictionary of component-specific log levels |
Example

    >>> from ethicrawl.config import Config
    >>> config = Config()
    >>> # Set global log level
    >>> config.logger.level = "DEBUG"
    >>> # Enable file logging
    >>> config.logger.file_enabled = True
    >>> config.logger.file_path = "ethicrawl.log"
    >>> # Set component-specific level
    >>> config.logger.set_component_level("robots", "DEBUG")
    >>> config.logger.set_component_level("http", "WARNING")
component_levels
property
Special log levels for specific components.
Returns a dictionary mapping component names to log levels. Note: Returns a copy to prevent direct mutation.
Example

    >>> config.logger.component_levels
    {'robots': 10, 'http': 30}
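Returning a copy, as noted above, means that edits to the returned dictionary never reach the stored settings. A minimal sketch of that behavior, using a hypothetical `SketchLoggerSettings` class:

```python
class SketchLoggerSettings:
    """Hypothetical class; shows a read-only view via copying."""

    def __init__(self) -> None:
        self._component_levels: dict[str, int] = {"robots": 10, "http": 30}

    @property
    def component_levels(self) -> dict[str, int]:
        # Return a shallow copy so callers cannot mutate internal state.
        return self._component_levels.copy()


logger_settings = SketchLoggerSettings()
levels = logger_settings.component_levels
levels["robots"] = 50  # changes only the returned copy
```

In the real class, changes are expected to go through `set_component_level` instead, which applies validation.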
console_enabled
property
writable
Whether to log to console/stdout.
When True, log messages will be printed to the console. Default: True
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a boolean |
file_enabled
property
writable
Whether to log to a file.
When True, log messages will be written to the file specified by file_path (which must also be set). Default: False
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a boolean |
file_path
property
writable
Path to log file.
Only used when file_enabled is True. Default: None
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a string or None |
format
property
writable
Log message format string.
Uses Python's logging format string syntax. Default: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a string |
| ValueError | If value is empty |
level
property
writable
Default log level for all loggers.
Can be set using either integer constants from the logging module or string names like "DEBUG", "INFO", etc.
Default: logging.INFO (20)
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not an integer or string |
| ValueError | If value is an invalid level |
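Accepting both integers and level names usually comes down to a small normalization step. The sketch below uses only the standard `logging` constants; `normalize_level` is a hypothetical helper, and the case-insensitive matching, the rejection of `bool`, and the pass-through of arbitrary integers are assumptions rather than documented behavior.

```python
import logging

# Standard level names from the logging module.
_LEVEL_NAMES = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def normalize_level(level) -> int:
    """Hypothetical helper: accept an int or a level name, return an int."""
    if isinstance(level, bool) or not isinstance(level, (int, str)):
        raise TypeError("level must be an integer or string")
    if isinstance(level, int):
        return level  # assumption: integers are passed through unchanged
    name = level.upper()  # case-insensitive matching is an assumption
    if name not in _LEVEL_NAMES:
        raise ValueError(f"invalid log level: {level!r}")
    return _LEVEL_NAMES[name]
```

For example, `normalize_level("DEBUG")` and `normalize_level(logging.DEBUG)` both return 10, which is why the `level` attribute can be typed as `int` even though strings are accepted on assignment.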
use_colors
property
writable
Whether to use colorized output for console logging.
When True, different log levels will be displayed in different colors in terminal output for better readability. Default: True
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a boolean |
set_component_level(component_name, level)
Set a specific log level for a component.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| component_name | str | The component name (e.g., "robots", "sitemaps") | required |
| level | int \| str | The log level (can be int or level name string) | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If level is not an integer or string |
| ValueError | If string level name is not valid |
Example

    >>> import logging
    >>> config.logger.set_component_level("robots", "DEBUG")
    >>> config.logger.set_component_level("http", logging.WARNING)
to_dict()
Convert logger configuration to a dictionary.
SitemapConfig
dataclass
Bases: BaseConfig
Configuration for sitemap parsing and traversal.
Controls behavior of sitemap parsing including recursion limits, error handling, and filtering options. This configuration affects how sitemaps are discovered, parsed, and which URLs are included in the final results.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_depth | int | Maximum recursion depth for nested sitemaps (default: 5) |
| follow_external | bool | Whether to follow sitemap links to external domains (default: False) |
| validate_urls | bool | Whether to validate URLs before adding them to results (default: True) |
Example

    >>> from ethicrawl.config import Config
    >>> config = Config()
    >>> # Increase recursion depth for complex sites
    >>> config.sitemap.max_depth = 10
    >>> # Allow following external domains
    >>> config.sitemap.follow_external = True
follow_external
property
writable
Whether to follow sitemap links to external domains.
When True, sitemap references to other domains will be followed. When False, only sitemaps on the same domain will be processed.
Default: False
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a boolean |
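One plausible way to express the same-domain rule above is a hostname comparison with the standard library's `urllib.parse`. `should_follow` is a hypothetical helper, and treating "same domain" as "identical hostname" is an assumption; the library may compare domains differently (for example, by allowing subdomains).

```python
from urllib.parse import urlparse


def should_follow(parent_url: str, sitemap_url: str, follow_external: bool) -> bool:
    """Hypothetical helper: decide whether a referenced sitemap is processed."""
    if follow_external:
        return True
    # Treating "same domain" as "same hostname" is an assumption.
    return urlparse(parent_url).hostname == urlparse(sitemap_url).hostname


same_site = should_follow(
    "https://example.com/sitemap.xml", "https://example.com/news.xml", False
)
other_site = should_follow(
    "https://example.com/sitemap.xml", "https://cdn.example.org/s.xml", False
)
```

Here `same_site` is `True` and `other_site` is `False`; passing `follow_external=True` would accept both.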
max_depth
property
writable
Maximum recursion depth for nested sitemaps.
Controls how many levels of sitemap references will be followed. Many sites use sitemap indexes that point to other sitemaps, and this setting limits how deep that recursion can go.
Valid range: >= 1. Default: 5
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not an integer |
| ValueError | If value is less than 1 |
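The effect of a depth limit on nested sitemap indexes can be sketched with an in-memory stand-in for fetched sitemaps. Everything here is hypothetical: `collect_urls`, the `fake_sitemaps` data, the convention that the root sitemap counts as depth 1, and the choice to skip deeper sitemaps silently rather than raise an error.

```python
def collect_urls(
    name: str, sitemaps: dict[str, list[str]], max_depth: int, depth: int = 1
) -> list[str]:
    """Hypothetical traversal over an in-memory stand-in for fetched sitemaps.

    Entries that are keys of `sitemaps` are nested sitemaps; the rest are page URLs.
    """
    urls: list[str] = []
    for entry in sitemaps[name]:
        if entry in sitemaps:
            # Only descend while the depth limit allows it.
            if depth < max_depth:
                urls.extend(collect_urls(entry, sitemaps, max_depth, depth + 1))
        else:
            urls.append(entry)
    return urls


fake_sitemaps = {
    "index.xml": ["news.xml", "https://example.com/"],
    "news.xml": ["archive.xml", "https://example.com/news"],
    "archive.xml": ["https://example.com/news/2020"],
}
shallow = collect_urls("index.xml", fake_sitemaps, max_depth=1)
deep = collect_urls("index.xml", fake_sitemaps, max_depth=5)
```

With `max_depth=1` only the root sitemap's own page URL is collected, while the default of 5 is deep enough to reach all three levels in this example.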
validate_urls
property
writable
Whether to validate URLs before adding them to results.
When True, each URL found in sitemaps will be validated for proper format before being included in results. This helps filter out malformed URLs but adds some processing overhead.
Default: True
Raises:
| Type | Description |
|---|---|
| TypeError | If value is not a boolean |
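What counts as "proper format" is not specified here. As a rough sketch, a format check might require an `http` or `https` scheme and a non-empty host; `is_well_formed` is a hypothetical helper and these two criteria are assumptions.

```python
from urllib.parse import urlparse


def is_well_formed(url: str) -> bool:
    """Hypothetical format check: http(s) scheme and a non-empty host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


found = [
    "https://example.com/page",
    "not a url",
    "ftp://example.com/file",
    "https:///missing-host",
]
valid = [url for url in found if is_well_formed(url)]
```

Only the first entry survives the filter, which illustrates the trade-off described above: malformed entries are dropped at the cost of parsing every URL.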
to_dict()
Convert configuration to a dictionary.
Returns:
| Type | Description |
|---|---|
| dict | Dictionary with all sitemap configuration values |