Core Module
Core types and utilities for resource identification and handling.
Headers
Bases: dict
HTTP headers container with case-insensitive key access.
This class extends the standard dictionary so that HTTP header names are matched case-insensitively, as the HTTP specifications require. Header names must be strings, non-string values are converted to strings, and an instance can be initialized from a dictionary, another Headers instance, any dict-like object, or keyword arguments.
Inherits all dictionary attributes
Examples:
>>> headers = Headers({"Content-Type": "text/html"})
>>> headers["content-type"]
'text/html'
>>> headers["CONTENT-TYPE"] = "application/json"
>>> headers["content-type"]
'application/json'
>>> "content-type" in headers
True
>>> "CONTENT-TYPE" in headers
True
__contains__(key)
Check if header exists with case-insensitive comparison.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | | Header name to check (case-insensitive) | required |
Returns:
| Type | Description |
|---|---|
| bool | True if header exists, False otherwise |
__getitem__(key)
Get header value with case-insensitive key access.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | | Header name (case-insensitive) | required |
Returns:
| Type | Description |
|---|---|
| | The header value |
Raises:
| Type | Description |
|---|---|
| TypeError | If key is not a string |
| KeyError | If header doesn't exist |
__init__(headers=None, **kwargs)
Initialize headers from a dictionary, dict-like object, or keyword arguments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| headers | 'Headers' \| Mapping[str, Any] \| None | Optional dictionary, Headers instance, or any dict-like object | None |
| **kwargs | | Optional keyword arguments to add as headers | {} |
Raises:
| Type | Description |
|---|---|
| TypeError | If headers is not a dict-like object with an items() method |
__setitem__(key, value)
Set header value with case-insensitive key storage.
Setting a header to None will remove it from the collection. Non-string values are automatically converted to strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | Header name (will be converted to lowercase) | required |
| value | str \| None | Header value, or None to remove the header | required |
Raises:
| Type | Description |
|---|---|
| TypeError | If key is not a string |
get(key, default=None)
Get header value with optional default.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | | Header name (case-insensitive) | required |
| default | | Value to return if header doesn't exist | None |
Returns:
| Type | Description |
|---|---|
| str \| None | Header value if it exists, otherwise the default value |
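The behavior documented for Headers (lowercased keys, string-only header names, string conversion of values, and removal on None) can be illustrated with a minimal sketch. SketchHeaders is a hypothetical stand-in written for this example, not the library's implementation, and it overrides only the methods listed above; other dict methods such as update() are left untouched.

```python
from collections.abc import Mapping
from typing import Any, Optional


class SketchHeaders(dict):
    """Hypothetical stand-in for Headers (an assumption, not the real class)."""

    def __init__(self, headers: Optional[Mapping] = None, **kwargs: Any) -> None:
        super().__init__()
        if headers is not None:
            if not hasattr(headers, "items"):
                raise TypeError("headers must be a dict-like object with items()")
            for key, value in headers.items():
                self[key] = value  # route through __setitem__ to normalize
        for key, value in kwargs.items():
            self[key] = value

    @staticmethod
    def _normalize(key: Any) -> str:
        # Header names must be strings; they are stored in lowercase.
        if not isinstance(key, str):
            raise TypeError(f"header name must be a string, got {type(key).__name__}")
        return key.lower()

    def __setitem__(self, key: str, value: Any) -> None:
        name = self._normalize(key)
        if value is None:
            # Assigning None removes the header instead of storing it.
            self.pop(name, None)
        else:
            super().__setitem__(name, str(value))

    def __getitem__(self, key: str) -> str:
        return super().__getitem__(self._normalize(key))

    def __contains__(self, key: object) -> bool:
        return isinstance(key, str) and super().__contains__(key.lower())

    def get(self, key: str, default: Any = None) -> Any:
        return self[key] if key in self else default
```

With this sketch, `SketchHeaders({"Content-Type": "text/html"})["CONTENT-TYPE"]` returns the stored value, and assigning None to any spelling of the name removes the entry.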
Resource
dataclass
URL-identified entity within the crawler system.
Resource is a generic representation of anything addressable by a URL within the Ethicrawl system. It serves as a common foundation for various components like requests, responses, robots.txt files, sitemap entries, etc.
This class provides URL type safety, consistent equality comparison, and proper hashing behavior for all URL-addressable entities.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | Url | The Url object identifying this resource. Can be initialized with either a string or Url object. |
Raises:
| Type | Description |
|---|---|
| TypeError | When initialized with something other than a string or Url object |
Examples:
>>> resource = Resource("https://example.com/robots.txt")
>>> resource.url.path
'/robots.txt'
>>> resource2 = Resource(Url("https://example.com/robots.txt"))
>>> resource == resource2
True
__eq__(other)
Compare resources for equality based on their URLs.
Two resources are considered equal if they have the same URL.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Any | Another Resource object to compare with | required |
Returns:
| Type | Description |
|---|---|
| bool | True if resources have the same URL, False otherwise |
__hash__()
Generate a hash based on the string representation of the URL.
Returns:
| Type | Description |
|---|---|
| int | Integer hash value |
__post_init__()
Validate and normalize the url attribute after initialization.
Converts string URLs to Url objects and raises TypeError for invalid types.
__repr__()
Return a developer-friendly representation.
__str__()
Return the URL as a string for better readability.
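The combination of `__post_init__` normalization with URL-based equality and hashing might look roughly like the sketch below. SketchUrl and SketchResource are hypothetical names used only for illustration; the use of `eq=False` and the string comparison inside `__eq__` are assumptions, not details confirmed by this reference.

```python
from dataclasses import dataclass
from typing import Any, Union
from urllib.parse import urlsplit


class SketchUrl:
    """Hypothetical stand-in for Url: wraps a string and exposes its path."""

    def __init__(self, url: str) -> None:
        self._text = url
        self.path = urlsplit(url).path

    def __str__(self) -> str:
        return self._text


@dataclass(eq=False)  # assumption: equality and hashing are written by hand
class SketchResource:
    url: Union[str, SketchUrl]

    def __post_init__(self) -> None:
        # Normalize plain strings to the URL type; reject anything else.
        if isinstance(self.url, str):
            self.url = SketchUrl(self.url)
        elif not isinstance(self.url, SketchUrl):
            raise TypeError(
                f"url must be a str or SketchUrl, got {type(self.url).__name__}"
            )

    def __eq__(self, other: Any) -> bool:
        if not isinstance(other, SketchResource):
            return NotImplemented
        return str(self.url) == str(other.url)

    def __hash__(self) -> int:
        # Hash the same string that __eq__ compares, so equal resources
        # always produce equal hashes.
        return hash(str(self.url))
```

Because equal objects hash identically, two resources built from the same URL collapse to a single entry in a set or dict key.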
ResourceList
Bases: Generic[T]
Collection of Resource objects with filtering capabilities.
ResourceList provides list-like functionality specialized for managing collections of Resources with additional filtering methods and type safety. The class is generic and can contain any subclass of Resource.
Note
This class has no public attributes as all storage is private.
Examples:
>>> from ethicrawl.core import Resource, ResourceList
>>> resources = ResourceList()
>>> resources.append(Resource("https://example.com/page1"))
>>> resources.append(Resource("https://example.com/page2"))
>>> len(resources)
2
>>> filtered = resources.filter(r"page1")
>>> len(filtered)
1
__getitem__(index)
Get items by index or slice.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| slice | Integer index or slice object | required |
Returns:
| Type | Description |
|---|---|
| T \| ResourceList[T] | Single Resource when indexed with integer, ResourceList when sliced |
__init__(items=None)
Initialize a resource list with optional initial items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| items | list[T] \| None | Optional list of Resource objects to initialize with | None |
Raises:
| Type | Description |
|---|---|
| TypeError | If items is not a list or contains non-Resource objects |
append(item)
Add a resource to the list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | T | Resource object to add | required |
Returns:
| Type | Description |
|---|---|
| ResourceList[T] | Self for method chaining |
Raises:
| Type | Description |
|---|---|
| TypeError | If item is not a Resource object |
extend(items)
Add multiple resources to the list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| items | Iterable[T] | Iterable of Resource objects to add | required |
Returns:
| Type | Description |
|---|---|
| ResourceList[T] | Self for method chaining |
Raises:
| Type | Description |
|---|---|
| TypeError | If any item is not a Resource object |
filter(pattern)
Filter resources by URL pattern.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str \| Pattern | String pattern or compiled regex Pattern to match against URLs | required |
Returns:
| Type | Description |
|---|---|
| ResourceList[T] | New ResourceList containing only matching resources of the same type as original |
to_list()
Convert to a standard Python list.
Returns:
| Type | Description |
|---|---|
| list[T] | A copy of the internal list of resources |
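As a rough illustration of the ResourceList interface (type-checked append/extend that return self, regex filtering, slice-aware indexing, and a copying to_list()), here is a minimal sketch. Item and SketchResourceList are hypothetical names, and the use of `re.search` to match anywhere in the URL is an assumption about how filter() matches.

```python
import re
from typing import Generic, Iterable, List, Optional, Pattern, TypeVar, Union


class Item:
    """Hypothetical stand-in for Resource: str() yields its URL."""

    def __init__(self, url: str) -> None:
        self.url = url

    def __str__(self) -> str:
        return self.url


T = TypeVar("T", bound=Item)


class SketchResourceList(Generic[T]):
    def __init__(self, items: Optional[List[T]] = None) -> None:
        self._items: List[T] = []  # private storage, as the note above says
        if items is not None:
            if not isinstance(items, list):
                raise TypeError("items must be a list")
            self.extend(items)

    def append(self, item: T) -> "SketchResourceList[T]":
        if not isinstance(item, Item):
            raise TypeError("only Item objects can be added")
        self._items.append(item)
        return self  # returning self allows chained calls

    def extend(self, items: Iterable[T]) -> "SketchResourceList[T]":
        for item in items:
            self.append(item)
        return self

    def filter(self, pattern: Union[str, Pattern[str]]) -> "SketchResourceList[T]":
        regex = re.compile(pattern) if isinstance(pattern, str) else pattern
        # Assumption: a match anywhere in the URL counts.
        return SketchResourceList([i for i in self._items if regex.search(str(i))])

    def __getitem__(self, index: Union[int, slice]):
        if isinstance(index, slice):
            return SketchResourceList(self._items[index])
        return self._items[index]

    def __len__(self) -> int:
        return len(self._items)

    def to_list(self) -> List[T]:
        return list(self._items)  # a copy, so callers cannot mutate storage
```

Returning a copy from to_list() means that clearing or reordering the returned list leaves the collection itself unchanged.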
Url
URL parser and manipulation class.
This class provides methods for parsing, validating, and manipulating URLs. Supports HTTP, HTTPS, and file URL schemes with validation and component access. Path extension and query parameter manipulation are provided through the extend() method.
Attributes:
| Name | Type | Description |
|---|---|---|
| scheme | str | URL scheme (http, https, file) |
| netloc | str | Network location/domain (HTTP/HTTPS only) |
| hostname | str | Just the hostname portion of netloc (HTTP/HTTPS only) |
| path | str | URL path component |
| params | str | URL parameters (HTTP/HTTPS only) |
| query | str | Raw query string (HTTP/HTTPS only) |
| query_params | dict[str, Any] | Query string parsed into a dictionary (HTTP/HTTPS only) |
| fragment | str | URL fragment identifier (HTTP/HTTPS only) |
| base | str | Base URL (scheme + netloc) |
Raises:
| Type | Description |
|---|---|
| ValueError | When provided with invalid URLs or when performing invalid operations |
base
property
Get the base URL (scheme and netloc).
Returns:
| Type | Description |
|---|---|
| str | The base URL as a string (e.g., 'https://example.com') |
fragment
property
Get the fragment identifier from the URL.
The fragment appears after # in a URL and typically references a section within a document.
Returns:
| Type | Description |
|---|---|
| str | Fragment string without the # character |
Raises:
| Type | Description |
|---|---|
| ValueError | If called on a non-HTTP(S) URL |
hostname
property
Get just the hostname part.
netloc
property
Get the network location (domain).
params
property
Get URL parameters.
path
property
Get the path component.
query
property
Get the query string.
query_params
property
Get query parameters as a dictionary.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary of query parameter keys and values |
Raises:
| Type | Description |
|---|---|
| ValueError | If called on a non-HTTP(S) URL |
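One way to picture the query_params behavior, including the ValueError for non-HTTP(S) URLs, is the standard-library sketch below. The function name is hypothetical, and flattening the result to one string value per key (the last occurrence winning for repeated keys) is an assumption; the actual value types are not specified here.

```python
from urllib.parse import parse_qsl, urlsplit


def sketch_query_params(url: str) -> dict:
    """Hypothetical helper: parse a URL's query string into a dictionary."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("query parameters are only defined for HTTP(S) URLs")
    # keep_blank_values retains parameters such as "?debug=".
    # dict() keeps the last value when a key is repeated (an assumption).
    return dict(parse_qsl(parts.query, keep_blank_values=True))
```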
scheme
property
Get the URL scheme (file, http or https).
__eq__(other)
Compare URLs for equality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Any | Another Url object or string to compare with | required |
Returns:
| Type | Description |
|---|---|
| bool | True if URLs are equal, False otherwise |
__hash__()
Return a hash of the URL.
The hash is based on the string representation of the URL, ensuring URLs that are equal have the same hash.
Returns:
| Type | Description |
|---|---|
| int | Integer hash value |
__init__(url, validate=False)
Initialize a URL object with parsing and optional validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | Union[str, Url] | String or Url object to parse | required |
| validate | bool | If True, performs additional validation including DNS resolution | False |
Raises:
| Type | Description |
|---|---|
| ValueError | When the URL has an invalid scheme or missing required components |
| ValueError | When validate=True and the hostname cannot be resolved |
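The two failure modes listed above (structural problems at parse time, and unresolvable hostnames when validate=True) could be separated roughly as in this sketch. check_url is a hypothetical helper; the specific rules for what counts as a "missing required component" and the use of `socket.gethostbyname` for DNS resolution are assumptions.

```python
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "file"}


def check_url(url: str, validate: bool = False) -> str:
    """Hypothetical helper: return url unchanged if it passes the checks."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if parts.scheme == "file":
        # Assumption: a file URL needs at least a path.
        if not parts.path:
            raise ValueError("file URLs need a path")
    elif not parts.hostname:
        raise ValueError("HTTP(S) URLs need a network location")
    elif validate:
        # Optional, slower check that touches the network.
        try:
            socket.gethostbyname(parts.hostname)
        except socket.gaierror as exc:
            raise ValueError(f"cannot resolve hostname: {parts.hostname}") from exc
    return url
```

Keeping DNS resolution behind an opt-in flag means ordinary parsing stays fast and works offline, which matches the documented default of validate=False.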
__str__()
Convert URL to string representation.
Returns:
| Type | Description |
|---|---|
| str | Complete URL string |
extend(*args)
Extend the URL with additional path components or query parameters.
This method supports multiple extension patterns:
1. Path extension: extend("path/component")
2. Single parameter: extend("param_name", "param_value")
3. Multiple parameters: extend({"param1": "value1", "param2": "value2"})
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *args | Any | Either a path string, a parameter dict, or name/value parameter pair | () |
Returns:
| Type | Description |
|---|---|
| Url | A new Url object with the extended path or parameters |
Raises:
| Type | Description |
|---|---|
| ValueError | If arguments don't match one of the supported patterns |
| ValueError | If trying to add query parameters to a file:// URL |
Examples:
>>> url = Url("https://example.com/api")
>>> url.extend("v1").extend({"format": "json"})
Url("https://example.com/api/v1?format=json")
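The argument dispatch behind extend() can be sketched with the standard library. extend_url is a hypothetical function that works on plain strings rather than Url objects, and the slash handling and the merging of new parameters over existing ones are assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def extend_url(url: str, *args) -> str:
    """Hypothetical helper: return a new URL string, leaving the input as-is."""
    parts = urlsplit(url)
    if len(args) == 1 and isinstance(args[0], str):
        # Pattern 1: append a path component, avoiding doubled slashes.
        path = parts.path.rstrip("/") + "/" + args[0].lstrip("/")
        return urlunsplit(parts._replace(path=path))
    if len(args) == 1 and isinstance(args[0], dict):
        new_params = args[0]  # Pattern 3: several parameters at once
    elif len(args) == 2:
        new_params = {args[0]: args[1]}  # Pattern 2: one name/value pair
    else:
        raise ValueError("expected a path, a dict, or a name/value pair")
    if parts.scheme == "file":
        raise ValueError("query parameters cannot be added to file:// URLs")
    # Assumption: new parameters override existing ones with the same name.
    merged = dict(parse_qsl(parts.query, keep_blank_values=True))
    merged.update({str(k): str(v) for k, v in new_params.items()})
    return urlunsplit(parts._replace(query=urlencode(merged)))
```

Because each call returns a new value instead of mutating its input, calls can be chained in the same way as the `url.extend("v1").extend({"format": "json"})` example above.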