Core Module

Core types and utilities for resource identification and handling.

`Headers`

Bases: dict

HTTP headers container with case-insensitive key access.

This class extends the standard dictionary so that HTTP header names are looked up case-insensitively, as the HTTP specifications require. Keys are stored in lowercase, non-string values are converted to strings, and instances can be initialized from a dictionary, another Headers instance, any dict-like object, or keyword arguments.

Inherits all attributes of the built-in `dict`.

Examples:

>>> headers = Headers({"Content-Type": "text/html"})
>>> headers["content-type"]
'text/html'
>>> headers["CONTENT-TYPE"] = "application/json"
>>> headers["content-type"]
'application/json'
>>> "content-type" in headers
True
>>> "CONTENT-TYPE" in headers
True

`__contains__(key)`

Check if header exists with case-insensitive comparison.

Parameters:

Name	Type	Description	Default
`key`		Header name to check (case-insensitive)	required

Returns:

Type	Description
`bool`	True if header exists, False otherwise

`__getitem__(key)`

Get header value with case-insensitive key access.

Parameters:

Name	Type	Description	Default
`key`		Header name (case-insensitive)	required

Returns:

Type	Description
	The header value

Raises:

Type	Description
`TypeError`	If key is not a string
`KeyError`	If header doesn't exist

`__init__(headers=None, **kwargs)`

Initialize headers from a dictionary, dict-like object, or keyword arguments.

Parameters:

Name	Type	Description	Default
`headers`	`'Headers' \| Mapping[str, Any] \| None`	Optional dictionary, Headers instance, or any dict-like object	`None`
`**kwargs`		Optional keyword arguments to add as headers	`{}`

Raises:

Type	Description
`TypeError`	If headers is not a dict-like object with an items() method
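
The initialization rules above can be illustrated with a small stand-in class. This is a minimal sketch under stated assumptions, not Ethicrawl's implementation: `SketchHeaders` is a hypothetical name, and applying keyword arguments after the mapping is an assumption of the sketch.

```python
class SketchHeaders(dict):
    """Hypothetical stand-in for the documented initialization rules."""

    def __init__(self, headers=None, **kwargs):
        super().__init__()
        if headers is not None:
            # Accept anything dict-like, i.e. anything exposing items()
            if not callable(getattr(headers, "items", None)):
                raise TypeError("headers must be a dict-like object with items()")
            for name, value in headers.items():
                self[name.lower()] = str(value)
        # Assumption: keyword arguments are applied after the mapping
        for name, value in kwargs.items():
            self[name.lower()] = str(value)


headers = SketchHeaders({"Content-Type": "text/html"}, accept="*/*")
```

A list of pairs such as `[("a", "b")]` has no `items()` method, so the sketch rejects it with `TypeError`, mirroring the documented behavior.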

`__setitem__(key, value)`

Set header value with case-insensitive key storage.

Setting a header to None will remove it from the collection. Non-string values are automatically converted to strings.

Parameters:

Name	Type	Description	Default
`key`	`str`	Header name (will be converted to lowercase)	required
`value`	`str \| None`	Header value, or None to remove the header	required

Raises:

Type	Description
`TypeError`	If key is not a string
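
The storage rules described above (lowercased keys, string conversion, and removal on `None`) might look roughly like the following. This is an illustrative sketch only; `SketchHeaders` is a hypothetical class, not the library's code.

```python
class SketchHeaders(dict):
    """Hypothetical stand-in illustrating the documented Headers rules."""

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError("header name must be a string")
        key = key.lower()  # keys are stored in lowercase
        if value is None:
            # Assigning None removes the header if it is present
            self.pop(key, None)
            return
        super().__setitem__(key, str(value))  # non-string values become strings

    def __getitem__(self, key):
        if not isinstance(key, str):
            raise TypeError("header name must be a string")
        return super().__getitem__(key.lower())

    def __contains__(self, key):
        return isinstance(key, str) and super().__contains__(key.lower())


headers = SketchHeaders()
headers["Content-Length"] = 42   # stored as the string "42"
headers["X-Trace"] = "abc"
headers["x-trace"] = None        # removes the header set on the previous line
```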

`get(key, default=None)`

Get header value with optional default.

Parameters:

Name	Type	Description	Default
`key`		Header name (case-insensitive)	required
`default`		Value to return if header doesn't exist	`None`

Returns:

Type	Description
`str \| None`	Header value if it exists, otherwise the default value

`Resource` `dataclass`

URL-identified entity within the crawler system.

Resource is a generic representation of anything addressable by a URL within the Ethicrawl system. It serves as a common foundation for various components like requests, responses, robots.txt files, sitemap entries, etc.

This class provides URL type safety, consistent equality comparison, and proper hashing behavior for all URL-addressable entities.

Attributes:

Name	Type	Description
`url`	`Url`	The Url object identifying this resource. Can be initialized with either a string or Url object.

Raises:

Type	Description
`TypeError`	When initialized with something other than a string or Url object

Examples:

>>> resource = Resource("https://example.com/robots.txt")
>>> resource.url.path
'/robots.txt'
>>> resource2 = Resource(Url("https://example.com/robots.txt"))
>>> resource == resource2
True

`__eq__(other)`

Compare resources for equality based on their URLs.

Two resources are considered equal if they have the same URL.

Parameters:

Name	Type	Description	Default
`other`	`Any`	Another Resource object to compare with	required

Returns:

Type	Description
`bool`	True if resources have the same URL, False otherwise

`__hash__()`

Generate a hash based on the string representation of the URL.

Returns:

Type	Description
`int`	Integer hash value
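
URL-based equality and hashing make resources usable as set members and dictionary keys. The sketch below shows the idea with a hypothetical `SketchResource` that holds a plain string; the real class wraps a `Url` object, so treat this as an approximation.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False so the hand-written __eq__/__hash__ are used
class SketchResource:
    """Hypothetical stand-in: identity is determined by the URL alone."""

    url: str

    def __eq__(self, other):
        if not isinstance(other, SketchResource):
            return False
        return str(self.url) == str(other.url)

    def __hash__(self):
        # Hash the same string that __eq__ compares, so equal objects hash equally
        return hash(str(self.url))


seen = {
    SketchResource("https://example.com/robots.txt"),
    SketchResource("https://example.com/robots.txt"),
    SketchResource("https://example.com/sitemap.xml"),
}
```

Because the two `robots.txt` entries compare equal and hash equally, the set keeps only one of them.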

`__post_init__()`

Validate and normalize the url attribute after initialization.

Converts string URLs to Url objects and raises TypeError for invalid types.
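
The normalization step can be sketched with the standard library. Here `urllib.parse.SplitResult` stands in for the `Url` type and `SketchResource` is a hypothetical name; the snippet only illustrates the convert-or-reject pattern, not the library's own code.

```python
from dataclasses import dataclass
from urllib.parse import SplitResult, urlsplit


@dataclass
class SketchResource:
    """Hypothetical stand-in that normalizes its url field after init."""

    url: SplitResult

    def __post_init__(self):
        if isinstance(self.url, str):
            # Strings are parsed into the structured URL type
            self.url = urlsplit(self.url)
        elif not isinstance(self.url, SplitResult):
            raise TypeError("url must be a string or a parsed URL object")


resource = SketchResource("https://example.com/robots.txt")
```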

`__repr__()`

Return a developer-friendly representation.

`__str__()`

Return the URL as a string for better readability.

`ResourceList`

Bases: Generic[T]

Collection of Resource objects with filtering capabilities.

ResourceList provides list-like functionality specialized for managing collections of Resources with additional filtering methods and type safety. The class is generic and can contain any subclass of Resource.

Note

This class has no public attributes as all storage is private.

Examples:

>>> from ethicrawl.core import Resource, ResourceList
>>> resources = ResourceList()
>>> resources.append(Resource("https://example.com/page1"))
>>> resources.append(Resource("https://example.com/page2"))
>>> len(resources)
2
>>> filtered = resources.filter(r"page1")
>>> len(filtered)
1

`__getitem__(index)`

Get items by index or slice.

Parameters:

Name	Type	Description	Default
`index`	`int \| slice`	Integer index or slice object	required

Returns:

Type	Description
`T \| ResourceList[T]`	Single Resource when indexed with integer, ResourceList when sliced
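
Returning the container type for slices and a single element for integer indexes is a common pattern. A minimal sketch, using a hypothetical `SketchList` that stores plain strings instead of Resource objects:

```python
class SketchList:
    """Hypothetical list wrapper illustrating index vs. slice access."""

    def __init__(self, items=None):
        self._items = list(items or [])

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slices are wrapped again so the caller keeps the same type
            return type(self)(self._items[index])
        return self._items[index]


pages = SketchList(["a", "b", "c"])
first = pages[0]       # a single element
head = pages[0:2]      # another SketchList
```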

`__init__(items=None)`

Initialize a resource list with optional initial items.

Parameters:

Name	Type	Description	Default
`items`	`list[T] \| None`	Optional list of Resource objects to initialize with	`None`

Raises:

Type	Description
`TypeError`	If items is not a list or contains non-Resource objects

`append(item)`

Add a resource to the list.

Parameters:

Name	Type	Description	Default
`item`	`T`	Resource object to add	required

Returns:

Type	Description
`ResourceList[T]`	Self for method chaining

Raises:

Type	Description
`TypeError`	If item is not a Resource object

`extend(items)`

Add multiple resources to the list.

Parameters:

Name	Type	Description	Default
`items`	`Iterable[T]`	Iterable of Resource objects to add	required

Returns:

Type	Description
`ResourceList[T]`	Self for method chaining

Raises:

Type	Description
`TypeError`	If any item is not a Resource object
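
Because `append` and `extend` both return the list itself, calls can be chained. The sketch below illustrates that pattern with hypothetical `Item` and `SketchList` classes; in this sketch `extend` adds items one at a time, so items before an invalid one are kept, which may differ from the library's behavior.

```python
class Item:
    """Hypothetical element type standing in for Resource."""

    def __init__(self, url):
        self.url = url


class SketchList:
    """Hypothetical container with type-checked, chainable mutators."""

    def __init__(self):
        self._items = []

    def __len__(self):
        return len(self._items)

    def append(self, item):
        if not isinstance(item, Item):
            raise TypeError("only Item objects can be added")
        self._items.append(item)
        return self  # returning self enables chaining

    def extend(self, items):
        for item in items:
            self.append(item)  # reuses the type check in append
        return self


pages = SketchList().append(Item("https://example.com/a")).extend(
    [Item("https://example.com/b"), Item("https://example.com/c")]
)
```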

`filter(pattern)`

Filter resources by URL pattern.

Parameters:

Name	Type	Description	Default
`pattern`	`str \| Pattern`	String pattern or compiled regex Pattern to match against URLs	required

Returns:

Type	Description
`ResourceList[T]`	New ResourceList containing only matching resources of the same type as original
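
Pattern filtering can be sketched with the `re` module. The helper `filter_urls` is hypothetical and works on plain strings; it assumes search semantics (a match anywhere in the URL), which is consistent with the class example where `page1` matches a full URL, but the source does not state this explicitly.

```python
import re


def filter_urls(urls, pattern):
    """Return the URLs matched by a string or compiled pattern."""
    if isinstance(pattern, str):
        pattern = re.compile(pattern)
    # Assumption: match anywhere in the URL, not only at the start
    return [url for url in urls if pattern.search(url)]


urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/about",
]
by_string = filter_urls(urls, r"page\d")
by_compiled = filter_urls(urls, re.compile(r"/about$"))
```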

`to_list()`

Convert to a standard Python list.

Returns:

Type	Description
`list[T]`	A copy of the internal list of resources

`Url`

URL parser and manipulation class.

This class provides methods for parsing, validating, and manipulating URLs. It supports the HTTP, HTTPS, and file URL schemes, with validation and access to individual components. Path extension and query parameter manipulation are provided through the extend() method.

Attributes:

Name	Type	Description
`scheme`	`str`	URL scheme (http, https, file)
`netloc`	`str`	Network location/domain (HTTP/HTTPS only)
`hostname`	`str`	Just the hostname portion of netloc (HTTP/HTTPS only)
`path`	`str`	URL path component
`params`	`str`	URL parameters (HTTP/HTTPS only)
`query`	`str`	Raw query string (HTTP/HTTPS only)
`query_params`	`dict[str, Any]`	Query string parsed into a dictionary (HTTP/HTTPS only)
`fragment`	`str`	URL fragment identifier (HTTP/HTTPS only)
`base`	`str`	Base URL (scheme + netloc)

Raises:

Type	Description
`ValueError`	When provided with invalid URLs or when performing invalid operations

`base` `property`

Get the base URL (scheme and netloc).

Returns:

Type	Description
`str`	The base URL as a string (e.g., 'https://example.com')
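
For comparison, the same value can be derived with the standard library. `base_of` is a hypothetical helper shown only to make "scheme and netloc" concrete; it is not part of Ethicrawl.

```python
from urllib.parse import urlsplit


def base_of(url: str) -> str:
    """Return scheme and netloc, dropping path, query, and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


base = base_of("https://example.com/a/b?x=1#top")
```

Note that netloc includes an explicit port when one is present, so `https://example.com:8443/x` yields `https://example.com:8443`.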

`fragment` `property`

Get the fragment identifier from the URL.

The fragment appears after # in a URL and typically references a section within a document.

Returns:

Type	Description
`str`	Fragment string without the # character

Raises:

Type	Description
`ValueError`	If called on a non-HTTP(S) URL

`hostname` `property`

Get just the hostname part.

`netloc` `property`

Get the network location (domain).

`params` `property`

Get URL parameters.

`path` `property`

Get the path component.

`query` `property`

Get the query string.

`query_params` `property`

Get query parameters as a dictionary.

Returns:

Type	Description
`dict[str, Any]`	Dictionary of query parameter keys and values

Raises:

Type	Description
`ValueError`	If called on a non-HTTP(S) URL
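
A rough standard-library equivalent is sketched below. `query_params_of` is a hypothetical helper; how the library handles repeated keys or value types is not stated in the source, so the choices here (last value wins, all values are strings) are assumptions.

```python
from urllib.parse import parse_qsl, urlsplit


def query_params_of(url: str) -> dict:
    """Parse the query string of an HTTP(S) URL into a dictionary."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("query parameters are only available on HTTP(S) URLs")
    # Assumption: for repeated keys, the last value wins
    return dict(parse_qsl(parts.query))


params = query_params_of("https://example.com/search?q=crawler&page=2")
```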

`scheme` `property`

Get the URL scheme (file, http or https).

`__eq__(other)`

Compare URLs for equality.

Parameters:

Name	Type	Description	Default
`other`	`Any`	Another Url object or string to compare with	required

Returns:

Type	Description
`bool`	True if URLs are equal, False otherwise

`__hash__()`

Return a hash of the URL.

The hash is based on the string representation of the URL, ensuring URLs that are equal have the same hash.

Returns:

Type	Description
`int`	Integer hash value

`__init__(url, validate=False)`

Initialize a URL object with parsing and optional validation.

Parameters:

Name	Type	Description	Default
`url`	`Union[str, Url]`	String or Url object to parse	required
`validate`	`bool`	If True, performs additional validation including DNS resolution	`False`

Raises:

Type	Description
`ValueError`	When the URL has an invalid scheme or missing required components
`ValueError`	When validate=True and the hostname cannot be resolved
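
The two validation stages can be approximated as follows. `check_url` is a hypothetical helper, and the exact checks the library performs are not documented here; the sketch assumes a scheme allow-list, a hostname requirement for HTTP(S), and an optional DNS lookup via `socket.getaddrinfo`.

```python
import socket
from urllib.parse import urlsplit


def check_url(url: str, validate: bool = False) -> str:
    """Return the URL unchanged if it passes the sketched checks."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https", "file"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if parts.scheme in ("http", "https"):
        if not parts.hostname:
            raise ValueError("HTTP(S) URLs need a hostname")
        if validate:
            # Optional, slower step: requires working name resolution
            try:
                socket.getaddrinfo(parts.hostname, None)
            except socket.gaierror as exc:
                raise ValueError(f"cannot resolve {parts.hostname!r}") from exc
    return url


checked = check_url("https://example.com/robots.txt")
```

Keeping DNS resolution behind a flag that defaults to off means constructing a URL never touches the network unless the caller asks for it.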

`__str__()`

Convert URL to string representation.

Returns:

Type	Description
`str`	Complete URL string

`extend(*args)`

Extend the URL with additional path components or query parameters.

This method supports multiple extension patterns:

1. Path extension: extend("path/component")
2. Single parameter: extend("param_name", "param_value")
3. Multiple parameters: extend({"param1": "value1", "param2": "value2"})

Parameters:

Name	Type	Description	Default
`*args`	`Any`	Either a path string, a parameter dict, or name/value parameter pair	`()`

Returns:

Type	Description
`Url`	A new Url object with the extended path or parameters

Raises:

Type	Description
`ValueError`	If arguments don't match one of the supported patterns
`ValueError`	If trying to add query parameters to a file:// URL

Examples:

>>> url = Url("https://example.com/api")
>>> url.extend("v1").extend({"format": "json"})
Url("https://example.com/api/v1?format=json")
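
The three call patterns can be sketched on plain strings with `urllib.parse`. `extend_url` is a hypothetical function, not the library's method; details such as slash handling and how existing parameters are merged are assumptions of the sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def extend_url(url: str, *args) -> str:
    """Return a new URL string with an extra path segment or query parameters."""
    parts = urlsplit(url)
    if len(args) == 1 and isinstance(args[0], str):
        # Pattern 1: path extension, joined with exactly one slash
        path = parts.path.rstrip("/") + "/" + args[0].lstrip("/")
        return urlunsplit(parts._replace(path=path))
    if len(args) == 1 and isinstance(args[0], dict):
        new_params = args[0]                 # Pattern 3: several parameters
    elif len(args) == 2:
        new_params = {args[0]: args[1]}      # Pattern 2: one name/value pair
    else:
        raise ValueError("unsupported arguments")
    if parts.scheme == "file":
        raise ValueError("query parameters cannot be added to file:// URLs")
    params = dict(parse_qsl(parts.query))    # keep parameters already present
    params.update(new_params)
    return urlunsplit(parts._replace(query=urlencode(params)))


step1 = extend_url("https://example.com/api", "v1")
step2 = extend_url(step1, {"format": "json"})
step3 = extend_url(step2, "page", 2)
```

Each call returns a new string and leaves its input untouched, which mirrors the documented behavior of returning a new Url rather than mutating the original.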
