Core Module

Core types and utilities for resource identification and handling.

Headers

Bases: dict

HTTP headers container with case-insensitive key access.

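This class extends the standard dictionary so that header names can be read, written, and tested for membership regardless of case, matching HTTP's rule that header field names are case-insensitive. Header names must be strings, non-string values are converted to strings, and the class can be initialized from a dictionary, another Headers instance, any dict-like object, or keyword arguments.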

Inherits all attributes of dict.

Examples:

>>> headers = Headers({"Content-Type": "text/html"})
>>> headers["content-type"]
'text/html'
>>> headers["CONTENT-TYPE"] = "application/json"
>>> headers["content-type"]
'application/json'
>>> "content-type" in headers
True
>>> "CONTENT-TYPE" in headers
True

__contains__(key)

Check if header exists with case-insensitive comparison.

Parameters:

Name Type Description Default
key

Header name to check (case-insensitive)

required

Returns:

Type Description
bool

True if header exists, False otherwise

__getitem__(key)

Get header value with case-insensitive key access.

Parameters:

Name Type Description Default
key

Header name (case-insensitive)

required

Returns:

Type Description

The header value

Raises:

Type Description
TypeError

If key is not a string

KeyError

If header doesn't exist

__init__(headers=None, **kwargs)

Initialize headers from a dictionary, dict-like object, or keyword arguments.

Parameters:

Name Type Description Default
headers Headers | Mapping[str, Any] | None

Optional dictionary, Headers instance, or any dict-like object

None
**kwargs

Optional keyword arguments to add as headers

{}

Raises:

Type Description
TypeError

If headers is not a dict-like object with an items() method

__setitem__(key, value)

Set header value with case-insensitive key storage.

Setting a header to None will remove it from the collection. Non-string values are automatically converted to strings.

Parameters:

Name Type Description Default
key str

Header name (will be converted to lowercase)

required
value str | None

Header value, or None to remove the header

required

Raises:

Type Description
TypeError

If key is not a string
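
The storage rules described for `__setitem__` (lowercase keys, None removes the header, non-string values become strings) can be illustrated with a minimal sketch. The class name `CaseInsensitiveHeaders` is hypothetical and this is not Ethicrawl's actual implementation; it only mirrors the documented behaviour on top of a plain dict.

```python
class CaseInsensitiveHeaders(dict):
    """Hypothetical stand-in for Headers, written only to illustrate the documented rules."""

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError("header name must be a string")
        key = key.lower()  # keys are stored in lowercase
        if value is None:
            # Assigning None removes the header (no error if it is absent).
            self.pop(key, None)
        else:
            # Non-string values are converted to strings before storage.
            super().__setitem__(key, str(value))

    def __getitem__(self, key):
        if not isinstance(key, str):
            raise TypeError("header name must be a string")
        return super().__getitem__(key.lower())

    def __contains__(self, key):
        return isinstance(key, str) and super().__contains__(key.lower())


headers = CaseInsensitiveHeaders()
headers["Content-Length"] = 42   # stored as the string "42"
headers["X-Trace"] = "abc"
headers["x-trace"] = None        # removes the header set on the previous line
```

After these assignments only the `content-length` key remains, and it can be read back under any casing.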

get(key, default=None)

Get header value with optional default.

Parameters:

Name Type Description Default
key

Header name (case-insensitive)

required
default

Value to return if header doesn't exist

None

Returns:

Type Description
str | None

Header value if it exists, otherwise the default value
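
One plausible reason `get` is documented as its own method: in CPython, the built-in `dict.get` does not call an overridden `__getitem__`, so a dict subclass that lowercases keys has to override `get` as well. The sketch below demonstrates this with two hypothetical classes (`WithoutGet`, `WithGet`); it is an illustration of the language behaviour, not a description of how Headers is implemented.

```python
class WithoutGet(dict):
    """Lowercases keys on write and read, but inherits dict.get unchanged."""

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())


class WithGet(WithoutGet):
    """Adds a get() that routes through the case-insensitive __getitem__."""

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default


a = WithoutGet()
a["Accept"] = "text/html"
b = WithGet()
b["Accept"] = "text/html"

missing_without = a.get("ACCEPT")      # dict.get looks up "ACCEPT" literally
found_with = b.get("ACCEPT")           # override lowercases the key first
fallback = b.get("X-Missing", "n/a")   # default is returned for absent headers
```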

Resource dataclass

URL-identified entity within the crawler system.

Resource is a generic representation of anything addressable by a URL within the Ethicrawl system. It serves as a common foundation for components such as requests, responses, robots.txt files, and sitemap entries.

This class provides URL type safety, consistent equality comparison, and proper hashing behavior for all URL-addressable entities.

Attributes:

Name Type Description
url Url

The Url object identifying this resource. Can be initialized with either a string or Url object.

Raises:

Type Description
TypeError

When initialized with something other than a string or Url object

Examples:

>>> resource = Resource("https://example.com/robots.txt")
>>> resource.url.path
'/robots.txt'
>>> resource2 = Resource(Url("https://example.com/robots.txt"))
>>> resource == resource2
True

__eq__(other)

Compare resources for equality based on their URLs.

Two resources are considered equal if they have the same URL.

Parameters:

Name Type Description Default
other Any

Another Resource object to compare with

required

Returns:

Type Description
bool

True if resources have the same URL, False otherwise

__hash__()

Generate a hash based on the string representation of the URL.

Returns:

Type Description
int

Integer hash value

__post_init__()

Validate and normalize the url attribute after initialization.

Converts string URLs to Url objects and raises TypeError for invalid types.
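
The combination of `__post_init__` normalization with URL-based `__eq__` and `__hash__` can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `AddressedThing` is hypothetical, and the standard library's `SplitResult` stands in for Ethicrawl's Url type.

```python
from dataclasses import dataclass
from typing import Union
from urllib.parse import SplitResult, urlsplit


@dataclass
class AddressedThing:
    """Hypothetical stand-in for Resource; SplitResult plays the role of Url."""

    url: Union[SplitResult, str]

    def __post_init__(self):
        # Normalize strings to a parsed URL object; reject anything else.
        if isinstance(self.url, str):
            self.url = urlsplit(self.url)
        elif not isinstance(self.url, SplitResult):
            raise TypeError(f"url must be a string or parsed URL, got {type(self.url).__name__}")

    def __eq__(self, other):
        if not isinstance(other, AddressedThing):
            return NotImplemented
        return self.url.geturl() == other.url.geturl()

    def __hash__(self):
        # Hash the string form so that equal objects hash equally.
        return hash(self.url.geturl())


a = AddressedThing("https://example.com/robots.txt")
b = AddressedThing(urlsplit("https://example.com/robots.txt"))
unique = {a, b}  # both entries collapse into one set member
```

Because equality and hashing use the same string form, instances built from a string and from a parsed URL behave identically in sets and as dictionary keys.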

__repr__()

Return a developer-friendly representation.

__str__()

Return the URL as a string for better readability.

ResourceList

Bases: Generic[T]

Collection of Resource objects with filtering capabilities.

ResourceList provides list-like functionality specialized for managing collections of Resources with additional filtering methods and type safety. The class is generic and can contain any subclass of Resource.

Note

This class has no public attributes as all storage is private.

Examples:

>>> from ethicrawl.core import Resource, ResourceList
>>> resources = ResourceList()
>>> resources.append(Resource("https://example.com/page1"))
>>> resources.append(Resource("https://example.com/page2"))
>>> len(resources)
2
>>> filtered = resources.filter(r"page1")
>>> len(filtered)
1

__getitem__(index)

Get items by index or slice.

Parameters:

Name Type Description Default
index int | slice

Integer index or slice object

required

Returns:

Type Description
T | ResourceList[T]

Single Resource when indexed with integer, ResourceList when sliced
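
The difference between integer and slice indexing can be sketched with a small wrapper class. `MiniList` is a hypothetical name, and plain strings stand in for Resource objects; the sketch only shows the return-type behaviour described above.

```python
class MiniList:
    """Hypothetical list wrapper illustrating int vs. slice indexing."""

    def __init__(self, items=None):
        self._items = list(items or [])

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice yields a new wrapper of the same class.
            return MiniList(self._items[index])
        # An integer yields the single stored item.
        return self._items[index]


pages = MiniList([
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
])
first = pages[0]   # a single item
tail = pages[1:]   # a MiniList holding the last two items
```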

__init__(items=None)

Initialize a resource list with optional initial items.

Parameters:

Name Type Description Default
items list[T] | None

Optional list of Resource objects to initialize with

None

Raises:

Type Description
TypeError

If items is not a list or contains non-Resource objects

append(item)

Add a resource to the list.

Parameters:

Name Type Description Default
item T

Resource object to add

required

Returns:

Type Description
ResourceList[T]

Self for method chaining

Raises:

Type Description
TypeError

If item is not a Resource object

extend(items)

Add multiple resources to the list.

Parameters:

Name Type Description Default
items Iterable[T]

Iterable of Resource objects to add

required

Returns:

Type Description
ResourceList[T]

Self for method chaining

Raises:

Type Description
TypeError

If any item is not a Resource object
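
Because `append` and `extend` both return the list itself, calls can be chained. The sketch below shows that pattern together with the type check; `TypedChain` and its `item_type` parameter are hypothetical, and strings stand in for Resource objects.

```python
class TypedChain:
    """Hypothetical stand-in for ResourceList; accepts only item_type instances."""

    def __init__(self, item_type):
        self._item_type = item_type
        self._items = []

    def append(self, item):
        if not isinstance(item, self._item_type):
            raise TypeError(
                f"expected {self._item_type.__name__}, got {type(item).__name__}"
            )
        self._items.append(item)
        return self  # returning self is what enables chaining

    def extend(self, items):
        for item in items:
            self.append(item)  # reuse the per-item type check
        return self

    def to_list(self):
        return list(self._items)  # a copy, so callers cannot mutate internal state


urls = (
    TypedChain(str)
    .append("https://example.com/a")
    .extend(["https://example.com/b", "https://example.com/c"])
)
```

In this sketch `extend` adds items one at a time, so an invalid item leaves the earlier ones in place. Whether ResourceList behaves the same way is not stated in this reference.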

filter(pattern)

Filter resources by URL pattern.

Parameters:

Name Type Description Default
pattern str | Pattern

String pattern or compiled regex Pattern to match against URLs

required

Returns:

Type Description
ResourceList[T]

New ResourceList containing only matching resources of the same type as original
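
Accepting either a string or a compiled pattern is commonly handled by compiling strings on entry. The sketch below assumes search semantics (a match anywhere in the URL), which is consistent with the `filter(r"page1")` example earlier in this section but is not confirmed by the reference. The function name `filter_urls` is hypothetical and operates on plain strings.

```python
import re


def filter_urls(urls, pattern):
    """Return the URLs in which `pattern` matches anywhere (hypothetical helper)."""
    regex = re.compile(pattern) if isinstance(pattern, str) else pattern
    return [url for url in urls if regex.search(url)]


urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/about",
]
by_string = filter_urls(urls, r"page1")
by_compiled = filter_urls(urls, re.compile(r"/page\d$"))
```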

to_list()

Convert to a standard Python list.

Returns:

Type Description
list[T]

A copy of the internal list of resources

Url

URL parser and manipulation class.

This class provides methods for parsing, validating, and manipulating URLs. It supports the HTTP, HTTPS, and file URL schemes, with validation and access to individual URL components. Path extension and query parameter manipulation are provided through the extend() method.

Attributes:

Name Type Description
scheme str

URL scheme (http, https, file)

netloc str

Network location/domain (HTTP/HTTPS only)

hostname str

Just the hostname portion of netloc (HTTP/HTTPS only)

path str

URL path component

params str

URL parameters (HTTP/HTTPS only)

query str

Raw query string (HTTP/HTTPS only)

query_params dict[str, Any]

Query string parsed into a dictionary (HTTP/HTTPS only)

fragment str

URL fragment identifier (HTTP/HTTPS only)

base str

Base URL (scheme + netloc)

Raises:

Type Description
ValueError

When provided with invalid URLs or when performing invalid operations
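
The attribute names above correspond to the components of a parsed URL. As a rough orientation, the sketch below shows what each component contains using the standard library's `urllib.parse`. Whether Url is built on `urllib.parse`, and how it represents repeated query keys, are assumptions not confirmed by this reference.

```python
from urllib.parse import parse_qs, urlparse

parts = urlparse("https://example.com:8443/docs/page;v=1?lang=en&tag=a&tag=b#intro")

scheme = parts.scheme        # "https"
netloc = parts.netloc        # host plus port
hostname = parts.hostname    # host only
path = parts.path            # path without the ;params suffix
params = parts.params        # text after ";" in the last path segment
query = parts.query          # raw query string
fragment = parts.fragment    # text after "#", without the "#" itself

# Base URL as described above: scheme + netloc.
base = f"{parts.scheme}://{parts.netloc}"

# parse_qs maps each key to a list of values; Url.query_params may differ.
query_params = parse_qs(parts.query)
```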

base property

Get the base URL (scheme and netloc).

Returns:

Type Description
str

The base URL as a string (e.g., 'https://example.com')

fragment property

Get the fragment identifier from the URL.

The fragment appears after # in a URL and typically references a section within a document.

Returns:

Type Description
str

Fragment string without the # character

Raises:

Type Description
ValueError

If called on a non-HTTP(S) URL

hostname property

Get just the hostname part.

netloc property

Get the network location (domain).

params property

Get URL parameters.

path property

Get the path component.

query property

Get the query string.

query_params property

Get query parameters as a dictionary.

Returns:

Type Description
dict[str, Any]

Dictionary of query parameter keys and values

Raises:

Type Description
ValueError

If called on a non-HTTP(S) URL

scheme property

Get the URL scheme (file, http or https).

__eq__(other)

Compare URLs for equality.

Parameters:

Name Type Description Default
other Any

Another Url object or string to compare with

required

Returns:

Type Description
bool

True if URLs are equal, False otherwise

__hash__()

Return a hash of the URL.

The hash is based on the string representation of the URL, ensuring URLs that are equal have the same hash.

Returns:

Type Description
int

Integer hash value

__init__(url, validate=False)

Initialize a URL object with parsing and optional validation.

Parameters:

Name Type Description Default
url Union[str, Url]

String or Url object to parse

required
validate bool

If True, performs additional validation including DNS resolution

False

Raises:

Type Description
ValueError

When the URL has an invalid scheme or missing required components

ValueError

When validate=True and the hostname cannot be resolved

__str__()

Convert URL to string representation.

Returns:

Type Description
str

Complete URL string

extend(*args)

Extend the URL with additional path components or query parameters.

This method supports multiple extension patterns:

1. Path extension: extend("path/component")
2. Single parameter: extend("param_name", "param_value")
3. Multiple parameters: extend({"param1": "value1", "param2": "value2"})

Parameters:

Name Type Description Default
*args Any

Either a path string, a parameter dict, or a name/value parameter pair

()

Returns:

Type Description
Url

A new Url object with the extended path or parameters

Raises:

Type Description
ValueError

If arguments don't match one of the supported patterns

ValueError

If trying to add query parameters to a file:// URL

Examples:

>>> url = Url("https://example.com/api")
>>> url.extend("v1").extend({"format": "json"})
Url("https://example.com/api/v1?format=json")
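
The three extension patterns can be sketched as a standalone function over URL strings. This is a minimal illustration, not Ethicrawl's implementation: the name `extend_url` is hypothetical, and the slash handling and parameter-merging rules are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def extend_url(url, *args):
    """Return a new URL string with an extra path component or query parameters."""
    parts = urlsplit(url)

    # Pattern 1: a single string extends the path.
    if len(args) == 1 and isinstance(args[0], str):
        path = parts.path.rstrip("/") + "/" + args[0].lstrip("/")
        return urlunsplit(parts._replace(path=path))

    # Pattern 3: a dict supplies several parameters; pattern 2: a name/value pair.
    if len(args) == 1 and isinstance(args[0], dict):
        new_params = args[0]
    elif len(args) == 2:
        new_params = {args[0]: args[1]}
    else:
        raise ValueError("expected a path, a (name, value) pair, or a dict")

    if parts.scheme == "file":
        raise ValueError("cannot add query parameters to a file:// URL")

    # Merge with existing parameters; new values replace old ones (an assumption).
    merged = dict(parse_qsl(parts.query))
    merged.update(new_params)
    return urlunsplit(parts._replace(query=urlencode(merged)))


versioned = extend_url("https://example.com/api", "v1")
with_format = extend_url(versioned, {"format": "json"})
with_pair = extend_url("https://example.com/search", "q", "crawler")
```

Like the documented method, the function returns a new value instead of modifying its input, which is what makes chained calls safe.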