HttpClient Module

Client interfaces for making HTTP requests to resources.

`HttpClient`

Bases: Client

HTTP client implementation with configurable transports and rate limiting.

This client provides a flexible HTTP interface with the following features:

- Configurable backend transport (Requests or Selenium Chrome)
- Built-in rate limiting with jitter to avoid detection
- Header management with User-Agent control
- Automatic retry with exponential backoff
- Detailed logging of request/response cycles
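
The retry behaviour is not specified in detail on this page. As a rough sketch of what "automatic retry with exponential backoff" usually means, the standalone function below retries a callable and doubles the delay after each failure. The names `retry_with_backoff`, `max_attempts`, `base_delay`, and `sleep` are hypothetical and are not part of the `HttpClient` API; retrying on `IOError` is an assumption based on the exception that `get()` is documented to raise.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on IOError with doubling delays.

    Hypothetical helper for illustration; not part of ethicrawl.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay grows as base_delay * 2**attempt: 0.5s, 1s, 2s, ...
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting.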

The client can use either a simple RequestsTransport for basic HTTP operations or a ChromeTransport for JavaScript-rendered content.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `timeout` | `int` | Request timeout in seconds |
| `min_interval` | `float` | Minimum time between requests in seconds |
| `jitter` | `float` | Random time variation added to rate limiting |
| `headers` | `Headers` | Default headers to send with each request |
| `last_request_time` | `float` | Timestamp of the last request |
| `user_agent` | `str` | User agent string used for requests |

Example

    >>> from ethicrawl.client.http import HttpClient
    >>> from ethicrawl.core import Resource
    >>> client = HttpClient(rate_limit=1.0)  # 1 request per second
    >>> response = client.get(Resource("https://example.com"))
    >>> print(response.status_code)
    200
    >>>
    >>> # Switch to Chrome for JavaScript-heavy sites
    >>> chrome_client = client.with_chrome(headless=True)
    >>> js_response = chrome_client.get(Resource("https://spa-example.com"))

`__init__(context=None, transport=None, timeout=10, rate_limit=1.0, jitter=0.5, headers=None, chrome_params=None)`

Initialize an HTTP client with configurable transport and rate limiting.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `context` | `Context` | Context for the client. If None, a default context with a dummy URL will be created. | `None` |
| `transport` | `Transport` | Custom transport implementation. If None, either ChromeTransport or RequestsTransport will be used. | `None` |
| `timeout` | `int` | Request timeout in seconds | `10` |
| `rate_limit` | `float` | Maximum requests per second. Set to 0 for no limit. | `1.0` |
| `jitter` | `float` | Random variation (0-1) to add to rate limiting | `0.5` |
| `headers` | `dict` | Default headers to send with each request | `None` |
| `chrome_params` | `dict` | Parameters for ChromeTransport if used | `None` |
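
To make the relationship between `rate_limit`, `min_interval`, `jitter`, and `last_request_time` more concrete, here is a minimal sketch of a jittered rate limiter. It assumes that `min_interval` is `1 / rate_limit` and that jitter adds a random extra delay of up to `jitter` seconds; the page does not confirm either detail. The class `JitteredRateLimiter` and its methods are hypothetical and are not ethicrawl's implementation.

```python
import random
import time
from typing import Optional


class JitteredRateLimiter:
    """Hypothetical limiter for illustration; not ethicrawl's implementation."""

    def __init__(self, rate_limit: float = 1.0, jitter: float = 0.5) -> None:
        # rate_limit is requests per second; 0 disables limiting.
        self.min_interval = 1.0 / rate_limit if rate_limit > 0 else 0.0
        self.jitter = jitter
        self.last_request_time: Optional[float] = None

    def required_wait(self, now: float) -> float:
        """Return how many seconds to wait before the next request."""
        if self.min_interval == 0.0 or self.last_request_time is None:
            return 0.0
        # Assumption: jitter adds up to `jitter` extra seconds to the interval.
        target = self.min_interval + random.uniform(0.0, self.jitter)
        elapsed = now - self.last_request_time
        return max(0.0, target - elapsed)

    def wait(self) -> None:
        """Sleep as needed, then record the request time."""
        delay = self.required_wait(time.monotonic())
        if delay > 0:
            time.sleep(delay)
        self.last_request_time = time.monotonic()
```

Under these assumptions, `rate_limit=2.0` gives a base interval of 0.5 seconds, and each wait lands somewhere between 0.5 and 1.0 seconds after the previous request.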

`get(resource, timeout=None, headers=None)`

Make a GET request to the specified resource.

This method applies rate limiting, handles headers, and logs the result. For JavaScript-heavy sites, use with_chrome() first to switch to a Chrome-based transport.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `resource` | `Resource` | The resource to request | required |
| `timeout` | `int` | Request-specific timeout that overrides the client's default timeout | `None` |
| `headers` | `dict` | Additional headers for this request | `None` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `HttpResponse` | `HttpResponse` | Response object with status, headers and content |

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If resource is not a Resource instance |
| `IOError` | If the HTTP request fails for any reason |

Example

    >>> client = HttpClient()
    >>> response = client.get(Resource("https://example.com"))
    >>> if response.status_code == 200:
    ...     print(f"Got {len(response.content)} bytes")
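
The per-request `timeout` and `headers` arguments can be thought of as overrides layered on the client's defaults. The sketch below shows one plausible way to combine them. The function `resolve_request_options` is hypothetical, and the merge order (per-request headers win over defaults) is an assumption rather than documented behaviour.

```python
from typing import Dict, Optional, Tuple


def resolve_request_options(
    default_timeout: int,
    default_headers: Dict[str, str],
    timeout: Optional[int] = None,
    headers: Optional[Dict[str, str]] = None,
) -> Tuple[int, Dict[str, str]]:
    """Combine client defaults with per-request overrides.

    Hypothetical helper; the merge order is an assumption.
    """
    # A per-request timeout, when given, replaces the client default.
    effective_timeout = default_timeout if timeout is None else timeout
    # Per-request headers are layered on top of a copy of the defaults.
    merged = dict(default_headers)
    merged.update(headers or {})
    return effective_timeout, merged
```

Copying the defaults before merging means a single request cannot alter the headers used by later requests.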

`with_chrome(headless=True, wait_time=3, timeout=30, rate_limit=0.5, jitter=0.3)`

Create a new HttpClient instance using Chrome/Selenium transport.

This creates a new client that can render JavaScript and interact with dynamic web applications.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `headless` | `bool` | Whether to run Chrome in headless mode | `True` |
| `wait_time` | `int` | Default time to wait for page elements in seconds | `3` |
| `timeout` | `int` | Request timeout in seconds | `30` |
| `rate_limit` | `float` | Maximum requests per second | `0.5` |
| `jitter` | `float` | Random variation factor for rate limiting | `0.3` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `HttpClient` | `HttpClient` | A new client instance configured to use Chrome |

Example

    >>> client = HttpClient()
    >>> chrome = client.with_chrome(headless=True)
    >>> response = chrome.get(Resource("https://single-page-app.com"))

`HttpRequest` `dataclass`

Bases: Request

HTTP-specific request implementation with timeout and header management.

This class extends the base Request with HTTP-specific functionality, including configurable timeout and header handling. It automatically applies default headers from the global configuration while allowing custom headers to take precedence.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `url` | `Url` | The target URL (inherited from Request) |
| `headers` | `Headers` | HTTP headers to send with the request |
| `_timeout` | `float` | Request timeout in seconds |

Example

    >>> from ethicrawl.client.http import HttpRequest
    >>> from ethicrawl.core import Url
    >>> req = HttpRequest(Url("https://example.com"))
    >>> req.headers["User-Agent"] = "EthiCrawl/1.0"
    >>> req.timeout = 15.0

`timeout` `property` `writable`

Get the request timeout in seconds.

Returns:

| Type | Description |
| --- | --- |
| `float` | The timeout value in seconds |

`__post_init__()`

Initialize and validate the request after creation.

Ensures headers are a proper Headers instance and applies default headers from configuration if not already present.
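
The pattern described here (fill in configured defaults, but let caller-supplied headers win) can be sketched with a plain dataclass. Everything in this block is a stand-in: `SketchRequest`, `DEFAULT_HEADERS`, the default timeout of 30 seconds, and the validation in the `timeout` setter are assumptions for illustration, and a plain `dict` replaces the library's `Headers` type.

```python
from dataclasses import dataclass, field
from typing import Dict

# Stand-in for the library's global configuration (hypothetical values).
DEFAULT_HEADERS: Dict[str, str] = {"User-Agent": "ExampleBot/1.0", "Accept": "*/*"}


@dataclass
class SketchRequest:
    """Illustrative request; not the real HttpRequest."""

    url: str
    _timeout: float = 30.0  # hypothetical default
    headers: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Copy so the caller's dict is not mutated, then fill in missing defaults.
        self.headers = dict(self.headers)
        for name, value in DEFAULT_HEADERS.items():
            self.headers.setdefault(name, value)

    @property
    def timeout(self) -> float:
        return self._timeout

    @timeout.setter
    def timeout(self, value: float) -> None:
        # Assumption: reject non-numeric or non-positive timeouts.
        if not isinstance(value, (int, float)) or value <= 0:
            raise ValueError("timeout must be a positive number")
        self._timeout = float(value)
```

With this sketch, `SketchRequest("https://example.com", headers={"User-Agent": "Custom/2.0"})` keeps the custom User-Agent and picks up the default `Accept` header.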

`HttpResponse` `dataclass`

Bases: Response

HTTP-specific response implementation with status codes and text content.

This class extends the base Response with HTTP-specific attributes and behaviors, including status code, headers, and separate text content representation. It provides robust validation and a comprehensive string representation for debugging and logging.

The HttpResponse maintains the connection between the original request and the response while enforcing type safety and data consistency.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `request` | `HttpRequest` | The request that generated this response |
| `status_code` | `int` | HTTP status code (200, 404, etc.) |
| `headers` | `Headers` | HTTP response headers |
| `content` | `bytes` | Binary content of the response (inherited from Response) |
| `text` | `str` | Text content decoded from binary content (for text responses) |
| `url` | `Url` | The response URL, which may differ from request URL after redirects |

Example

    >>> from ethicrawl.client.http import HttpRequest, HttpResponse
    >>> from ethicrawl.core import Resource, Headers
    >>> req = HttpRequest(Resource("https://example.com"))
    >>> resp = HttpResponse(
    ...     request=req,
    ...     status_code=200,
    ...     content=b"<html><body>Example</body></html>",
    ...     text="<html><body>Example</body></html>",
    ...     headers=Headers({"Content-Type": "text/html"})
    ... )
    >>> resp.status_code
    200
    >>> "html" in resp.text
    True

`__post_init__()`

Validate the response attributes after initialization.

Performs type checking and value validation for:

- Status code (must be int between 100-599)
- Request (must be HttpRequest instance)
- Content (must be bytes or None)
- Text (must be str or None)

Also calls the parent class `__post_init__` for further validation.

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If any attribute has an invalid type |
| `ValueError` | If status_code is outside valid HTTP range (100-599) |
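
The validation rules listed for this method can be approximated with a small dataclass. This is a sketch only: `SketchResponse` is a hypothetical class, it omits the `request` check because `HttpRequest` is not available here, and rejecting `bool` status codes is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchResponse:
    """Illustrative response; not the real HttpResponse."""

    status_code: int
    content: Optional[bytes] = None
    text: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: bool is a subclass of int, so reject it explicitly.
        if isinstance(self.status_code, bool) or not isinstance(self.status_code, int):
            raise TypeError("status_code must be an int")
        if not 100 <= self.status_code <= 599:
            raise ValueError("status_code must be between 100 and 599")
        if self.content is not None and not isinstance(self.content, bytes):
            raise TypeError("content must be bytes or None")
        if self.text is not None and not isinstance(self.text, str):
            raise TypeError("text must be str or None")
```

Type checks run before the range check, so a string such as `"200"` raises `TypeError` rather than `ValueError`.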

`__str__()`

Format a human-readable representation of the response.

Creates a formatted multi-line string containing:

- Status code
- URL (showing both response URL and request URL if they differ)
- Headers
- Content summary (preview for text, byte count for binary)
- Text preview for text content types (up to 300 chars)

Returns:

| Type | Description |
| --- | --- |
| `str` | String representation of the response with formatted content preview |
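
As a rough illustration of the kind of summary described above, the function below builds a multi-line string with a 300-character text preview. The exact layout produced by `HttpResponse.__str__()` is not shown on this page, so the labels, ordering, and the function `format_response` itself are hypothetical.

```python
from typing import Dict, Optional


def format_response(
    status_code: int,
    url: str,
    request_url: str,
    headers: Dict[str, str],
    content: Optional[bytes],
    text: Optional[str],
    preview_chars: int = 300,
) -> str:
    """Build a multi-line summary; the layout is illustrative only."""
    lines = [f"HTTP {status_code}"]
    # Show the request URL only when a redirect changed the final URL.
    if url != request_url:
        lines.append(f"URL: {url} (requested: {request_url})")
    else:
        lines.append(f"URL: {url}")
    lines.append("Headers:")
    lines.extend(f"  {name}: {value}" for name, value in headers.items())
    lines.append(f"Content: {len(content or b'')} bytes")
    if text:
        # Truncate long text and mark the cut with an ellipsis.
        preview = text[:preview_chars]
        suffix = "..." if len(text) > preview_chars else ""
        lines.append(f"Text: {preview}{suffix}")
    return "\n".join(lines)
```

Capping the preview keeps log lines bounded even when a response body is large.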
