Skip to content

HttpClient Module

Client interfaces for making HTTP requests to resources.

HttpClient

Bases: Client

HTTP client implementation with configurable transports and rate limiting.

This client provides a flexible HTTP interface with the following features: - Configurable backend transport (Requests or Selenium Chrome) - Built-in rate limiting with jitter to avoid detection - Header management with User-Agent control - Automatic retry with exponential backoff - Detailed logging of request/response cycles

The client can use either a simple RequestsTransport for basic HTTP operations or a ChromeTransport for JavaScript-rendered content.

Attributes:

Name Type Description
timeout int

Request timeout in seconds

min_interval float

Minimum time between requests in seconds

jitter float

Random time variation added to rate limiting

headers Headers

Default headers to send with each request

last_request_time float

Timestamp of the last request

user_agent str

User agent string used for requests

Example

from ethicrawl.client.http import HttpClient from ethicrawl.core import Resource client = HttpClient(rate_limit=1.0) # 1 request per second response = client.get(Resource("https://example.com")) print(response.status_code) 200

Switch to Chrome for JavaScript-heavy sites

chrome_client = client.with_chrome(headless=True) js_response = chrome_client.get(Resource("https://spa-example.com"))

__init__(context=None, transport=None, timeout=10, rate_limit=1.0, jitter=0.5, headers=None, chrome_params=None)

Initialize an HTTP client with configurable transport and rate limiting.

Parameters:

Name Type Description Default
context Context

Context for the client. If None, a default context with a dummy URL will be created.

None
transport Transport

Custom transport implementation. If None, either ChromeTransport or RequestsTransport will be used.

None
timeout int

Request timeout in seconds

10
rate_limit float

Maximum requests per second. Set to 0 for no limit.

1.0
jitter float

Random variation (0-1) to add to rate limiting

0.5
headers dict

Default headers to send with each request

None
chrome_params dict

Parameters for ChromeTransport if used

None

get(resource, timeout=None, headers=None)

Make a GET request to the specified resource.

This method applies rate limiting, handles headers, and logs the result. For JavaScript-heavy sites, use with_chrome() first to switch to a Chrome-based transport.

Parameters:

Name Type Description Default
resource Resource

The resource to request

required
timeout int

Request-specific timeout that overrides the client's default timeout

None
headers dict

Additional headers for this request

None

Returns:

Name Type Description
HttpResponse HttpResponse

Response object with status, headers and content

Raises:

Type Description
TypeError

If resource is not a Resource instance

IOError

If the HTTP request fails for any reason

Example

client = HttpClient() response = client.get(Resource("https://example.com")) if response.status_code == 200: ... print(f"Got {len(response.content)} bytes")

with_chrome(headless=True, wait_time=3, timeout=30, rate_limit=0.5, jitter=0.3)

Create a new HttpClient instance using Chrome/Selenium transport.

This creates a new client that can render JavaScript and interact with dynamic web applications.

Parameters:

Name Type Description Default
headless bool

Whether to run Chrome in headless mode

True
wait_time int

Default time to wait for page elements in seconds

3
timeout int

Request timeout in seconds

30
rate_limit float

Maximum requests per second

0.5
jitter float

Random variation factor for rate limiting

0.3

Returns:

Name Type Description
HttpClient HttpClient

A new client instance configured to use Chrome

Example

client = HttpClient() chrome = client.with_chrome(headless=True) response = chrome.get(Resource("https://single-page-app.com"))

HttpRequest dataclass

Bases: Request

HTTP-specific request implementation with timeout and header management.

This class extends the base Request with HTTP-specific functionality, including configurable timeout and header handling. It automatically applies default headers from the global configuration while allowing custom headers to take precedence.

Attributes:

Name Type Description
url Url

The target URL (inherited from Request)

headers Headers

HTTP headers to send with the request

_timeout float

Request timeout in seconds

Example

from ethicrawl.client.http import HttpRequest from ethicrawl.core import Url req = HttpRequest(Url("https://example.com")) req.headers["User-Agent"] = "EthiCrawl/1.0" req.timeout = 15.0

timeout property writable

Get the request timeout in seconds.

Returns:

Type Description
float

The timeout value in seconds

__post_init__()

Initialize and validate the request after creation.

Ensures headers are a proper Headers instance and applies default headers from configuration if not already present.

HttpResponse dataclass

Bases: Response

HTTP-specific response implementation with status codes and text content.

This class extends the base Response with HTTP-specific attributes and behaviors, including status code, headers, and separate text content representation. It provides robust validation and a comprehensive string representation for debugging and logging.

The HttpResponse maintains the connection between the original request and the response while enforcing type safety and data consistency.

Attributes:

Name Type Description
request HttpRequest

The request that generated this response

status_code int

HTTP status code (200, 404, etc.)

headers Headers

HTTP response headers

content bytes

Binary content of the response (inherited from Response)

text str

Text content decoded from binary content (for text responses)

url Url

The response URL, which may differ from request URL after redirects

Example

from ethicrawl.client.http import HttpRequest, HttpResponse from ethicrawl.core import Resource, Headers req = HttpRequest(Resource("https://example.com")) resp = HttpResponse( ... request=req, ... status_code=200, ... content=b"Example", ... text="Example", ... headers=Headers({"Content-Type": "text/html"}) ... ) resp.status_code 200 "html" in resp.text True

__post_init__()

Validate the response attributes after initialization.

Performs type checking and value validation for: - Status code (must be int between 100-599) - Request (must be HttpRequest instance) - Content (must be bytes or None) - Text (must be str or None)

Also calls the parent class post_init for further validation.

Raises:

Type Description
TypeError

If any attribute has an invalid type

ValueError

If status_code is outside valid HTTP range (100-599)

__str__()

Format a human-readable representation of the response.

Creates a formatted multi-line string containing: - Status code - URL (showing both response URL and request URL if they differ) - Headers - Content summary (preview for text, byte count for binary) - Text preview for text content types (up to 300 chars)

Returns:

Type Description
str

String representation of the response with formatted content preview