HttpClient Module
Client interfaces for making HTTP requests to resources.
HttpClient
Bases: Client
HTTP client implementation with configurable transports and rate limiting.
This client provides a flexible HTTP interface with the following features:

- Configurable backend transport (Requests or Selenium Chrome)
- Built-in rate limiting with jitter to avoid detection
- Header management with User-Agent control
- Automatic retry with exponential backoff
- Detailed logging of request/response cycles
The client can use either a simple RequestsTransport for basic HTTP operations or a ChromeTransport for JavaScript-rendered content.
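The retry policy itself is not documented on this page. As a rough illustration of what "retry with exponential backoff" usually means, here is a minimal standalone sketch. The names `backoff_delays` and `call_with_retry`, the doubling factor, and the delay cap are all assumptions made for the example, not part of the ethicrawl API.

```python
import time


def backoff_delays(max_retries=3, base_delay=1.0, factor=2.0, max_delay=30.0):
    """Return the wait in seconds before each retry: base, base*factor, ... capped."""
    return [min(base_delay * factor ** attempt, max_delay)
            for attempt in range(max_retries)]


def call_with_retry(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on IOError, wait and retry up to max_retries times.

    Hypothetical helper: the real client may retry on different errors
    or use different delays.
    """
    for delay in backoff_delays(max_retries, base_delay):
        try:
            return operation()
        except IOError:
            sleep(delay)  # wait longer after each consecutive failure
    # Final attempt: any error raised here propagates to the caller.
    return operation()
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.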
Attributes:
| Name | Type | Description |
|---|---|---|
| `timeout` | `int` | Request timeout in seconds |
| `min_interval` | `float` | Minimum time between requests in seconds |
| `jitter` | `float` | Random time variation added to rate limiting |
| `headers` | `Headers` | Default headers to send with each request |
| `last_request_time` | `float` | Timestamp of the last request |
| `user_agent` | `str` | User agent string used for requests |
Example

    >>> from ethicrawl.client.http import HttpClient
    >>> from ethicrawl.core import Resource
    >>> client = HttpClient(rate_limit=1.0)  # 1 request per second
    >>> response = client.get(Resource("https://example.com"))
    >>> print(response.status_code)
    200
    >>> # Switch to Chrome for JavaScript-heavy sites
    >>> chrome_client = client.with_chrome(headless=True)
    >>> js_response = chrome_client.get(Resource("https://spa-example.com"))
__init__(context=None, transport=None, timeout=10, rate_limit=1.0, jitter=0.5, headers=None, chrome_params=None)
Initialize an HTTP client with configurable transport and rate limiting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `Context` | Context for the client. If None, a default context with a dummy URL will be created. | `None` |
| `transport` | `Transport` | Custom transport implementation. If None, either ChromeTransport or RequestsTransport will be used. | `None` |
| `timeout` | `int` | Request timeout in seconds | `10` |
| `rate_limit` | `float` | Maximum requests per second. Set to 0 for no limit. | `1.0` |
| `jitter` | `float` | Random variation (0-1) to add to rate limiting | `0.5` |
| `headers` | `dict` | Default headers to send with each request | `None` |
| `chrome_params` | `dict` | Parameters for ChromeTransport if used | `None` |
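How `rate_limit` and `jitter` turn into actual waiting time is not spelled out here. The sketch below shows one plausible reading, assuming that `min_interval` is `1 / rate_limit` and that `jitter` adds a uniformly random extra delay of up to `jitter` seconds. The `RateLimiter` class and its `clock` and `sleep` parameters are hypothetical and exist only to make the example self-contained.

```python
import random
import time


class RateLimiter:
    """Illustrative rate limiter; not the ethicrawl implementation."""

    def __init__(self, rate_limit=1.0, jitter=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        # Assumption: a rate_limit of 0 disables limiting entirely.
        self.min_interval = 1.0 / rate_limit if rate_limit > 0 else 0.0
        self.jitter = jitter
        self.last_request_time = None
        self._clock = clock
        self._sleep = sleep

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        slept = 0.0
        if self.min_interval > 0 and self.last_request_time is not None:
            elapsed = self._clock() - self.last_request_time
            # Assumption: jitter is a random extra delay in [0, jitter] seconds.
            target = self.min_interval + random.uniform(0, self.jitter)
            if elapsed < target:
                slept = target - elapsed
                self._sleep(slept)
        self.last_request_time = self._clock()
        return slept
```

Under these assumptions, `rate_limit=2.0` with `jitter=0.5` spaces consecutive requests between 0.5 and 1.0 seconds apart.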
get(resource, timeout=None, headers=None)
Make a GET request to the specified resource.
This method applies rate limiting, handles headers, and logs the result. For JavaScript-heavy sites, use with_chrome() first to switch to a Chrome-based transport.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `resource` | `Resource` | The resource to request | required |
| `timeout` | `int` | Request-specific timeout that overrides the client's default timeout | `None` |
| `headers` | `dict` | Additional headers for this request | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `HttpResponse` | `HttpResponse` | Response object with status, headers and content |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If resource is not a Resource instance |
| `IOError` | If the HTTP request fails for any reason |
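Example

    >>> from ethicrawl.client.http import HttpClient
    >>> from ethicrawl.core import Resource
    >>> client = HttpClient()
    >>> response = client.get(Resource("https://example.com"))
    >>> if response.status_code == 200:
    ...     print(f"Got {len(response.content)} bytes")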
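The per-request `timeout` and `headers` arguments combine with the client's defaults. A minimal sketch of that resolution follows, using plain dicts in place of the `Headers` type. The helper names are hypothetical, and the choice that per-request headers win on a name clash is an assumption rather than documented behavior.

```python
def effective_timeout(client_timeout, request_timeout=None):
    """Use the per-request timeout when given, else the client's default."""
    return client_timeout if request_timeout is None else request_timeout


def effective_headers(default_headers, request_headers=None):
    """Merge per-request headers over the client's defaults.

    Assumption: on a name clash the per-request value takes precedence.
    """
    merged = dict(default_headers)          # copy so the defaults stay untouched
    merged.update(request_headers or {})
    return merged
```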
with_chrome(headless=True, wait_time=3, timeout=30, rate_limit=0.5, jitter=0.3)
Create a new HttpClient instance using Chrome/Selenium transport.
This creates a new client that can render JavaScript and interact with dynamic web applications.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `headless` | `bool` | Whether to run Chrome in headless mode | `True` |
| `wait_time` | `int` | Default time to wait for page elements in seconds | `3` |
| `timeout` | `int` | Request timeout in seconds | `30` |
| `rate_limit` | `float` | Maximum requests per second | `0.5` |
| `jitter` | `float` | Random variation factor for rate limiting | `0.3` |
Returns:
| Name | Type | Description |
|---|---|---|
| `HttpClient` | `HttpClient` | A new client instance configured to use Chrome |
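Example

    >>> from ethicrawl.client.http import HttpClient
    >>> from ethicrawl.core import Resource
    >>> client = HttpClient()
    >>> chrome = client.with_chrome(headless=True)
    >>> response = chrome.get(Resource("https://single-page-app.com"))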
HttpRequest (dataclass)
Bases: Request
HTTP-specific request implementation with timeout and header management.
This class extends the base Request with HTTP-specific functionality, including configurable timeout and header handling. It automatically applies default headers from the global configuration while allowing custom headers to take precedence.
Attributes:
| Name | Type | Description |
|---|---|---|
| `url` | `Url` | The target URL (inherited from Request) |
| `headers` | `Headers` | HTTP headers to send with the request |
| `_timeout` | `float` | Request timeout in seconds |
Example

    >>> from ethicrawl.client.http import HttpRequest
    >>> from ethicrawl.core import Url
    >>> req = HttpRequest(Url("https://example.com"))
    >>> req.headers["User-Agent"] = "EthiCrawl/1.0"
    >>> req.timeout = 15.0
timeout (property, writable)
Get the request timeout in seconds.
Returns:
| Type | Description |
|---|---|
| `float` | The timeout value in seconds |
__post_init__()
Initialize and validate the request after creation.
Ensures headers are a proper Headers instance and applies default headers from configuration if not already present.
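The combination of a `__post_init__` that fills in default headers and a writable `timeout` property can be sketched with a plain dataclass. Everything below is illustrative: `SketchRequest`, `DEFAULT_HEADERS`, and `DEFAULT_TIMEOUT` are hypothetical stand-ins, plain dicts replace the `Headers` type, and the validation in the setter is an assumption about what a writable timeout might check.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the library's global configuration.
DEFAULT_HEADERS = {"User-Agent": "ExampleBot/1.0"}
DEFAULT_TIMEOUT = 30.0


@dataclass
class SketchRequest:
    url: str
    headers: dict = field(default_factory=dict)
    _timeout: float = DEFAULT_TIMEOUT

    def __post_init__(self):
        # Copy so the caller's mapping is not mutated, then add defaults
        # only for names that the caller did not set.
        self.headers = dict(self.headers)
        for name, value in DEFAULT_HEADERS.items():
            self.headers.setdefault(name, value)

    @property
    def timeout(self):
        """Request timeout in seconds."""
        return self._timeout

    @timeout.setter
    def timeout(self, value):
        # Assumed validation: a positive number (bool is rejected explicitly).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("timeout must be a number")
        if value <= 0:
            raise ValueError("timeout must be positive")
        self._timeout = float(value)
```

Because `setdefault` only fills missing names, a custom `User-Agent` passed by the caller survives while an absent one falls back to the configured default.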
HttpResponse (dataclass)
Bases: Response
HTTP-specific response implementation with status codes and text content.
This class extends the base Response with HTTP-specific attributes and behaviors, including status code, headers, and separate text content representation. It provides robust validation and a comprehensive string representation for debugging and logging.
The HttpResponse maintains the connection between the original request and the response while enforcing type safety and data consistency.
Attributes:
| Name | Type | Description |
|---|---|---|
| `request` | `HttpRequest` | The request that generated this response |
| `status_code` | `int` | HTTP status code (200, 404, etc.) |
| `headers` | `Headers` | HTTP response headers |
| `content` | `bytes` | Binary content of the response (inherited from Response) |
| `text` | `str` | Text content decoded from binary content (for text responses) |
| `url` | `Url` | The response URL, which may differ from request URL after redirects |
Example

    >>> from ethicrawl.client.http import HttpRequest, HttpResponse
    >>> from ethicrawl.core import Resource, Headers
    >>> req = HttpRequest(Resource("https://example.com"))
    >>> resp = HttpResponse(
    ...     request=req,
    ...     status_code=200,
    ...     content=b"Example",
    ...     text="Example",
    ...     headers=Headers({"Content-Type": "text/html"})
    ... )
    >>> resp.status_code
    200
    >>> "Example" in resp.text
    True
__post_init__()
Validate the response attributes after initialization.
Performs type checking and value validation for:

- Status code (must be an int between 100 and 599)
- Request (must be an HttpRequest instance)
- Content (must be bytes or None)
- Text (must be str or None)

Also calls the parent class `__post_init__` for further validation.
Raises:
| Type | Description |
|---|---|
| `TypeError` | If any attribute has an invalid type |
| `ValueError` | If status_code is outside valid HTTP range (100-599) |
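The validation rules listed for `__post_init__` can be sketched as a standalone dataclass. `SketchResponse` is a hypothetical stand-in: it omits the `request` field and the parent-class call, and rejecting `bool` as a status code is an extra assumption, since `bool` is a subclass of `int` in Python.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchResponse:
    status_code: int
    content: Optional[bytes] = None
    text: Optional[str] = None
    headers: dict = field(default_factory=dict)

    def __post_init__(self):
        # Type check first so the range comparison below is always valid.
        if isinstance(self.status_code, bool) or not isinstance(self.status_code, int):
            raise TypeError("status_code must be an int")
        if not 100 <= self.status_code <= 599:
            raise ValueError("status_code must be between 100 and 599")
        if self.content is not None and not isinstance(self.content, bytes):
            raise TypeError("content must be bytes or None")
        if self.text is not None and not isinstance(self.text, str):
            raise TypeError("text must be str or None")
```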
__str__()
Format a human-readable representation of the response.
Creates a formatted multi-line string containing:

- Status code
- URL (showing both response URL and request URL if they differ)
- Headers
- Content summary (preview for text, byte count for binary)
- Text preview for text content types (up to 300 chars)
Returns:
| Type | Description |
|---|---|
| `str` | String representation of the response with formatted content preview |
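The exact output layout of `__str__` is not shown here. As a rough sketch of the described behavior (a truncated text preview for text content types, a byte count otherwise), the function below uses a hypothetical name, hypothetical line labels, and a simple `Content-Type` prefix check that may differ from how the library detects text content.

```python
def format_response(status_code, url, request_url, headers, content, text,
                    preview_chars=300):
    """Build a multi-line summary of a response (illustrative layout only)."""
    lines = [f"Status: {status_code}"]
    # Show both URLs only when a redirect changed the final location.
    if url == request_url:
        lines.append(f"URL: {url}")
    else:
        lines.append(f"URL: {url} (requested: {request_url})")
    lines.append("Headers:")
    for name, value in headers.items():
        lines.append(f"  {name}: {value}")
    # Assumption: "text/*" content types are the ones that get a text preview.
    is_text = headers.get("Content-Type", "").startswith("text/")
    if is_text and text is not None:
        preview = text[:preview_chars]
        if len(text) > preview_chars:
            preview += "..."
        lines.append(f"Text: {preview}")
    else:
        size = len(content or b"")
        lines.append(f"Content: {size} bytes")
    return "\n".join(lines)
```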