Robots Module
Robots.txt parsing and permission checking
Robot
dataclass
Bases: Resource
Representation of a site's robots.txt file with permission checking.
This class handles fetching, parsing, and enforcing robots.txt rules according to the Robots Exclusion Protocol. It follows standard robots.txt behavior:

- 404 response: allow all URLs (fail open)
- 200 response: parse and enforce rules in the robots.txt file
- Other responses (5xx, etc.): deny all URLs (fail closed)
As a Resource subclass, Robot maintains the URL identity of the robots.txt file while providing methods to check permissions and access sitemap references.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | Url | URL of the robots.txt file (inherited from Resource) |
| context | Context | Context with client for making requests |
Example
```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource, Url
>>> from ethicrawl.robots import Robot
>>> from ethicrawl.client.http import HttpClient
>>> client = HttpClient()
>>> context = Context(Resource("https://example.com"), client)
>>> robot = Robot(Url("https://example.com/robots.txt"), context)
>>> robot.can_fetch("https://example.com/allowed")
True
```
context
property
Get the context associated with this robot.
Returns:
| Type | Description |
|---|---|
| Context | The context object containing client and other settings |
sitemaps
property
Get sitemap URLs referenced in robots.txt.
Returns:
| Type | Description |
|---|---|
| ResourceList | ResourceList containing IndexEntry objects for each sitemap URL |
Example
```python
>>> for sitemap in robot.sitemaps:
...     print(f"Found sitemap: {sitemap.url}")
```
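If plain URL strings are more convenient than IndexEntry objects, a comprehension over the property works. This is a minimal sketch assuming each entry exposes a url attribute (as in the example above) and that str() on a Url yields its text, as the robotify example further below shows.

```python
>>> sitemap_urls = [str(entry.url) for entry in robot.sitemaps]
```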
__post_init__()
Initialize the robot instance and fetch robots.txt.
Fetches the robots.txt file using the provided context's client, then parses it according to response status:

- 404: Create empty ruleset (allow all)
- 200: Parse actual robots.txt content
- Other: Create restrictive ruleset (deny all)
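The status handling above can be pictured with a small sketch built on the standard library's RobotFileParser. This is illustrative only, not ethicrawl's actual implementation; the function name build_ruleset is hypothetical.

```python
from urllib.robotparser import RobotFileParser

def build_ruleset(status_code: int, body: str, robots_url: str) -> RobotFileParser:
    """Map a robots.txt fetch result to a ruleset, mirroring the behavior above."""
    parser = RobotFileParser(robots_url)
    if status_code == 404:
        parser.parse([])  # empty ruleset: every URL is allowed (fail open)
    elif status_code == 200:
        parser.parse(body.splitlines())  # enforce the published rules
    else:
        # 5xx and anything unexpected: restrictive ruleset (fail closed)
        parser.parse(["User-agent: *", "Disallow: /"])
    return parser
```

With this sketch, build_ruleset(500, "", "https://example.com/robots.txt").can_fetch("*", "https://example.com/") returns False, while a 404 result allows everything.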
can_fetch(resource, user_agent=None)
Check if a URL can be fetched according to robots.txt rules.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| resource | Resource \| Url \| str | The URL to check against robots.txt rules | required |
| user_agent | str \| None | Optional user agent string to use for checking. If not provided, uses client's user_agent or config default. | None |
Returns:
| Type | Description |
|---|---|
| bool | True if the URL is allowed by robots.txt |
Raises:
| Type | Description |
|---|---|
| TypeError | If resource is not a string, Url, or Resource |
| RobotDisallowedError | If the URL is disallowed by robots.txt |
Example
```python
>>> if robot.can_fetch("https://example.com/page"):
...     response = client.get(Resource("https://example.com/page"))
```
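Because a disallowed URL raises RobotDisallowedError rather than only returning False, callers may prefer to handle the exception. A minimal sketch follows; the import path for RobotDisallowedError is an assumption, so check the package's error module for its actual location.

```python
# NOTE: assumed import path for the exception; adjust to the actual module.
from ethicrawl.error import RobotDisallowedError

try:
    if robot.can_fetch("https://example.com/private"):
        response = client.get(Resource("https://example.com/private"))
except RobotDisallowedError:
    print("Blocked by robots.txt; skipping this URL")
```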
RobotFactory
Factory for creating Robot instances and robots.txt URLs.
This utility class provides methods for:

- Converting site URLs to robots.txt URLs
- Creating properly initialized Robot instances from a Context
Using this factory ensures consistent Robot creation throughout the application and handles the necessary URL transformations.
robot(context)
staticmethod
Create a Robot instance from a Context.
Extracts the URL from the context's resource, converts it to a robots.txt URL, and creates a new Robot instance bound to that URL and the provided context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The context to use for the Robot | required |
Returns:
| Type | Description |
|---|---|
| Robot | A fully initialized Robot instance |
Raises:
| Type | Description |
|---|---|
| TypeError | If context is not a Context instance |
Example
```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.robots import RobotFactory
>>> ctx = Context(Resource("https://example.com"))
>>> robot = RobotFactory.robot(ctx)
>>> robot.can_fetch("https://example.com/page")
True
```
robotify(url)
staticmethod
Convert a site URL to its corresponding robots.txt URL.
Takes any URL and transforms it to the canonical robots.txt URL by extracting the base URL and appending "robots.txt".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | Url | The site URL to convert | required |
Returns:
| Type | Description |
|---|---|
| Url | A new Url pointing to the site's robots.txt file |
Raises:
| Type | Description |
|---|---|
| TypeError | If url is not a Url instance |
Example
```python
>>> from ethicrawl.core import Url
>>> from ethicrawl.robots import RobotFactory
>>> site = Url("https://example.com/some/page")
>>> robots_url = RobotFactory.robotify(site)
>>> str(robots_url)
'https://example.com/robots.txt'
```