Robots Module
Robots.txt parsing and permission checking
Robot
dataclass
Bases: Resource
Representation of a site's robots.txt file with permission checking.
This class handles fetching, parsing, and enforcing robots.txt rules according to the Robots Exclusion Protocol. It follows standard robots.txt behavior:

- 404 response: allow all URLs (fail open)
- 200 response: parse and enforce rules in the robots.txt file
- Other responses (5xx, etc.): deny all URLs (fail closed)
As a Resource subclass, Robot maintains the URL identity of the robots.txt file while providing methods to check permissions and access sitemap references.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | Url | URL of the robots.txt file (inherited from Resource) |
| context | Context | Context with client for making requests |
Example
```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource, Url
>>> from ethicrawl.robots import Robot
>>> from ethicrawl.client.http import HttpClient
>>> client = HttpClient()
>>> context = Context(Resource("https://example.com"), client)
>>> robot = Robot(Url("https://example.com/robots.txt"), context)
>>> robot.can_fetch("https://example.com/allowed")
True
```
context
property
Get the context associated with this robot.
Returns:
| Type | Description |
|---|---|
| Context | The context object containing client and other settings |
sitemaps
property
Get sitemap URLs referenced in robots.txt.
Returns:
| Type | Description |
|---|---|
| ResourceList | ResourceList containing IndexEntry objects for each sitemap URL |
Example
```python
>>> for sitemap in robot.sitemaps:
...     print(f"Found sitemap: {sitemap.url}")
```
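If plain URL strings are more convenient than IndexEntry objects, a comprehension over the property works. This is a minimal sketch assuming each entry exposes a url attribute (as in the example above) and that str() on a Url yields its text, as the robotify example further below shows.

```python
>>> sitemap_urls = [str(entry.url) for entry in robot.sitemaps]
```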
__post_init__()
Initialize the robot instance and fetch robots.txt.
Fetches the robots.txt file using the provided context's client, then parses it according to response status:

- 404: Create empty ruleset (allow all)
- 200: Parse actual robots.txt content
- Other: Create restrictive ruleset (deny all)
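The status handling above can be pictured with a small sketch built on the standard library's RobotFileParser. This is illustrative only, not ethicrawl's actual implementation; the function name build_ruleset is hypothetical.

```python
from urllib.robotparser import RobotFileParser

def build_ruleset(status_code: int, body: str, robots_url: str) -> RobotFileParser:
    """Map a robots.txt fetch result to a ruleset, mirroring the behavior above."""
    parser = RobotFileParser(robots_url)
    if status_code == 404:
        parser.parse([])  # empty ruleset: every URL is allowed (fail open)
    elif status_code == 200:
        parser.parse(body.splitlines())  # enforce the published rules
    else:
        # 5xx and anything unexpected: restrictive ruleset (fail closed)
        parser.parse(["User-agent: *", "Disallow: /"])
    return parser
```

With this sketch, build_ruleset(500, "", "https://example.com/robots.txt").can_fetch("*", "https://example.com/") returns False, while a 404 result allows everything.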
can_fetch(resource, user_agent=None)
Check if a URL can be fetched according to robots.txt rules.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| resource | Resource \| Url \| str | The URL to check against robots.txt rules | required |
| user_agent | str \| None | Optional user agent string to use for checking. If not provided, uses client's user_agent or config default. | None |
Returns:
| Type | Description |
|---|---|
| bool | True if the URL is allowed by robots.txt |
Raises:
| Type | Description |
|---|---|
| TypeError | If resource is not a string, Url, or Resource |
| RobotDisallowedError | If the URL is disallowed by robots.txt |
Example
```python
>>> if robot.can_fetch("https://example.com/page"):
...     response = client.get(Resource("https://example.com/page"))
```
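Because a disallowed URL raises RobotDisallowedError rather than only returning False, callers may prefer to handle the exception. A minimal sketch follows; the import path for RobotDisallowedError is an assumption, so check the package's error module for its actual location.

```python
# NOTE: assumed import path for the exception; adjust to the actual module.
from ethicrawl.error import RobotDisallowedError

try:
    if robot.can_fetch("https://example.com/private"):
        response = client.get(Resource("https://example.com/private"))
except RobotDisallowedError:
    print("Blocked by robots.txt; skipping this URL")
```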
RobotFactory
Factory for creating Robot instances and robots.txt URLs.
This utility class provides methods for:

- Converting site URLs to robots.txt URLs
- Creating properly initialized Robot instances from a Context
Using this factory ensures consistent Robot creation throughout the application and handles the necessary URL transformations.
robot(context)
staticmethod
Create a Robot instance from a Context.
Extracts the URL from the context's resource, converts it to a robots.txt URL, and creates a new Robot instance bound to that URL and the provided context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The context to use for the Robot | required |
Returns:
| Type | Description |
|---|---|
| Robot | A fully initialized Robot instance |
Raises:
| Type | Description |
|---|---|
| TypeError | If context is not a Context instance |
Example
```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.robots import RobotFactory
>>> ctx = Context(Resource("https://example.com"))
>>> robot = RobotFactory.robot(ctx)
>>> robot.can_fetch("https://example.com/page")
True
```
robotify(url)
staticmethod
Convert a site URL to its corresponding robots.txt URL.
Takes any URL and transforms it to the canonical robots.txt URL by extracting the base URL and appending "robots.txt".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | Url | The site URL to convert | required |
Returns:
| Type | Description |
|---|---|
| Url | A new Url pointing to the site's robots.txt file |
Raises:
| Type | Description |
|---|---|
| TypeError | If url is not a Url instance |
Example
```python
>>> from ethicrawl.core import Url
>>> from ethicrawl.robots import RobotFactory
>>> site = Url("https://example.com/some/page")
>>> robots_url = RobotFactory.robotify(site)
>>> str(robots_url)
'https://example.com/robots.txt'
```