
Robots Module

Robots.txt parsing and permission checking

Robot dataclass

Bases: Resource

Representation of a site's robots.txt file with permission checking.

This class handles fetching, parsing, and enforcing robots.txt rules according to the Robots Exclusion Protocol. It follows standard robots.txt behavior:

- 404 response: allow all URLs (fail open)
- 200 response: parse and enforce rules in the robots.txt file
- Other responses (5xx, etc.): deny all URLs (fail closed)

As a Resource subclass, Robot maintains the URL identity of the robots.txt file while providing methods to check permissions and access sitemap references.

Attributes:

url (Url): URL of the robots.txt file (inherited from Resource)
context (Context): Context with client for making requests

Example

>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource, Url
>>> from ethicrawl.robots import Robot
>>> from ethicrawl.client.http import HttpClient
>>> client = HttpClient()
>>> context = Context(Resource("https://example.com"), client)
>>> robot = Robot(Url("https://example.com/robots.txt"), context)
>>> robot.can_fetch("https://example.com/allowed")
True

context property

Get the context associated with this robot.

Returns:

Context: The context object containing client and other settings

sitemaps property

Get sitemap URLs referenced in robots.txt.

Returns:

ResourceList: ResourceList containing IndexEntry objects for each sitemap URL

Example

>>> for sitemap in robot.sitemaps:
...     print(f"Found sitemap: {sitemap.url}")
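
A related pattern, building on the robot created in the class-level example above, collects the referenced sitemap locations as plain strings. It is a small sketch that relies only on the url attribute documented for each IndexEntry:

# Sketch: gather sitemap locations as strings, using only the documented
# `url` attribute on the entries returned by `robot.sitemaps`.
sitemap_urls = [str(entry.url) for entry in robot.sitemaps]
print(sitemap_urls)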

__post_init__()

Initialize the robot instance and fetch robots.txt.

Fetches the robots.txt file using the provided context's client, then parses it according to response status:

- 404: Create empty ruleset (allow all)
- 200: Parse actual robots.txt content
- Other: Create restrictive ruleset (deny all)
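
As an illustration of this policy only, and not of ethicrawl's internal implementation, the same 404/200/other branching can be sketched with the standard library's urllib.robotparser:

# Standard-library sketch of the fail-open / fail-closed policy described
# above; not ethicrawl's code, just the same branching on response status.
from urllib.error import HTTPError
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

def load_rules(robots_url: str) -> RobotFileParser:
    parser = RobotFileParser(robots_url)
    try:
        with urlopen(robots_url) as response:
            body = response.read().decode("utf-8", errors="replace")
        parser.parse(body.splitlines())      # 200: enforce the published rules
    except HTTPError as error:
        if error.code == 404:
            parser.allow_all = True          # 404: empty ruleset, allow all
        else:
            parser.disallow_all = True       # 5xx etc.: deny all
    return parser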

can_fetch(resource, user_agent=None)

Check if a URL can be fetched according to robots.txt rules.

Parameters:

resource (Resource | Url | str): The URL to check against robots.txt rules. Required.
user_agent (str | None): Optional user agent string to use for checking. If not provided, uses client's user_agent or config default. Defaults to None.

Returns:

bool: True if the URL is allowed by robots.txt

Raises:

TypeError: If resource is not a string, Url, or Resource
RobotDisallowedError: If the URL is disallowed by robots.txt

Example

>>> if robot.can_fetch("https://example.com/page"):
...     response = client.get(Resource("https://example.com/page"))
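
Because a disallowed URL is reported by raising RobotDisallowedError rather than by returning False, callers may prefer an explicit guard. The import path for the exception is not shown on this page, so the module used below is an assumption; adjust it to your installation:

# Hedged sketch: the module path for RobotDisallowedError is assumed, not
# confirmed by this page; adjust the import to match your installation.
from ethicrawl.core import Resource
from ethicrawl.error import RobotDisallowedError  # assumed location

def fetch_if_allowed(robot, client, url: str):
    """Return the response when robots.txt permits the URL, else None."""
    try:
        robot.can_fetch(url)          # returns True or raises
    except RobotDisallowedError:
        return None                   # disallowed by robots.txt
    return client.get(Resource(url))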

RobotFactory

Factory for creating Robot instances and robots.txt URLs.

This utility class provides methods for:

- Converting site URLs to robots.txt URLs
- Creating properly initialized Robot instances from a Context

Using this factory ensures consistent Robot creation throughout the application and handles the necessary URL transformations.

robot(context) staticmethod

Create a Robot instance from a Context.

Extracts the URL from the context's resource, converts it to a robots.txt URL, and creates a new Robot instance bound to that URL and the provided context.

Parameters:

context (Context): The context to use for the Robot. Required.

Returns:

Robot: A fully initialized Robot instance

Raises:

TypeError: If context is not a Context instance

Example

>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.robots import RobotFactory
>>> ctx = Context(Resource("https://example.com"))
>>> robot = RobotFactory.robot(ctx)
>>> robot.can_fetch("https://example.com/page")
True
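
Per the description above, the factory roughly combines two documented steps: derive the robots.txt URL with robotify() and bind a Robot to it and the context. A sketch of that equivalence, assuming the ctx from the example above and a known site URL (this is not the library's actual code path):

# Sketch of the equivalence described above, not the factory's implementation.
from ethicrawl.core import Url
from ethicrawl.robots import Robot, RobotFactory

robots_url = RobotFactory.robotify(Url("https://example.com"))
robot = Robot(robots_url, ctx)  # ctx as created in the example above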

robotify(url) staticmethod

Convert a site URL to its corresponding robots.txt URL.

Takes any URL and transforms it to the canonical robots.txt URL by extracting the base URL and appending "robots.txt".

Parameters:

url (Url): The site URL to convert. Required.

Returns:

Url: A new Url pointing to the site's robots.txt file

Raises:

TypeError: If url is not a Url instance

Example

>>> from ethicrawl.core import Url
>>> from ethicrawl.robots import RobotFactory
>>> site = Url("https://example.com/some/page")
>>> robots_url = RobotFactory.robotify(site)
>>> str(robots_url)
'https://example.com/robots.txt'
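
For reference, a rough standard-library equivalent of this transformation (ethicrawl's own Url type handles it internally; this is only an illustration):

# Sketch of the same base-URL + "robots.txt" transformation with urllib.parse;
# not the library's code.
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(site_url: str) -> str:
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url_for("https://example.com/some/page"))
# https://example.com/robots.txt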