Skip to content

Ethicrawl

python license pytest codecov security PyPI docs

Ethicrawl is a Python library for ethical, professional-grade web crawling. It automatically respects robots.txt, enforces rate limits, and offers robust sitemap parsing and domain control—making it easy to build reliable and responsible crawlers.

Project Goals

Ethicrawl is built on the principle that web crawling should be:

  • Ethical by Design: Automatically respects robots.txt and rate limits, ensuring responsible web crawling.
  • Server-Safe: Prevents accidental overloading with built-in safeguards.
  • Feature-Rich: Includes robust sitemap parsing, domain control, and flexible configuration.
  • Extensible & Customizable: Easily adapts to diverse crawling needs through flexible settings and clean architecture.

Key Features

  • Robots.txt Compliance: Automatic parsing and enforcement of robots.txt rules
  • Rate Limiting: Built-in, configurable request rate management
  • Sitemap Support: Parse and filter XML sitemaps to discover content
  • Domain Control: Explicit whitelisting for cross-domain access
  • Flexible Configuration: Easily configure all aspects of crawling behavior

Documentation

Comprehensive documentation is available at https://ethicrawl.github.io/ethicrawl/

Installation

Install the latest version from PyPI:

pip install ethicrawl

For development:

# Clone the repository
git clone https://github.com/ethicrawl/ethicrawl.git

# Navigate to the directory
cd ethicrawl

# Create and activate a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install in development mode
pip install -e .

Quick Start

from ethicrawl import Ethicrawl
from ethicrawl.error import RobotDisallowedError

# Create and bind to a domain
ethicrawl = Ethicrawl()
ethicrawl.bind("https://example.com")

# Get a page - robots.txt rules automatically respected
try:
    response = ethicrawl.get("https://example.com/page.html")
except RobotDisallowedError:
    print("The site prohibits fetching the page")

# Release resources when done
ethicrawl.unbind()

License

Apache 2.0 License - See LICENSE file for details.