Sitemaps Module

XML sitemap parsing and traversal for discovering website structure.

IndexDocument

Bases: SitemapDocument

Specialized parser for sitemap index documents.

This class extends the SitemapDocument class to handle sitemap indexes, which are XML documents containing references to other sitemap files. It validates that the document is a proper sitemap index and extracts all sitemap references as IndexEntry objects.

IndexDocument enforces type safety for its entries collection, ensuring that only IndexEntry objects can be added.
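
One plausible way to enforce this kind of check is a validating property setter. The following is a minimal sketch, not ethicrawl's actual code: `IndexEntrySketch` and `IndexDocumentSketch` are hypothetical stand-ins for IndexEntry and IndexDocument, and a plain list stands in for ResourceList.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntrySketch:
    """Hypothetical stand-in for IndexEntry."""
    url: str
    lastmod: Optional[str] = None


class IndexDocumentSketch:
    """Hypothetical stand-in showing a type-checked, writable entries property."""

    def __init__(self):
        self._entries = []

    @property
    def entries(self):
        return self._entries

    @entries.setter
    def entries(self, value):
        items = list(value)
        # Reject the whole assignment if any item has the wrong type.
        for item in items:
            if not isinstance(item, IndexEntrySketch):
                raise TypeError(
                    f"expected IndexEntrySketch, got {type(item).__name__}"
                )
        self._entries = items


doc = IndexDocumentSketch()
doc.entries = [IndexEntrySketch("https://example.com/sitemap1.xml")]

try:
    doc.entries = ["https://example.com/sitemap2.xml"]  # plain str, not an entry
except TypeError as exc:
    rejection = str(exc)
```

This sketch only checks on assignment; a real typed collection would also need to guard in-place mutation such as `append`.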

Attributes:

Name Type Description
entries ResourceList

ResourceList of IndexEntry objects representing sitemap references

Example

>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import IndexDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <sitemap>
...     <loc>https://example.com/sitemap1.xml</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...   </sitemap>
... </sitemapindex>'''
>>> index = IndexDocument(context, sitemap_xml)
>>> len(index.entries)
1
>>> str(index.entries[0])
'https://example.com/sitemap1.xml (last modified: 2023-06-15T14:30:00Z)'

entries property writable

Get the sitemaps in this index.

Returns:

Type Description
ResourceList

ResourceList of IndexEntry objects representing sitemap references

__init__(context, document=None)

Initialize a sitemap index document parser.

Parameters:

Name Type Description Default
context Context

Context for logging and resource resolution

required
document str | None

Optional XML sitemap index content to parse

None

Raises:

Type Description
ValueError

If the document is not a valid sitemap index

SitemapError

If the document cannot be parsed
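
The validate-then-extract behaviour described above can be approximated with the standard library. This is a sketch under assumptions: `extract_index_entries` is a hypothetical helper, it uses `xml.etree.ElementTree` rather than whatever parser ethicrawl uses internally, and it returns plain tuples instead of IndexEntry objects.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

INDEX_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="{SITEMAP_NS}">
  <sitemap>
    <loc>https://example.com/sitemap1.xml</loc>
    <lastmod>2023-06-15T14:30:00Z</lastmod>
  </sitemap>
</sitemapindex>"""


def extract_index_entries(xml_text):
    """Return (loc, lastmod) tuples from a sitemap index (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Refuse anything whose root is not <sitemapindex> in the sitemap namespace.
    if root.tag != f"{{{SITEMAP_NS}}}sitemapindex":
        raise ValueError("document is not a sitemap index")
    ns = {"sm": SITEMAP_NS}
    entries = []
    for node in root.findall("sm:sitemap", ns):
        loc = node.findtext("sm:loc", default="", namespaces=ns).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=ns)
        if loc:  # skip <sitemap> elements without a location
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


index_entries = extract_index_entries(INDEX_XML)
```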

IndexEntry dataclass

Bases: SitemapEntry

Represents an entry in a sitemap index file.

IndexEntry specializes SitemapEntry for use in sitemap index files. Sitemap indexes are XML files that contain references to other sitemap files, allowing websites to organize their sitemaps hierarchically.

IndexEntry maintains the same attributes as SitemapEntry (url and lastmod) but provides specialized string representation appropriate for index entries.

Attributes:

Name Type Description
url Url

URL of the sitemap file (inherited from Resource)

lastmod str | None

Last modification date of the sitemap (inherited from SitemapEntry)

Example

>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import IndexEntry
>>> index = IndexEntry(
...     Url("https://example.com/sitemap-products.xml"),
...     lastmod="2023-06-15T14:30:00Z"
... )
>>> str(index)
'https://example.com/sitemap-products.xml (last modified: 2023-06-15T14:30:00Z)'
>>> repr(index)
"SitemapIndexEntry(url='https://example.com/sitemap-products.xml', lastmod='2023-06-15T14:30:00Z')"

__repr__()

Detailed representation for debugging.

Returns:

Type Description
str

String representation showing class name and field values

SitemapDocument

Parser and representation of XML sitemap documents.

This class handles the parsing, validation, and extraction of entries from XML sitemap documents, supporting both sitemap index files and urlset files. It implements security best practices for XML parsing to prevent common vulnerabilities like XXE attacks.

SitemapDocument distinguishes between sitemap indexes (which contain references to other sitemaps) and urlsets (which contain actual page URLs), extracting the appropriate entries in each case.

Attributes:

Name Type Description
SITEMAP_NS

The official sitemap namespace URI

entries ResourceList

ResourceList containing the parsed sitemap entries

type str

The type of sitemap (sitemapindex, urlset, or unsupported)

Example

>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import SitemapDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <url>
...     <loc>https://example.com/page1</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...   </url>
... </urlset>'''
>>> sitemap = SitemapDocument(context, sitemap_xml)
>>> sitemap.type
'urlset'
>>> len(sitemap.entries)
1

entries property

Get the list of entries extracted from the sitemap.

Returns:

Type Description
ResourceList

ResourceList containing SitemapEntry objects (either IndexEntry or UrlsetEntry, depending on the sitemap type)

type property

Get the type of sitemap document.

Determines the type based on the root element's local name.

Returns:

Type Description
str

String indicating the sitemap type:

  • 'sitemapindex': A sitemap index containing references to other sitemaps
  • 'urlset': A sitemap containing page URLs
  • 'unsupported': Any other type of document

Raises:

Type Description
SitemapError

If no document has been loaded
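
Classifying a document by the local name of its root element can be sketched as follows. This is an illustration with the standard library's `xml.etree.ElementTree`; `sitemap_type` is a hypothetical function, not the property's real implementation.

```python
import xml.etree.ElementTree as ET


def sitemap_type(xml_text):
    """Classify a document by its root element's local name (hypothetical)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ElementTree spells namespaced tags as "{namespace}local";
    # keep only the part after the closing brace.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name in ("sitemapindex", "urlset"):
        return local_name
    return "unsupported"


NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
kinds = [
    sitemap_type(f'<urlset xmlns="{NS}"></urlset>'),
    sitemap_type(f'<sitemapindex xmlns="{NS}"></sitemapindex>'),
    sitemap_type("<html></html>"),
]
```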

__init__(context, document=None)

Initialize a sitemap document parser with security protections.

Sets up the XML parser with security features to prevent common XML vulnerabilities and optionally parses a provided document.

Parameters:

Name Type Description Default
context Context

Context for logging and resource resolution

required
document str | None

Optional XML sitemap content to parse immediately

None

Raises:

Type Description
SitemapError

If the provided document cannot be parsed
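
The source does not say which protections the parser applies. One common hardening step against XXE is to refuse any document that declares a DOCTYPE, since entity definitions live there. The sketch below shows that single technique with the standard library; `parse_without_doctype` is a hypothetical name, and this is an assumption about the general approach rather than a description of ethicrawl's parser.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


def parse_without_doctype(xml_text):
    """Parse XML text, refusing any document that declares a DOCTYPE."""

    def forbid_doctype(name, system_id, public_id, has_internal_subset):
        raise ValueError("DOCTYPE declarations are not allowed in sitemaps")

    data = xml_text.encode("utf-8")

    # First pass: a bare expat parser whose only job is to trip on a DOCTYPE.
    probe = expat.ParserCreate()
    probe.StartDoctypeDeclHandler = forbid_doctype
    probe.Parse(data, True)

    # Second pass: build the element tree from the vetted input.
    return ET.fromstring(data)
```

Parsing twice is wasteful for large sitemaps; it keeps the sketch short and the check easy to see.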

SitemapEntry dataclass

Bases: Resource

Base class for entries found in XML sitemaps.

SitemapEntry extends Resource to represent entries from XML sitemaps with their additional metadata. It maintains the URL identity pattern while adding sitemap-specific attributes like the last modification date.

This class handles validation of W3C datetime formats and provides appropriate string representation of sitemap entries.

Attributes:

Name Type Description
url Url

URL of the sitemap entry (inherited from Resource)

lastmod str | None

Last modification date string in W3C format (optional)

Example

>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import SitemapEntry
>>> entry = SitemapEntry(
...     Url("https://example.com/page1"),
...     lastmod="2023-06-15T14:30:00Z"
... )
>>> str(entry)
'https://example.com/page1 (last modified: 2023-06-15T14:30:00Z)'

__post_init__()

Validate fields after initialization.

Validates the lastmod date format if provided, ensuring it conforms to one of the accepted W3C datetime formats.

Raises:

Type Description
ValueError

If lastmod format is invalid

TypeError

If lastmod is not a string
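
The exact set of accepted formats is not listed here. As a rough illustration, the sketch below checks a value against the shapes defined in the W3C Date and Time Formats note (year, year-month, full date, and date-time with a time zone designator). `validate_lastmod` is a hypothetical helper, and it checks shape only: it would not catch an out-of-range month or day.

```python
import re

# Shapes from the W3C Date and Time Formats note:
# YYYY, YYYY-MM, YYYY-MM-DD, and YYYY-MM-DDThh:mm[:ss[.s]]TZD
W3C_DATETIME = re.compile(
    r"\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
    r")?"
    r")?"
)


def validate_lastmod(value):
    """Return value unchanged if it looks like a W3C datetime (hypothetical)."""
    if not isinstance(value, str):
        raise TypeError("lastmod must be a string")
    if not W3C_DATETIME.fullmatch(value):
        raise ValueError(f"invalid lastmod format: {value!r}")
    return value
```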

__str__()

Human-readable string representation of the sitemap entry.

Returns:

Type Description
str

URL string with last modification date if available

SitemapParser

Recursive parser for extracting URLs from sitemap documents.

This class handles the traversal of sitemap structures, including nested sitemap indexes, to extract all page URLs. It implements:

  • Recursive traversal of sitemap indexes
  • Depth limiting to prevent excessive recursion
  • Cycle detection to prevent infinite loops
  • URL deduplication
  • Multiple input formats (IndexDocument, ResourceList, etc.)
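
The traversal features listed above can be illustrated without any network access. The following is a minimal sketch, not the library's algorithm: `collect_page_urls` is hypothetical, the `SITEMAPS` dictionary stands in for fetching and parsing, and the default depth limit of 5 is an arbitrary choice.

```python
def collect_page_urls(roots, sitemaps, max_depth=5):
    """Walk sitemap indexes depth-first and return unique page URLs in order.

    `sitemaps` maps a sitemap URL to (kind, children), standing in for a fetch.
    """
    visited = set()     # sitemap URLs already processed (cycle detection)
    seen_pages = set()  # page URLs already collected (deduplication)
    pages = []

    def walk(sitemap_url, depth):
        if depth > max_depth or sitemap_url in visited:
            return
        visited.add(sitemap_url)
        # Unknown sitemaps are treated as empty urlsets in this sketch.
        kind, children = sitemaps.get(sitemap_url, ("urlset", []))
        for child in children:
            if kind == "sitemapindex":
                walk(child, depth + 1)
            elif child not in seen_pages:
                seen_pages.add(child)
                pages.append(child)

    for root in roots:
        walk(root, 1)
    return pages


SITEMAPS = {
    "https://example.com/sitemap.xml": ("sitemapindex", [
        "https://example.com/sitemap-a.xml",
        "https://example.com/sitemap-b.xml",
        "https://example.com/sitemap.xml",  # cycle back to the root
    ]),
    "https://example.com/sitemap-a.xml": ("urlset", [
        "https://example.com/page1",
        "https://example.com/page2",
    ]),
    "https://example.com/sitemap-b.xml": ("urlset", [
        "https://example.com/page2",  # duplicate of an entry in sitemap-a
        "https://example.com/page3",
    ]),
}

urls = collect_page_urls(["https://example.com/sitemap.xml"], SITEMAPS)
```

With this data the self-reference is skipped by the `visited` set and `page2` is collected only once.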

Attributes:

Name Type Description
context

Context with client for fetching sitemaps and logging

Example

>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource, Url
>>> from ethicrawl.sitemaps import SitemapParser
>>> context = Context(Resource(Url("https://example.com")))
>>> parser = SitemapParser(context)
>>> # Parse from a single sitemap URL
>>> urls = parser.parse([Resource(Url("https://example.com/sitemap.xml"))])
>>> print(f"Found {len(urls)} URLs in sitemap")

__init__(context)

Initialize the sitemap parser.

Parameters:

Name Type Description Default
context Context

Context with client for fetching sitemaps and logging

required

parse(root=None)

Parse sitemap(s) and extract all contained URLs.

This is the main entry point for sitemap parsing. It accepts various input formats and recursively extracts all URLs from the sitemap(s).

Parameters:

Name Type Description Default
root IndexDocument | ResourceList | list[Resource] | None

Source to parse, which can be:

  • IndexDocument: Pre-parsed sitemap index
  • ResourceList: List of resources to fetch as sitemaps
  • list[Resource]: List of resources to fetch as sitemaps
  • None: Use the context's base URL for robots.txt discovery

None

Returns:

Type Description
ResourceList

ResourceList containing all page URLs found in the sitemap(s)

Raises:

Type Description
SitemapError

If a sitemap cannot be fetched or parsed

UrlsetDocument

Bases: SitemapDocument

Specialized parser for sitemap urlset documents.

This class extends SitemapDocument to handle urlset sitemaps, which contain page URLs with metadata like change frequency and priority. It validates that the document is a proper urlset and extracts all URL references as UrlsetEntry objects.

UrlsetDocument supports only the core sitemap protocol specification elements (loc, lastmod, changefreq, priority) and does not handle any sitemap protocol extensions.

Attributes:

Name Type Description
entries ResourceList

ResourceList of UrlsetEntry objects representing page URLs

Example

>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import UrlsetDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <url>
...     <loc>https://example.com/page1</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...     <changefreq>weekly</changefreq>
...     <priority>0.8</priority>
...   </url>
... </urlset>'''
>>> urlset = UrlsetDocument(context, sitemap_xml)
>>> len(urlset.entries)
1
>>> entry = urlset.entries[0]
>>> entry.priority
'0.8'

entries property writable

Get the URLs in this urlset.

Returns:

Type Description
ResourceList

ResourceList of UrlsetEntry objects representing page URLs

__init__(context, document=None)

Initialize a urlset sitemap document parser.

Parameters:

Name Type Description Default
context Context

Context for logging and resource resolution

required
document str | None

Optional XML urlset content to parse

None

Raises:

Type Description
ValueError

If the document is not a valid urlset

SitemapError

If the document cannot be parsed
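
Reading only the core protocol elements while ignoring extensions can be sketched like this. It is an illustration under assumptions: `extract_urlset_entries` is a hypothetical helper built on `xml.etree.ElementTree`, it returns plain dictionaries rather than UrlsetEntry objects, and the image extension in the sample is there only to show that foreign-namespace elements are skipped.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CORE_FIELDS = ("loc", "lastmod", "changefreq", "priority")

URLSET_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="{SITEMAP_NS}"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2023-06-15T14:30:00Z</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
    <image:image>
      <image:loc>https://example.com/page1.png</image:loc>
    </image:image>
  </url>
</urlset>"""


def extract_urlset_entries(xml_text):
    """Return one dict of core fields per <url> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        raise ValueError("document is not a urlset")
    entries = []
    for url_node in root.findall(f"{{{SITEMAP_NS}}}url"):
        entry = {}
        for field in CORE_FIELDS:
            # Only direct children in the sitemap namespace are considered,
            # so extension elements such as image:image are never read.
            text = url_node.findtext(f"{{{SITEMAP_NS}}}{field}")
            entry[field] = text.strip() if text else None
        if entry["loc"]:
            entries.append(entry)
    return entries


page_entries = extract_urlset_entries(URLSET_XML)
```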

UrlsetEntry dataclass

Bases: SitemapEntry

Represents an entry in a sitemap urlset file.

UrlsetEntry specializes SitemapEntry for standard sitemap URL entries that contain page URLs with metadata. These entries represent actual content pages on a website, as opposed to index entries that point to other sitemap files.

In addition to the URL and lastmod attributes inherited from SitemapEntry, UrlsetEntry adds support for:

  • changefreq: How frequently the page is likely to change
  • priority: Relative importance of this URL (0.0-1.0)

All attributes are validated during initialization to ensure they conform to the sitemap protocol specification.

Attributes:

Name Type Description
url Url

URL of the page (inherited from Resource)

lastmod str | None

Last modification date (inherited from SitemapEntry)

changefreq str | None

Update frequency (always, hourly, daily, weekly, monthly, yearly, or never)

priority float | str | None

Relative importance value from 0.0 to 1.0

Example

>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import UrlsetEntry
>>> entry = UrlsetEntry(
...     Url("https://example.com/page1"),
...     lastmod="2023-06-15T14:30:00Z",
...     changefreq="weekly",
...     priority=0.8
... )
>>> str(entry)
'https://example.com/page1 | last modified: 2023-06-15T14:30:00Z | frequency: weekly | priority: 0.8'

__post_init__()

Validate fields after initialization.

Calls the parent class validation, then validates and normalizes the changefreq and priority attributes if they are provided.

Raises:

Type Description
ValueError

If any field contains invalid values

TypeError

If any field has an incorrect type
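
The checks described above might look roughly like this. The allowed changefreq values and the 0.0 to 1.0 priority range come from the sitemap protocol; everything else is an assumption for illustration, including the helper names `normalize_changefreq` and `normalize_priority`, the lowercasing, and the conversion of priority to a float.

```python
VALID_CHANGEFREQ = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def normalize_changefreq(value):
    """Lowercase and validate a changefreq value (hypothetical helper)."""
    if not isinstance(value, str):
        raise TypeError("changefreq must be a string")
    cleaned = value.strip().lower()
    if cleaned not in VALID_CHANGEFREQ:
        raise ValueError(f"invalid changefreq: {value!r}")
    return cleaned


def normalize_priority(value):
    """Convert priority to a float in [0.0, 1.0] (hypothetical helper)."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        raise TypeError("priority must be a number or a numeric string")
    try:
        number = float(value)
    except ValueError:
        raise ValueError(f"priority is not numeric: {value!r}") from None
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"priority must be between 0.0 and 1.0: {value!r}")
    return number
```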

__repr__()

Detailed representation for debugging.

Returns:

Type Description
str

String representation showing class name and all field values

__str__()

Human-readable string representation.

Creates a pipe-separated string containing the URL and any available metadata (lastmod, changefreq, priority).

Returns:

Type Description
str

Formatted string with URL and metadata
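
A pipe-separated format matching the UrlsetEntry example output can be produced along these lines. `format_urlset_entry` is a hypothetical free function used for illustration; the labels are taken from the example string in this section, and omitting absent fields is an assumption based on the phrase "any available metadata".

```python
def format_urlset_entry(url, lastmod=None, changefreq=None, priority=None):
    """Join the URL and any metadata that is present with ' | ' (hypothetical)."""
    parts = [url]
    if lastmod is not None:
        parts.append(f"last modified: {lastmod}")
    if changefreq is not None:
        parts.append(f"frequency: {changefreq}")
    if priority is not None:
        parts.append(f"priority: {priority}")
    return " | ".join(parts)


full = format_urlset_entry(
    "https://example.com/page1",
    lastmod="2023-06-15T14:30:00Z",
    changefreq="weekly",
    priority=0.8,
)
bare = format_urlset_entry("https://example.com/page1")
```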