Sitemaps Module

XML sitemap parsing and traversal for discovering website structure.

`IndexDocument`

Bases: SitemapDocument

Specialized parser for sitemap index documents.

This class extends the SitemapDocument class to handle sitemap indexes, which are XML documents containing references to other sitemap files. It validates that the document is a proper sitemap index and extracts all sitemap references as IndexEntry objects.

IndexDocument enforces type safety for its entries collection, ensuring that only IndexEntry objects can be added.

Attributes:

Name	Type	Description
`entries`	`ResourceList`	ResourceList of IndexEntry objects representing sitemap references

Example

```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import IndexDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <sitemap>
...     <loc>https://example.com/sitemap1.xml</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...   </sitemap>
... </sitemapindex>'''
>>> index = IndexDocument(context, sitemap_xml)
>>> len(index.entries)
1
>>> str(index.entries[0])
'https://example.com/sitemap1.xml (last modified: 2023-06-15T14:30:00Z)'
```

`entries` `property` `writable`

Get the sitemaps in this index.

Returns:

Type	Description
`ResourceList`	ResourceList of IndexEntry objects representing sitemap references
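
The type enforcement on a writable `entries` property could look roughly like the following. This is a minimal standalone sketch, not ethicrawl's implementation: `Entry`, `IndexEntryLike`, and `IndexLike` are hypothetical stand-ins, a plain list replaces ResourceList, and raising `TypeError` on a mismatch is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical stand-in for SitemapEntry."""
    url: str


@dataclass
class IndexEntryLike(Entry):
    """Hypothetical stand-in for IndexEntry."""
    lastmod: Optional[str] = None


class IndexLike:
    """Hypothetical container with a type-checked, writable `entries` property."""

    def __init__(self) -> None:
        self._entries: List[IndexEntryLike] = []

    @property
    def entries(self) -> List[IndexEntryLike]:
        return self._entries

    @entries.setter
    def entries(self, values: List[IndexEntryLike]) -> None:
        # Validate every item first so a bad assignment leaves the old list intact.
        for value in values:
            if not isinstance(value, IndexEntryLike):
                raise TypeError(f"expected IndexEntryLike, got {type(value).__name__}")
        self._entries = list(values)
```

Validating before assigning means a rejected assignment never leaves the collection half-updated.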

`__init__(context, document=None)`

Initialize a sitemap index document parser.

Parameters:

Name	Type	Description	Default
`context`	`Context`	Context for logging and resource resolution	required
`document`	`str \| None`	Optional XML sitemap index content to parse	`None`

Raises:

Type	Description
`ValueError`	If the document is not a valid sitemap index
`SitemapError`	If the document cannot be parsed
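
As a rough illustration of what validating and reading a sitemap index involves, the sketch below uses only the standard library. `parse_index` is a hypothetical helper, not part of ethicrawl, and the library's own parser may behave differently.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# The official sitemap protocol namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def parse_index(xml_text: str) -> List[Tuple[str, Optional[str]]]:
    """Hypothetical helper: return (loc, lastmod) pairs from a sitemap index."""
    # Parse from bytes so the XML encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != f"{{{SITEMAP_NS}}}sitemapindex":
        raise ValueError("document is not a sitemap index")

    ns = {"sm": SITEMAP_NS}
    results: List[Tuple[str, Optional[str]]] = []
    for node in root.findall("sm:sitemap", ns):
        loc = node.findtext("sm:loc", default="", namespaces=ns).strip()
        lastmod = node.findtext("sm:lastmod", namespaces=ns)
        if loc:  # skip <sitemap> elements without a usable <loc>
            results.append((loc, lastmod.strip() if lastmod else None))
    return results
```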

`IndexEntry` `dataclass`

Bases: SitemapEntry

Represents an entry in a sitemap index file.

IndexEntry specializes SitemapEntry for use in sitemap index files. Sitemap indexes are XML files that contain references to other sitemap files, allowing websites to organize their sitemaps hierarchically.

IndexEntry keeps the same attributes as SitemapEntry (url and lastmod) but provides a specialized string representation appropriate for index entries.

Attributes:

Name	Type	Description
`url`	`Url`	URL of the sitemap file (inherited from Resource)
`lastmod`	`str \| None`	Last modification date of the sitemap (inherited from SitemapEntry)

Example

```python
>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import IndexEntry
>>> index = IndexEntry(
...     Url("https://example.com/sitemap-products.xml"),
...     lastmod="2023-06-15T14:30:00Z"
... )
>>> str(index)
'https://example.com/sitemap-products.xml (last modified: 2023-06-15T14:30:00Z)'
>>> repr(index)
"SitemapIndexEntry(url='https://example.com/sitemap-products.xml', lastmod='2023-06-15T14:30:00Z')"
```

`__repr__()`

Detailed representation for debugging.

Returns:

Type	Description
`str`	String representation showing class name and field values

`SitemapDocument`

Parser and representation of XML sitemap documents.

This class handles the parsing, validation, and extraction of entries from XML sitemap documents, supporting both sitemap index files and urlset files. It implements security best practices for XML parsing to prevent common vulnerabilities like XXE attacks.

SitemapDocument distinguishes between sitemap indexes (which contain references to other sitemaps) and urlsets (which contain actual page URLs), extracting the appropriate entries in each case.

Attributes:

Name	Type	Description
`SITEMAP_NS`		The official sitemap namespace URI
`entries`	`ResourceList`	ResourceList containing the parsed sitemap entries
`type`	`str`	The type of sitemap (sitemapindex, urlset, or unsupported)

Example

```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import SitemapDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <url>
...     <loc>https://example.com/page1</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...   </url>
... </urlset>'''
>>> sitemap = SitemapDocument(context, sitemap_xml)
>>> sitemap.type
'urlset'
>>> len(sitemap.entries)
1
```

`entries` `property`

Get the list of entries extracted from the sitemap.

Returns:

Type	Description
`ResourceList`	ResourceList containing SitemapEntry objects (either IndexEntry or UrlsetEntry depending on the sitemap type)

`type` `property`

Get the type of sitemap document.

Determines the type based on the root element's local name.

Returns:

Type	Description
`str`	String indicating the sitemap type: 'sitemapindex' (a sitemap index containing references to other sitemaps), 'urlset' (a sitemap containing page URLs), or 'unsupported' (any other type of document)

Raises:

Type	Description
`SitemapError`	If no document has been loaded
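
Classifying a document by its root element's local name can be sketched with the standard library as follows. `sitemap_type` is a hypothetical helper written for illustration; it is not ethicrawl's code and, unlike the property documented here, it does not raise `SitemapError`.

```python
import xml.etree.ElementTree as ET


def sitemap_type(xml_text: str) -> str:
    """Hypothetical helper: classify a sitemap by its root element's local name."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # ElementTree spells namespaced tags as "{namespace}local"; keep the local part.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name in ("sitemapindex", "urlset"):
        return local_name
    return "unsupported"
```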

`__init__(context, document=None)`

Initialize a sitemap document parser with security protections.

Sets up the XML parser with security features to prevent common XML vulnerabilities and optionally parses a provided document.

Parameters:

Name	Type	Description	Default
`context`	`Context`	Context for logging and resource resolution	required
`document`	`str \| None`	Optional XML sitemap content to parse immediately	`None`

Raises:

Type	Description
`SitemapError`	If the provided document cannot be parsed
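
This reference does not list which protections the parser enables. One common hardening step, shown here purely as an assumption-laden sketch, is to refuse any document that declares a DOCTYPE, since that is where entity definitions live. `reject_doctype` and `UnsafeXmlError` are hypothetical names, and the sketch uses the standard library's expat bindings rather than whatever parser ethicrawl configures.

```python
import xml.parsers.expat


class UnsafeXmlError(ValueError):
    """Hypothetical error type for documents that declare a DOCTYPE."""


def reject_doctype(xml_text: str) -> None:
    """Hypothetical pre-check: raise if the document contains a DOCTYPE declaration."""

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Entity definitions (the basis of XXE and entity-expansion attacks)
        # can only appear inside a DOCTYPE, so stop at the first one.
        raise UnsafeXmlError("DOCTYPE declarations are not allowed")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    # Exceptions raised inside a handler propagate out of Parse().
    parser.Parse(xml_text.encode("utf-8"), True)
```

Sitemaps have no legitimate need for a DOCTYPE, so rejecting one outright is a simple policy. Malformed XML still surfaces as `xml.parsers.expat.ExpatError`.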

`SitemapEntry` `dataclass`

Bases: Resource

Base class for entries found in XML sitemaps.

SitemapEntry extends Resource to represent entries from XML sitemaps with their additional metadata. It maintains the URL identity pattern while adding sitemap-specific attributes like the last modification date.

This class validates W3C datetime formats and provides an appropriate string representation of sitemap entries.

Attributes:

Name	Type	Description
`url`	`Url`	URL of the sitemap entry (inherited from Resource)
`lastmod`	`str \| None`	Last modification date string in W3C format (optional)

Example

```python
>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import SitemapEntry
>>> entry = SitemapEntry(
...     Url("https://example.com/page1"),
...     lastmod="2023-06-15T14:30:00Z"
... )
>>> str(entry)
'https://example.com/page1 (last modified: 2023-06-15T14:30:00Z)'
```

`__post_init__()`

Validate fields after initialization.

Validates the lastmod date format if provided, ensuring it conforms to one of the accepted W3C datetime formats.

Raises:

Type	Description
`ValueError`	If lastmod format is invalid
`TypeError`	If lastmod is not a string
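
The exact set of formats ethicrawl accepts is not listed here. As a sketch under that caveat, the check below accepts the shapes defined in the W3C "Date and Time Formats" note and mirrors the documented `TypeError` / `ValueError` split. `check_lastmod` and `W3C_DATETIME` are hypothetical names.

```python
import re

# Shapes from the W3C note: YYYY, YYYY-MM, YYYY-MM-DD, and
# YYYY-MM-DDThh:mm[:ss[.s]] followed by a time zone designator (Z, +hh:mm or -hh:mm).
W3C_DATETIME = re.compile(
    r"\d{4}"
    r"(-\d{2}"
    r"(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
    r")?"
    r")?"
)


def check_lastmod(lastmod: object) -> None:
    """Hypothetical validator: raise if lastmod is not a W3C-style datetime string."""
    if lastmod is None:
        return  # lastmod is optional
    if not isinstance(lastmod, str):
        raise TypeError("lastmod must be a string")
    # Only the shape is checked; out-of-range values such as month 13 still pass.
    if W3C_DATETIME.fullmatch(lastmod) is None:
        raise ValueError(f"invalid lastmod format: {lastmod!r}")
```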

`__str__()`

Human-readable string representation of the sitemap entry.

Returns:

Type	Description
`str`	URL string with last modification date if available

`SitemapParser`

Recursive parser for extracting URLs from sitemap documents.

This class handles the traversal of sitemap structures, including nested sitemap indexes, to extract all page URLs. It implements:

- Recursive traversal of sitemap indexes
- Depth limiting to prevent excessive recursion
- Cycle detection to prevent infinite loops
- URL deduplication
- Multiple input formats (IndexDocument, ResourceList, etc.)
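
How recursion, depth limiting, cycle detection, and deduplication fit together can be sketched without any network access. Everything below is hypothetical: `FAKE_SITEMAPS` stands in for fetched documents, `collect_urls` is not SitemapParser's actual algorithm, and the default `max_depth` of 5 is an arbitrary choice.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical in-memory "site": each sitemap URL maps to (kind, child URLs).
FAKE_SITEMAPS: Dict[str, Tuple[str, List[str]]] = {
    "https://example.com/sitemap.xml": ("sitemapindex", [
        "https://example.com/sitemap-a.xml",
        "https://example.com/sitemap-b.xml",
    ]),
    "https://example.com/sitemap-a.xml": ("urlset", [
        "https://example.com/page1",
        "https://example.com/page2",
    ]),
    "https://example.com/sitemap-b.xml": ("sitemapindex", [
        "https://example.com/sitemap.xml",    # cycle back to the root index
        "https://example.com/sitemap-a.xml",  # sitemap already visited
        "https://example.com/sitemap-c.xml",
    ]),
    "https://example.com/sitemap-c.xml": ("urlset", [
        "https://example.com/page2",  # duplicate page URL
        "https://example.com/page3",
    ]),
}


def collect_urls(roots: List[str], max_depth: int = 5) -> List[str]:
    """Hypothetical traversal: return deduplicated page URLs in discovery order."""
    visited: Set[str] = set()     # sitemaps already processed (cycle detection)
    seen_pages: Set[str] = set()  # page URLs already emitted (deduplication)
    pages: List[str] = []

    def walk(sitemap_url: str, depth: int) -> None:
        # Stop on excessive nesting or on a sitemap we have already handled.
        if depth > max_depth or sitemap_url in visited:
            return
        visited.add(sitemap_url)
        kind, children = FAKE_SITEMAPS[sitemap_url]
        for child in children:
            if kind == "sitemapindex":
                walk(child, depth + 1)
            elif child not in seen_pages:
                seen_pages.add(child)
                pages.append(child)

    for root in roots:
        walk(root, 0)
    return pages
```

Starting from `https://example.com/sitemap.xml`, this returns page1, page2, and page3 once each, even though the index graph loops back on itself.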

Attributes:

Name	Type	Description
`context`		Context with client for fetching sitemaps and logging

Example

```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource, Url
>>> from ethicrawl.sitemaps import SitemapParser
>>> context = Context(Resource(Url("https://example.com")))
>>> parser = SitemapParser(context)
>>> # Parse from a single sitemap URL
>>> urls = parser.parse([Resource(Url("https://example.com/sitemap.xml"))])
>>> print(f"Found {len(urls)} URLs in sitemap")
```

`__init__(context)`

Initialize the sitemap parser.

Parameters:

Name	Type	Description	Default
`context`	`Context`	Context with client for fetching sitemaps and logging	required

`parse(root=None)`

Parse sitemap(s) and extract all contained URLs.

This is the main entry point for sitemap parsing. It accepts various input formats and recursively extracts all URLs from the sitemap(s).

Parameters:

Name	Type	Description	Default
`root`	`IndexDocument \| ResourceList \| list[Resource] \| None`	Source to parse. One of: `IndexDocument` (a pre-parsed sitemap index); `ResourceList` or `list[Resource]` (resources to fetch as sitemaps); `None` (use the context's base URL for robots.txt discovery)	`None`

Returns:

Type	Description
`ResourceList`	ResourceList containing all page URLs found in the sitemap(s)

Raises:

Type	Description
`SitemapError`	If a sitemap cannot be fetched or parsed

`UrlsetDocument`

Bases: SitemapDocument

Specialized parser for sitemap urlset documents.

This class extends SitemapDocument to handle urlset sitemaps, which contain page URLs with metadata like change frequency and priority. It validates that the document is a proper urlset and extracts all URL references as UrlsetEntry objects.

UrlsetDocument supports only the core sitemap protocol specification elements (loc, lastmod, changefreq, priority) and does not handle any sitemap protocol extensions.
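
Restricting extraction to the four core elements can be sketched with the standard library as below. `parse_urlset` and `CORE_FIELDS` are hypothetical names, and the dictionaries returned here merely stand in for UrlsetEntry objects. Ethicrawl's own parser may differ.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CORE_FIELDS = ("loc", "lastmod", "changefreq", "priority")


def parse_urlset(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Hypothetical helper: extract only the core protocol fields from a urlset."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        raise ValueError("document is not a urlset")

    entries: List[Dict[str, Optional[str]]] = []
    for url_node in root.findall(f"{{{SITEMAP_NS}}}url"):
        entry: Dict[str, Optional[str]] = {}
        for field in CORE_FIELDS:
            text = url_node.findtext(f"{{{SITEMAP_NS}}}{field}")
            entry[field] = text.strip() if text else None
        # Children in other namespaces (protocol extensions) are never
        # looked up, so they are silently ignored.
        if entry["loc"]:
            entries.append(entry)
    return entries
```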

Attributes:

Name	Type	Description
`entries`	`ResourceList`	ResourceList of UrlsetEntry objects representing page URLs

Example

```python
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import UrlsetDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <url>
...     <loc>https://example.com/page1</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...     <changefreq>weekly</changefreq>
...     <priority>0.8</priority>
...   </url>
... </urlset>'''
>>> urlset = UrlsetDocument(context, sitemap_xml)
>>> len(urlset.entries)
1
>>> entry = urlset.entries[0]
>>> entry.priority
'0.8'
```

`entries` `property` `writable`

Get the URLs in this urlset.

Returns:

Type	Description
`ResourceList`	ResourceList of UrlsetEntry objects representing page URLs

`__init__(context, document=None)`

Initialize a urlset sitemap document parser.

Parameters:

Name	Type	Description	Default
`context`	`Context`	Context for logging and resource resolution	required
`document`	`str \| None`	Optional XML urlset content to parse	`None`

Raises:

Type	Description
`ValueError`	If the document is not a valid urlset
`SitemapError`	If the document cannot be parsed

`UrlsetEntry` `dataclass`

Bases: SitemapEntry

Represents an entry in a sitemap urlset file.

UrlsetEntry specializes SitemapEntry for standard sitemap URL entries that contain page URLs with metadata. These entries represent actual content pages on a website, as opposed to index entries that point to other sitemap files.

In addition to the URL and lastmod attributes inherited from SitemapEntry, UrlsetEntry adds support for:

- changefreq: How frequently the page is likely to change
- priority: Relative importance of this URL (0.0-1.0)

All attributes are validated during initialization to ensure they conform to the sitemap protocol specification.

Attributes:

Name	Type	Description
`url`	`Url`	URL of the page (inherited from Resource)
`lastmod`	`str \| None`	Last modification date (inherited from SitemapEntry)
`changefreq`	`str \| None`	Update frequency (always, hourly, daily, weekly, etc.)
`priority`	`float \| str \| None`	Relative importance value from 0.0 to 1.0

Example

```python
>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import UrlsetEntry
>>> entry = UrlsetEntry(
...     Url("https://example.com/page1"),
...     lastmod="2023-06-15T14:30:00Z",
...     changefreq="weekly",
...     priority=0.8
... )
>>> str(entry)
'https://example.com/page1 | last modified: 2023-06-15T14:30:00Z | frequency: weekly | priority: 0.8'
```

`__post_init__()`

Validate fields after initialization.

Calls the parent class validation, then validates and normalizes the changefreq and priority attributes if they are provided.

Raises:

Type	Description
`ValueError`	If any field contains invalid values
`TypeError`	If any field has an incorrect type
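
Validation and normalization of these two fields might look like the sketch below. The changefreq values come from the sitemap protocol. The helper names, the lowercase normalization, and the choice to normalize priority to a `float` are all assumptions, since this reference does not specify ethicrawl's normal forms.

```python
from typing import Optional, Union

# Change frequencies defined by the sitemap protocol.
VALID_CHANGEFREQ = ("always", "hourly", "daily", "weekly", "monthly", "yearly", "never")


def normalize_changefreq(value: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return a lowercase, trimmed changefreq or raise."""
    if value is None:
        return None
    if not isinstance(value, str):
        raise TypeError("changefreq must be a string")
    cleaned = value.strip().lower()
    if cleaned not in VALID_CHANGEFREQ:
        raise ValueError(f"invalid changefreq: {value!r}")
    return cleaned


def normalize_priority(value: Union[float, int, str, None]) -> Optional[float]:
    """Hypothetical helper: return priority as a float in [0.0, 1.0] or raise."""
    if value is None:
        return None
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
        raise TypeError("priority must be a number or a numeric string")
    try:
        number = float(value)
    except ValueError:
        raise ValueError(f"priority is not numeric: {value!r}") from None
    # NaN fails this comparison too, so it is rejected along with out-of-range values.
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"priority must be between 0.0 and 1.0: {value!r}")
    return number
```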

`__repr__()`

Detailed representation for debugging.

Returns:

Type	Description
`str`	String representation showing class name and all field values

`__str__()`

Human-readable string representation.

Creates a pipe-separated string containing the URL and any available metadata (lastmod, changefreq, priority).

Returns:

Type	Description
`str`	Formatted string with URL and metadata
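
A pipe-separated representation that includes only the metadata actually present can be sketched as follows. `EntrySketch` is a hypothetical stand-in for UrlsetEntry with a plain string URL; the labels are taken from the UrlsetEntry example output, but the rest is an assumption about how such a method could be written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntrySketch:
    """Hypothetical stand-in for UrlsetEntry, using a plain string URL."""
    url: str
    lastmod: Optional[str] = None
    changefreq: Optional[str] = None
    priority: Optional[float] = None

    def __str__(self) -> str:
        # Start with the URL, then append each piece of metadata that is set.
        parts = [self.url]
        if self.lastmod is not None:
            parts.append(f"last modified: {self.lastmod}")
        if self.changefreq is not None:
            parts.append(f"frequency: {self.changefreq}")
        if self.priority is not None:
            parts.append(f"priority: {self.priority}")
        return " | ".join(parts)
```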
