Sitemaps Module
XML sitemap parsing and traversal for discovering website structure.
IndexDocument
Bases: SitemapDocument
Specialized parser for sitemap index documents.
This class extends the SitemapDocument class to handle sitemap indexes, which are XML documents containing references to other sitemap files. It validates that the document is a proper sitemap index and extracts all sitemap references as IndexEntry objects.
IndexDocument enforces type safety for its entries collection, ensuring that only IndexEntry objects can be added.
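One way to picture this kind of type enforcement is a property setter that checks every item before accepting a new collection. The following is a minimal sketch with stand-in classes; `Entry`, `IndexEntryLike`, `UrlsetEntryLike`, and `IndexLike` are hypothetical names, and ethicrawl's own IndexDocument and ResourceList may implement the check differently.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Stand-in for a sitemap entry (hypothetical, not ethicrawl's class)."""
    url: str


class IndexEntryLike(Entry):
    """Stand-in for an entry that points at another sitemap."""


class UrlsetEntryLike(Entry):
    """Stand-in for an entry that points at a page."""


class IndexLike:
    """Holds entries and rejects anything that is not an IndexEntryLike."""

    def __init__(self) -> None:
        self._entries = []

    @property
    def entries(self):
        return self._entries

    @entries.setter
    def entries(self, items) -> None:
        # Check every item first so a bad list never partially replaces the old one.
        for item in items:
            if not isinstance(item, IndexEntryLike):
                raise TypeError(f"expected IndexEntryLike, got {type(item).__name__}")
        self._entries = list(items)


index = IndexLike()
index.entries = [IndexEntryLike("https://example.com/sitemap1.xml")]

try:
    index.entries = [UrlsetEntryLike("https://example.com/page1")]
    rejected = False
except TypeError:
    rejected = True
```

After the failed assignment, `index.entries` still holds the single valid entry, because validation runs before the stored list is replaced.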
Attributes:
| Name | Type | Description |
|---|---|---|
| entries | ResourceList | ResourceList of IndexEntry objects representing sitemap references |
Example
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import IndexDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <sitemap>
...     <loc>https://example.com/sitemap1.xml</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...   </sitemap>
... </sitemapindex>
... '''
>>> index = IndexDocument(context, sitemap_xml)
>>> len(index.entries)
1
>>> str(index.entries[0])
'https://example.com/sitemap1.xml (last modified: 2023-06-15T14:30:00Z)'
entries
property
writable
Get the sitemaps in this index.
Returns:
| Type | Description |
|---|---|
| ResourceList | ResourceList of IndexEntry objects representing sitemap references |
__init__(context, document=None)
Initialize a sitemap index document parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | Context for logging and resource resolution | required |
| document | str \| None | Optional XML sitemap index content to parse | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If the document is not a valid sitemap index |
| SitemapError | If the document cannot be parsed |
IndexEntry
dataclass
Bases: SitemapEntry
Represents an entry in a sitemap index file.
IndexEntry specializes SitemapEntry for use in sitemap index files. Sitemap indexes are XML files that contain references to other sitemap files, allowing websites to organize their sitemaps hierarchically.
IndexEntry maintains the same attributes as SitemapEntry (url and lastmod) but provides specialized string representation appropriate for index entries.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | Url | URL of the sitemap file (inherited from Resource) |
| lastmod | str \| None | Last modification date of the sitemap (inherited from SitemapEntry) |
Example
>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import IndexEntry
>>> index = IndexEntry(
...     Url("https://example.com/sitemap-products.xml"),
...     lastmod="2023-06-15T14:30:00Z"
... )
>>> str(index)
'https://example.com/sitemap-products.xml (last modified: 2023-06-15T14:30:00Z)'
>>> repr(index)
"SitemapIndexEntry(url='https://example.com/sitemap-products.xml', lastmod='2023-06-15T14:30:00Z')"
__repr__()
Detailed representation for debugging.
Returns:
| Type | Description |
|---|---|
| str | String representation showing class name and field values |
SitemapDocument
Parser and representation of XML sitemap documents.
This class handles the parsing, validation, and extraction of entries from XML sitemap documents, supporting both sitemap index files and urlset files. It implements security best practices for XML parsing to prevent common vulnerabilities like XXE attacks.
SitemapDocument distinguishes between sitemap indexes (which contain references to other sitemaps) and urlsets (which contain actual page URLs), extracting the appropriate entries in each case.
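The distinction between the two document kinds can be illustrated with the standard library alone. This is a rough sketch, not ethicrawl's implementation: `extract_locs` is a hypothetical helper, and it uses `xml.etree.ElementTree` where the library may use a different parser.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the sitemap protocol (sitemaps.org, schema 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def extract_locs(xml_text):
    """Return (document kind, list of <loc> values) for a sitemap document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        kind, child = "sitemapindex", "sm:sitemap"   # entries reference other sitemaps
    elif root.tag == f"{{{SITEMAP_NS}}}urlset":
        kind, child = "urlset", "sm:url"             # entries reference pages
    else:
        return "unsupported", []
    locs = [
        node.findtext("sm:loc", default="", namespaces=NS).strip()
        for node in root.findall(child, NS)
    ]
    return kind, locs


index_xml = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap1.xml</loc></sitemap>
</sitemapindex>"""

urlset_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
</urlset>"""

index_result = extract_locs(index_xml)
urlset_result = extract_locs(urlset_xml)
```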
Attributes:
| Name | Type | Description |
|---|---|---|
| SITEMAP_NS | | The official sitemap namespace URI |
| entries | ResourceList | ResourceList containing the parsed sitemap entries |
| type | str | The type of sitemap (sitemapindex, urlset, or unsupported) |
Example
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import SitemapDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <url>
...     <loc>https://example.com/page1</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...   </url>
... </urlset>
... '''
>>> sitemap = SitemapDocument(context, sitemap_xml)
>>> sitemap.type
'urlset'
>>> len(sitemap.entries)
1
entries
property
Get the list of entries extracted from the sitemap.
Returns:
| Type | Description |
|---|---|
| ResourceList | ResourceList containing SitemapEntry objects (either IndexEntry or UrlsetEntry, depending on the sitemap type) |
type
property
Get the type of sitemap document.
Determines the type based on the root element's local name.
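A root element's local name is its tag with the namespace part removed. As a minimal sketch using the standard library (`root_local_name` is a hypothetical helper, not part of ethicrawl), ElementTree stores namespaced tags as `{uri}local`, so the local name is whatever follows the closing brace:

```python
import xml.etree.ElementTree as ET


def root_local_name(xml_text):
    """Return the root element's tag with any '{namespace}' prefix removed."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Namespaced tags look like '{uri}local'; un-namespaced tags have no '}'.
    return root.tag.rsplit("}", 1)[-1]


namespaced = root_local_name(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
)
plain = root_local_name("<feed/>")
```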
Returns:
| Type | Description |
|---|---|
| str | String indicating the sitemap type: sitemapindex, urlset, or unsupported |
Raises:
| Type | Description |
|---|---|
| SitemapError | If no document has been loaded |
__init__(context, document=None)
Initialize a sitemap document parser with security protections.
Sets up the XML parser with security features to prevent common XML vulnerabilities and optionally parses a provided document.
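This page does not say which parser settings are used. One common hardening step, shown here only as an illustration, is to refuse any document that carries a DOCTYPE declaration, since XXE and entity-expansion attacks depend on one. The sketch uses the standard library's expat bindings; `reject_doctype` and `UnsafeXmlError` are hypothetical names.

```python
from xml.parsers import expat


class UnsafeXmlError(ValueError):
    """Raised when a document declares a DOCTYPE (hypothetical name)."""


def reject_doctype(xml_text):
    """Raise UnsafeXmlError if the document contains a DOCTYPE declaration."""
    parser = expat.ParserCreate()

    def forbid_doctype(name, system_id, public_id, has_internal_subset):
        # Called by expat as soon as it sees '<!DOCTYPE'; abort parsing here.
        raise UnsafeXmlError("DOCTYPE declarations are not allowed")

    parser.StartDoctypeDeclHandler = forbid_doctype
    parser.Parse(xml_text.encode("utf-8"), True)


safe_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
)
hostile_xml = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE x [<!ENTITY e SYSTEM "file:///etc/passwd">]>'
    "<x>&e;</x>"
)

reject_doctype(safe_xml)  # passes silently

try:
    reject_doctype(hostile_xml)
    blocked = False
except UnsafeXmlError:
    blocked = True
```

A check like this would run before the document is handed to the real parser, so the entity is never resolved.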
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | Context for logging and resource resolution | required |
| document | str \| None | Optional XML sitemap content to parse immediately | None |
Raises:
| Type | Description |
|---|---|
| SitemapError | If the provided document cannot be parsed |
SitemapEntry
dataclass
Bases: Resource
Base class for entries found in XML sitemaps.
SitemapEntry extends Resource to represent entries from XML sitemaps with their additional metadata. It maintains the URL identity pattern while adding sitemap-specific attributes like the last modification date.
This class handles validation of W3C datetime formats and provides appropriate string representation of sitemap entries.
Attributes:
| Name | Type | Description |
|---|---|---|
| url | Url | URL of the sitemap entry (inherited from Resource) |
| lastmod | str \| None | Last modification date string in W3C format (optional) |
Example
>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import SitemapEntry
>>> entry = SitemapEntry(
...     Url("https://example.com/page1"),
...     lastmod="2023-06-15T14:30:00Z"
... )
>>> str(entry)
'https://example.com/page1 (last modified: 2023-06-15T14:30:00Z)'
__post_init__()
Validate fields after initialization.
Validates the lastmod date format if provided, ensuring it conforms to one of the accepted W3C datetime formats.
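The exact set of accepted formats is not listed here. As a rough sketch, assuming the formats of the W3C datetime profile (year, year-month, full date, or date plus time with a time zone designator), a check could look like the following; `validate_lastmod` and `W3C_DATETIME` are hypothetical names.

```python
import re

# Shape-only check for the W3C datetime profile. It does not verify ranges,
# so a month of "13" would still pass.
W3C_DATETIME = re.compile(
    r"^\d{4}"                                              # YYYY
    r"(-\d{2}"                                             # -MM
    r"(-\d{2}"                                             # -DD
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"  # Thh:mm[:ss[.s]]TZD
    r")?)?$"
)


def validate_lastmod(value):
    """Return value unchanged if it is None or a W3C-style datetime string."""
    if value is None:
        return None
    if not isinstance(value, str):
        raise TypeError("lastmod must be a string")
    if not W3C_DATETIME.match(value):
        raise ValueError(f"invalid lastmod format: {value!r}")
    return value


checked = validate_lastmod("2023-06-15T14:30:00Z")
```

The split between TypeError (wrong type) and ValueError (wrong format) mirrors the exceptions listed for `__post_init__`.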
Raises:
| Type | Description |
|---|---|
| ValueError | If lastmod format is invalid |
| TypeError | If lastmod is not a string |
__str__()
Human-readable string representation of the sitemap entry.
Returns:
| Type | Description |
|---|---|
| str | URL string with last modification date if available |
SitemapParser
Recursive parser for extracting URLs from sitemap documents.
This class handles the traversal of sitemap structures, including nested sitemap indexes, to extract all page URLs. It implements:
- Recursive traversal of sitemap indexes
- Depth limiting to prevent excessive recursion
- Cycle detection to prevent infinite loops
- URL deduplication
- Multiple input formats (IndexDocument, ResourceList, etc.)
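The first four behaviours in the list above can be sketched without any network access. In this illustration the sitemaps live in an in-memory dictionary; `FAKE_SITEMAPS`, `collect_urls`, and the `max_depth` default of 5 are assumptions made for the example, not ethicrawl's names or settings.

```python
# Hypothetical in-memory "web": each sitemap URL maps to (kind, list of locs).
FAKE_SITEMAPS = {
    "https://example.com/sitemap.xml": (
        "sitemapindex",
        ["https://example.com/a.xml", "https://example.com/b.xml"],
    ),
    "https://example.com/a.xml": (
        "urlset",
        ["https://example.com/page1", "https://example.com/page2"],
    ),
    # b.xml points back at the root index, which forms a cycle.
    "https://example.com/b.xml": (
        "sitemapindex",
        ["https://example.com/sitemap.xml", "https://example.com/c.xml"],
    ),
    # c.xml repeats page2, which must appear only once in the result.
    "https://example.com/c.xml": (
        "urlset",
        ["https://example.com/page2", "https://example.com/page3"],
    ),
}


def collect_urls(roots, max_depth=5):
    """Walk sitemap indexes recursively and return unique page URLs in order."""
    visited = set()  # sitemap URLs already traversed (cycle detection)
    pages = {}       # dict keys keep insertion order and deduplicate

    def walk(sitemap_url, depth):
        if depth > max_depth or sitemap_url in visited:
            return
        visited.add(sitemap_url)
        kind, locs = FAKE_SITEMAPS.get(sitemap_url, ("unsupported", []))
        if kind == "sitemapindex":
            for child in locs:
                walk(child, depth + 1)
        elif kind == "urlset":
            for page in locs:
                pages[page] = None

    for root in roots:
        walk(root, 0)
    return list(pages)


urls = collect_urls(["https://example.com/sitemap.xml"])
shallow = collect_urls(["https://example.com/sitemap.xml"], max_depth=0)
```

With `max_depth=0` only the root index itself is read, so no page URLs are collected.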
Attributes:
| Name | Type | Description |
|---|---|---|
| context | Context | Context with client for fetching sitemaps and logging |
Example
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource, Url
>>> from ethicrawl.sitemaps import SitemapParser
>>> context = Context(Resource(Url("https://example.com")))
>>> parser = SitemapParser(context)
>>> # Parse from a single sitemap URL
>>> urls = parser.parse([Resource(Url("https://example.com/sitemap.xml"))])
>>> print(f"Found {len(urls)} URLs in sitemap")
__init__(context)
Initialize the sitemap parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | Context with client for fetching sitemaps and logging | required |
parse(root=None)
Parse sitemap(s) and extract all contained URLs.
This is the main entry point for sitemap parsing. It accepts various input formats and recursively extracts all URLs from the sitemap(s).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root | IndexDocument \| ResourceList \| list[Resource] \| None | Source to parse: an IndexDocument (pre-parsed sitemap index), a ResourceList or list[Resource] (resources to fetch as sitemaps), or None (use the context's base URL for robots.txt discovery) | None |
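How the robots.txt discovery for `root=None` is carried out is not shown on this page. For illustration only, the standard library can already pull `Sitemap:` directives out of robots.txt content; the file content below is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would fetch this from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

# site_maps() (Python 3.8+) returns None when no Sitemap lines are present.
sitemap_urls = robots.site_maps() or []
```

Each discovered URL would then serve as a starting point for parsing, in the same way as an explicit list of resources.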
Returns:
| Type | Description |
|---|---|
| ResourceList | ResourceList containing all page URLs found in the sitemap(s) |
Raises:
| Type | Description |
|---|---|
| SitemapError | If a sitemap cannot be fetched or parsed |
UrlsetDocument
Bases: SitemapDocument
Specialized parser for sitemap urlset documents.
This class extends SitemapDocument to handle urlset sitemaps, which contain page URLs with metadata like change frequency and priority. It validates that the document is a proper urlset and extracts all URL references as UrlsetEntry objects.
UrlsetDocument supports only the core sitemap protocol specification elements (loc, lastmod, changefreq, priority) and does not handle any sitemap protocol extensions.
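To make "core elements only" concrete, here is a minimal sketch that reads the four core fields from each `<url>` element and ignores anything from an extension namespace. `core_fields` is a hypothetical helper built on the standard library, and the image extension in the sample XML is included only to show that it is skipped.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CORE_FIELDS = ("loc", "lastmod", "changefreq", "priority")


def core_fields(xml_text):
    """Return one dict per <url>, keeping only core-namespace child elements."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        row = {}
        for field in CORE_FIELDS:
            # Only direct children in the core namespace match this lookup.
            text = url.findtext(f"{{{SITEMAP_NS}}}{field}")
            if text is not None:
                row[field] = text.strip()
        rows.append(row)
    return rows


urlset_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2023-06-15T14:30:00Z</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
    <image:image><image:loc>https://example.com/a.png</image:loc></image:image>
  </url>
</urlset>"""

rows = core_fields(urlset_xml)
```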
Attributes:
| Name | Type | Description |
|---|---|---|
| entries | ResourceList | ResourceList of UrlsetEntry objects representing page URLs |
Example
>>> from ethicrawl.context import Context
>>> from ethicrawl.core import Resource
>>> from ethicrawl.sitemaps import UrlsetDocument
>>> context = Context(Resource("https://example.com"))
>>> sitemap_xml = '''<?xml version="1.0" encoding="UTF-8"?>
... <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
...   <url>
...     <loc>https://example.com/page1</loc>
...     <lastmod>2023-06-15T14:30:00Z</lastmod>
...     <changefreq>weekly</changefreq>
...     <priority>0.8</priority>
...   </url>
... </urlset>
... '''
>>> urlset = UrlsetDocument(context, sitemap_xml)
>>> len(urlset.entries)
1
>>> entry = urlset.entries[0]
>>> entry.priority
'0.8'
entries
property
writable
Get the URLs in this urlset.
Returns:
| Type | Description |
|---|---|
| ResourceList | ResourceList of UrlsetEntry objects representing page URLs |
__init__(context, document=None)
Initialize a urlset sitemap document parser.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | Context for logging and resource resolution | required |
| document | str \| None | Optional XML urlset content to parse | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If the document is not a valid urlset |
| SitemapError | If the document cannot be parsed |
UrlsetEntry
dataclass
Bases: SitemapEntry
Represents an entry in a sitemap urlset file.
UrlsetEntry specializes SitemapEntry for standard sitemap URL entries that contain page URLs with metadata. These entries represent actual content pages on a website, as opposed to index entries that point to other sitemap files.
In addition to the url and lastmod attributes inherited from SitemapEntry, UrlsetEntry adds support for:
- changefreq: How frequently the page is likely to change
- priority: Relative importance of this URL (0.0-1.0)
All attributes are validated during initialization to ensure they conform to the sitemap protocol specification.
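The sitemap protocol defines seven change frequencies and a priority range of 0.0 to 1.0, so the validation can be sketched as a dataclass `__post_init__`. `PageEntry` is a hypothetical stand-in for illustration. It lowercases changefreq and converts priority to float, which is an assumption and may differ from how UrlsetEntry normalizes values.

```python
from dataclasses import dataclass
from typing import Optional, Union

# The seven change frequencies defined by the sitemap protocol.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


@dataclass
class PageEntry:
    """Stand-in for a urlset entry (hypothetical, not ethicrawl's UrlsetEntry)."""

    url: str
    changefreq: Optional[str] = None
    priority: Union[float, str, None] = None

    def __post_init__(self) -> None:
        if self.changefreq is not None:
            if not isinstance(self.changefreq, str):
                raise TypeError("changefreq must be a string")
            normalized = self.changefreq.strip().lower()
            if normalized not in VALID_CHANGEFREQ:
                raise ValueError(f"invalid changefreq: {self.changefreq!r}")
            self.changefreq = normalized
        if self.priority is not None:
            is_number = isinstance(self.priority, (int, float, str))
            if isinstance(self.priority, bool) or not is_number:
                raise TypeError("priority must be a number or numeric string")
            value = float(self.priority)  # ValueError for non-numeric strings
            if not 0.0 <= value <= 1.0:
                raise ValueError("priority must be between 0.0 and 1.0")
            self.priority = value


entry = PageEntry("https://example.com/page1", changefreq="Weekly", priority="0.8")
```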
Attributes:
| Name | Type | Description |
|---|---|---|
| url | Url | URL of the page (inherited from Resource) |
| lastmod | str \| None | Last modification date (inherited from SitemapEntry) |
| changefreq | str \| None | Update frequency (always, hourly, daily, weekly, monthly, yearly, or never) |
| priority | float \| str \| None | Relative importance value from 0.0 to 1.0 |
Example
>>> from ethicrawl.core import Url
>>> from ethicrawl.sitemaps import UrlsetEntry
>>> entry = UrlsetEntry(
...     Url("https://example.com/page1"),
...     lastmod="2023-06-15T14:30:00Z",
...     changefreq="weekly",
...     priority=0.8
... )
>>> str(entry)
'https://example.com/page1 | last modified: 2023-06-15T14:30:00Z | frequency: weekly | priority: 0.8'
__post_init__()
Validate fields after initialization.
Calls the parent class validation, then validates and normalizes the changefreq and priority attributes if they are provided.
Raises:
| Type | Description |
|---|---|
| ValueError | If any field contains invalid values |
| TypeError | If any field has an incorrect type |
__repr__()
Detailed representation for debugging.
Returns:
| Type | Description |
|---|---|
| str | String representation showing class name and all field values |
__str__()
Human-readable string representation.
Creates a pipe-separated string containing the URL and any available metadata (lastmod, changefreq, priority).
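The format shown in the UrlsetEntry example can be reproduced with a small helper that appends only the fields that are present. `format_entry` is a hypothetical function written for illustration. The labels are taken from the example output, and omitting missing fields entirely is an assumption.

```python
def format_entry(url, lastmod=None, changefreq=None, priority=None):
    """Join the URL and whichever metadata fields are present with ' | '."""
    parts = [url]
    if lastmod is not None:
        parts.append(f"last modified: {lastmod}")
    if changefreq is not None:
        parts.append(f"frequency: {changefreq}")
    if priority is not None:
        parts.append(f"priority: {priority}")
    return " | ".join(parts)


full = format_entry(
    "https://example.com/page1",
    lastmod="2023-06-15T14:30:00Z",
    changefreq="weekly",
    priority=0.8,
)
bare = format_entry("https://example.com/page1")
```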
Returns:
| Type | Description |
|---|---|
| str | Formatted string with URL and metadata |