sitemap.xml Generator

Phase 1 — Core crawling logic

  • Implement a function that starts from a given webpage (the root URL) and follows links to other pages, up to a given depth limit (the number of link hops from the start page).
  • Keep track of pages already visited so the program doesn’t check the same page twice.
  • Validate each URL and handle errors (such as missing pages or connection problems) without crashing the program.

Functional result: given a starting webpage and a depth limit, the program produces a list of the unique pages it found within the site, ignoring broken links and repeated pages.
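
A minimal sketch of this phase in Python, assuming the third-party requests and BeautifulSoup libraries for fetching and link extraction (the function name crawl and the parameter max_depth are illustrative, not fixed):

    from collections import deque
    from urllib.parse import urldefrag, urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(root_url, max_depth):
        """Breadth-first crawl from root_url, up to max_depth link hops."""
        visited = {root_url}
        queue = deque([(root_url, 0)])
        pages = []

        while queue:
            url, depth = queue.popleft()
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()  # 404s and other HTTP errors count as failures
            except requests.RequestException:
                continue  # broken link or connection problem: skip it, don't crash

            pages.append(url)
            if depth == max_depth:
                continue  # don't collect links we would not follow anyway

            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                # resolve relative links and strip #fragments
                link, _ = urldefrag(urljoin(url, anchor["href"]))
                if link.startswith("http") and link not in visited:
                    visited.add(link)  # marked at enqueue time, so no URL is checked twice
                    queue.append((link, depth + 1))

        return pages

The startswith("http") test filters out mailto:, javascript:, and similar non-crawlable links.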


Phase 2 — Domain-limited crawl & metadata

  • Restrict crawling to the same domain as the root URL.
  • For each page, collect or estimate the metadata required for sitemap.xml:
    • loc (the page URL)
    • lastmod (from the HTTP Last-Modified header if available; otherwise the date of crawling)
    • changefreq (estimated from the Last-Modified header: daily if the page changed within the last day, weekly if within the last week, and so on)
    • priority (derived from crawl depth; the deeper the page, the lower the priority)
  • Handle timeouts, redirects, and 404 responses without aborting the crawl.

Functional result: the program outputs a structured list of URLs with all the metadata required for a valid sitemap, ignoring external domains.
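
A sketch of the domain check and the metadata step; same_domain would filter links before they are enqueued in the Phase 1 loop, and the changefreq thresholds and the priority formula (0.2 per depth level, floored at 0.1) are illustrative choices, not requirements of the sitemap protocol:

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime
    from urllib.parse import urlparse

    def same_domain(url, root_url):
        """True if url lives on the same domain as the crawl's root URL."""
        return urlparse(url).netloc == urlparse(root_url).netloc

    def page_metadata(url, response, depth):
        """Derive loc, lastmod, changefreq, and priority for one crawled page."""
        now = datetime.now(timezone.utc)

        # lastmod: Last-Modified header if present, crawl date otherwise
        header = response.headers.get("Last-Modified")
        lastmod = parsedate_to_datetime(header) if header else now

        # changefreq: estimated from how recently the page changed
        age_days = (now - lastmod).days
        if age_days < 1:
            changefreq = "daily"
        elif age_days < 7:
            changefreq = "weekly"
        elif age_days < 31:
            changefreq = "monthly"
        else:
            changefreq = "yearly"

        # priority: the deeper the page, the lower the value
        priority = max(0.1, 1.0 - 0.2 * depth)

        return {
            "loc": url,
            "lastmod": lastmod.strftime("%Y-%m-%d"),
            "changefreq": changefreq,
            "priority": f"{priority:.1f}",
        }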


Phase 3 — Sitemap XML generation

  • Serialize the crawled pages and their collected metadata into a valid sitemap.xml document, following the Sitemaps XML format (sitemaps.org).
  • Ensure the XML validates against the standard sitemap schema.
  • Write the XML to a file.

Functional result: running the program produces a valid sitemap.xml file with all crawled pages and their metadata.
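
A minimal sketch of the serialization step, using only the standard library's xml.etree.ElementTree (ET.indent requires Python 3.9+); entries is assumed to be the list of metadata dicts produced in Phase 2:

    import xml.etree.ElementTree as ET

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    def write_sitemap(entries, path="sitemap.xml"):
        """Serialize the crawled metadata into a sitemap.xml file."""
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for entry in entries:
            url_element = ET.SubElement(urlset, "url")
            for field in ("loc", "lastmod", "changefreq", "priority"):
                ET.SubElement(url_element, field).text = entry[field]

        tree = ET.ElementTree(urlset)
        ET.indent(tree)  # pretty-print; available since Python 3.9
        tree.write(path, encoding="utf-8", xml_declaration=True)

For the schema check, one option (an assumption, not prescribed by the plan above) is the third-party xmlschema package, which can load the official XSD and validate the generated file:

    import xmlschema

    schema = xmlschema.XMLSchema(
        "https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd")
    print(schema.is_valid("sitemap.xml"))  # True for a conforming file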


Phase 4 — Robustness & logging

  • Handle network exceptions, invalid URLs, and write permission errors gracefully.
  • Ensure the crawl queue never contains duplicate URLs.
  • Log progress to the console or a log file: number of pages crawled, current URL, and any skipped or failed URLs.
  • Ensure sitemap generation completes reliably even if some pages fail.

Functional result: sitemap generation completes without crashing, all URLs are processed correctly up to the maximum depth, priority values correspond to depth, and progress is clearly logged during execution.
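
A sketch of the error handling and logging, assuming the standard logging module and the helpers from the earlier phases; the log file name crawl.log and the message formats are arbitrary choices:

    import logging

    import requests

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[logging.StreamHandler(), logging.FileHandler("crawl.log")],
    )
    log = logging.getLogger("sitemap")

    def fetch(url, pages_done):
        """Fetch one page; log and skip on failure instead of raising."""
        log.info("crawling %s (%d pages so far)", url, pages_done)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            log.warning("skipped %s: %s", url, exc)
            return None  # the caller drops this URL and carries on

    def save(entries):
        """Write the sitemap; report write-permission problems clearly."""
        try:
            write_sitemap(entries)  # from the Phase 3 sketch
            log.info("wrote sitemap.xml with %d URLs", len(entries))
        except OSError as exc:
            log.error("could not write sitemap.xml: %s", exc)

Because fetch returns None on any failure rather than raising, a crawl loop built on it keeps going when individual pages fail, and save reports a permission problem instead of crashing at the very end of the run.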