sitemap.xml Generator

Phase 1 — Core crawling logic

  • Implement a function that starts from a given webpage (the root URL) and follows links to other pages, up to a given depth limit (the number of link hops from the start page).
  • Keep track of pages already visited so the program doesn’t check the same page twice.
  • Validate each URL and handle errors (such as missing pages or connection problems) without crashing the program.

Functional result: given a starting webpage and a depth limit, the program produces a list of the unique pages it found within the site, ignoring broken links and repeated pages.
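
A minimal sketch of this phase in Python, assuming the third-party requests and BeautifulSoup libraries for fetching and link extraction (the function name crawl and the parameter max_depth are illustrative, not fixed):

    from collections import deque
    from urllib.parse import urldefrag, urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(root_url, max_depth):
        """Breadth-first crawl from root_url, up to max_depth link hops."""
        visited = {root_url}
        queue = deque([(root_url, 0)])
        pages = []

        while queue:
            url, depth = queue.popleft()
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()  # 404s and other HTTP errors count as failures
            except requests.RequestException:
                continue  # broken link or connection problem: skip it, don't crash

            pages.append(url)
            if depth == max_depth:
                continue  # don't collect links we would not follow anyway

            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                # resolve relative links and strip #fragments
                link, _ = urldefrag(urljoin(url, anchor["href"]))
                if link.startswith("http") and link not in visited:
                    visited.add(link)  # marked at enqueue time, so no URL is checked twice
                    queue.append((link, depth + 1))

        return pages

The startswith("http") test filters out mailto:, javascript:, and similar non-crawlable links.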


Phase 2 — Domain-limited crawl & metadata

  • Restrict crawling to the same domain as the root URL.
  • For each page, collect or estimate the metadata required for sitemap.xml:
    • loc (the page URL)
    • lastmod (from the HTTP Last-Modified header if available; otherwise the date of crawling)
    • changefreq (estimated from the Last-Modified header: daily if the page changed within the last day, weekly if within the last week, and so on)
    • priority (derived from crawl depth; the deeper the page, the lower the priority)
  • Handle timeouts, redirects, and 404 responses without aborting the crawl.

Functional result: the program outputs a structured list of URLs with all the metadata required for a valid sitemap, ignoring external domains.
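
A sketch of the domain check and the metadata step; same_domain would filter links before they are enqueued in the Phase 1 loop, and the changefreq thresholds and the priority formula (0.2 per depth level, floored at 0.1) are illustrative choices, not requirements of the sitemap protocol:

    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime
    from urllib.parse import urlparse

    def same_domain(url, root_url):
        """True if url lives on the same domain as the crawl's root URL."""
        return urlparse(url).netloc == urlparse(root_url).netloc

    def page_metadata(url, response, depth):
        """Derive loc, lastmod, changefreq, and priority for one crawled page."""
        now = datetime.now(timezone.utc)

        # lastmod: Last-Modified header if present, crawl date otherwise
        header = response.headers.get("Last-Modified")
        lastmod = parsedate_to_datetime(header) if header else now

        # changefreq: estimated from how recently the page changed
        age_days = (now - lastmod).days
        if age_days < 1:
            changefreq = "daily"
        elif age_days < 7:
            changefreq = "weekly"
        elif age_days < 31:
            changefreq = "monthly"
        else:
            changefreq = "yearly"

        # priority: the deeper the page, the lower the value
        priority = max(0.1, 1.0 - 0.2 * depth)

        return {
            "loc": url,
            "lastmod": lastmod.strftime("%Y-%m-%d"),
            "changefreq": changefreq,
            "priority": f"{priority:.1f}",
        }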


Phase 3 — Sitemap XML generation

  • Serialize the crawled pages and their collected metadata into a valid sitemap.xml document, following the Sitemaps XML format (sitemaps.org).
  • Ensure the XML validates against the standard sitemap schema.
  • Write the XML to a file.

Functional result: running the program produces a valid sitemap.xml file with all crawled pages and their metadata.
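
A minimal sketch of the serialization step, using only the standard library's xml.etree.ElementTree (ET.indent requires Python 3.9+); entries is assumed to be the list of metadata dicts produced in Phase 2:

    import xml.etree.ElementTree as ET

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    def write_sitemap(entries, path="sitemap.xml"):
        """Serialize the crawled metadata into a sitemap.xml file."""
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for entry in entries:
            url_element = ET.SubElement(urlset, "url")
            for field in ("loc", "lastmod", "changefreq", "priority"):
                ET.SubElement(url_element, field).text = entry[field]

        tree = ET.ElementTree(urlset)
        ET.indent(tree)  # pretty-print; available since Python 3.9
        tree.write(path, encoding="utf-8", xml_declaration=True)

For the schema check, one option (an assumption, not prescribed by the plan above) is the third-party xmlschema package, which can load the official XSD and validate the generated file:

    import xmlschema

    schema = xmlschema.XMLSchema(
        "https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd")
    print(schema.is_valid("sitemap.xml"))  # True for a conforming file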


Phase 4 — Robustness & logging

  • Handle network exceptions, invalid URLs, and write permission errors gracefully.
  • Ensure the crawl queue never contains duplicate URLs.
  • Log progress to the console or a log file: number of pages crawled, current URL, and any skipped or failed URLs.
  • Ensure sitemap generation completes reliably even if some pages fail.

Functional result: sitemap generation completes without crashing, all URLs are processed correctly up to the maximum depth, priority values correspond to depth, and progress is clearly logged during execution.
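
A sketch of the error handling and logging, assuming the standard logging module and the helpers from the earlier phases; the log file name crawl.log and the message formats are arbitrary choices:

    import logging

    import requests

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[logging.StreamHandler(), logging.FileHandler("crawl.log")],
    )
    log = logging.getLogger("sitemap")

    def fetch(url, pages_done):
        """Fetch one page; log and skip on failure instead of raising."""
        log.info("crawling %s (%d pages so far)", url, pages_done)
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            log.warning("skipped %s: %s", url, exc)
            return None  # the caller drops this URL and carries on

    def save(entries):
        """Write the sitemap; report write-permission problems clearly."""
        try:
            write_sitemap(entries)  # from the Phase 3 sketch
            log.info("wrote sitemap.xml with %d URLs", len(entries))
        except OSError as exc:
            log.error("could not write sitemap.xml: %s", exc)

Because fetch returns None on any failure rather than raising, a crawl loop built on it keeps going when individual pages fail, and save reports a permission problem instead of crashing at the very end of the run.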