sitemap.xml Generator
Phase 1 — Core crawling logic
- Implement a function that starts from a given webpage (the root URL) and explores pages linked from it, up to a configurable depth (how many links away from the start page the crawl goes).
- Keep track of pages already visited so the program doesn’t check the same page twice.
- Check that each URL is valid and handle errors (like missing pages or connection problems) without the program crashing.
Functional result: given a starting webpage and a depth limit, the program produces a list of unique webpages it found within the site, ignoring broken links or repeated pages.
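A minimal sketch of the Phase 1 crawl in Python, assuming the requests and beautifulsoup4 packages; the crawl() name and the breadth-first queue are illustrative choices, not requirements of the spec.

```python
# Phase 1 sketch: depth-limited crawl with a visited set and error handling.
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup


def crawl(root_url: str, max_depth: int) -> list[str]:
    """Breadth-first crawl from root_url, returning unique reachable URLs."""
    visited = set()
    queue = deque([(root_url, 0)])
    pages = []

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            # Broken link, timeout, or connection problem: skip without crashing.
            continue

        pages.append(url)

        # Collect links and enqueue them one level deeper.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments so duplicates collapse.
            link, _ = urldefrag(urljoin(url, anchor["href"]))
            if link not in visited:
                queue.append((link, depth + 1))

    return pages
```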
Phase 2 — Domain-limited crawl & metadata
- Restrict crawling to the same domain as the root URL.
- For each page, collect or estimate the metadata required for sitemap.xml:
  - loc: the page URL
  - lastmod: from the HTTP Last-Modified header, if available; otherwise use the date of crawling
  - changefreq: estimated from the Last-Modified header (daily if modified less than 1 day ago, weekly if less than 1 week ago, etc.)
  - priority: calculated from depth; the deeper the page, the lower the priority
- Handle errors like timeouts, redirects, and 404 pages.
Functional result: the program outputs a structured list of URLs with all the metadata required for a valid sitemap, ignoring external domains.
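One way the Phase 2 metadata could be derived, sketched in Python; the changefreq thresholds and the depth-based priority formula (1.0 minus 0.2 per level, floored at 0.1) are assumptions made to keep the example concrete, and the function names are hypothetical.

```python
# Phase 2 sketch: domain restriction plus per-page sitemap metadata.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse

import requests


def same_domain(url: str, root_url: str) -> bool:
    """Keep the crawl restricted to the root URL's domain."""
    return urlparse(url).netloc == urlparse(root_url).netloc


def page_metadata(url: str, depth: int, response: requests.Response) -> dict:
    """Build the loc/lastmod/changefreq/priority entry for one crawled page."""
    header = response.headers.get("Last-Modified")
    if header:
        lastmod = parsedate_to_datetime(header)
    else:
        lastmod = datetime.now(timezone.utc)  # fall back to the crawl date

    # Estimate changefreq from how recently the page changed (assumed thresholds).
    age_days = (datetime.now(timezone.utc) - lastmod).days
    if age_days < 1:
        changefreq = "daily"
    elif age_days < 7:
        changefreq = "weekly"
    elif age_days < 31:
        changefreq = "monthly"
    else:
        changefreq = "yearly"

    # Deeper pages get lower priority; 1.0 at the root, never below 0.1 (assumed formula).
    priority = max(0.1, round(1.0 - 0.2 * depth, 1))

    return {
        "loc": url,
        "lastmod": lastmod.strftime("%Y-%m-%d"),
        "changefreq": changefreq,
        "priority": f"{priority:.1f}",
    }
```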
Phase 3 — Sitemap XML generation
- Convert the crawled pages into a valid sitemap.xml document (as per the Sitemaps XML format), using the collected metadata.
- Ensure the XML validates against the standard sitemap schema.
- Write the XML to a file.
Functional result: running the program produces a valid sitemap.xml file with all crawled pages and their metadata.
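A minimal Phase 3 sketch using the standard library's xml.etree.ElementTree; it assumes pages is the list of metadata dicts produced in the Phase 2 sketch and uses the official sitemap namespace.

```python
# Phase 3 sketch: serialize crawled pages into a sitemap.xml file.
import xml.etree.ElementTree as ET


def write_sitemap(pages: list[dict], path: str = "sitemap.xml") -> None:
    """Write <urlset> entries per the Sitemaps XML format to a file."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for page in pages:
        url_el = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            ET.SubElement(url_el, tag).text = str(page[tag])

    tree = ET.ElementTree(urlset)
    ET.indent(tree)  # pretty-print; available in Python 3.9+
    tree.write(path, encoding="utf-8", xml_declaration=True)
```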
Phase 4 — Robustness & logging
- Handle network exceptions, invalid URLs, and write permission errors gracefully.
- Ensure the crawl queue does not contain duplicate URLs.
- Log progress to the console or a log file: number of pages crawled, current URL, and any skipped or failed URLs.
- Ensure sitemap generation completes reliably even if some pages fail.
Functional result: sitemap generation completes without crashing, all URLs are processed correctly up to the maximum depth, priority values correspond to depth, and progress is clearly logged during execution.
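A possible shape for the Phase 4 logging and failure handling, built on the standard logging module; the log format, the crawl.log file name, and the safe_write() helper are illustrative choices, and write_sitemap() refers to the Phase 3 sketch.

```python
# Phase 4 sketch: progress logging and graceful handling of write errors.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.FileHandler("crawl.log"), logging.StreamHandler()],
)
log = logging.getLogger("sitemap")

# Inside the crawl loop, progress and failures are reported without aborting:
#   log.info("crawling %s (%d pages so far)", url, len(pages))
#   log.warning("skipping %s: %s", url, error)


def safe_write(pages: list[dict], path: str) -> None:
    """Write the sitemap and report the outcome instead of crashing."""
    try:
        write_sitemap(pages, path)  # from the Phase 3 sketch
        log.info("wrote %d URLs to %s", len(pages), path)
    except OSError as exc:
        # Covers permission errors and other filesystem failures.
        log.error("could not write %s: %s", path, exc)
```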