WorkerPool
This project is a distributed system composed of a master scheduler and multiple workers. The master enqueues download tasks for the top sites per country (as listed by Semrush), and workers download the pages in parallel through a shared message queue.
Phase 1: Queue Setup, CLI Parsing & System Layout
Define the architecture and the communication layer (Redis or RabbitMQ).
- Validate queue configuration
- Define JSON message format (link, disk path)
- Prepare folder structure and logging
Functional Output: System correctly connects to the queue and accepts enqueue/dequeue operations.
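The JSON message format above can be sketched as a pair of serialize/validate helpers. The field names `link` and `path` are illustrative, taken from the "(link, disk path)" note; the actual schema is whatever the master and workers agree on.

```python
import json

def make_task(link: str, disk_path: str) -> str:
    """Serialize one download task as a JSON queue message (field names are illustrative)."""
    return json.dumps({"link": link, "path": disk_path})

def parse_task(raw: str) -> dict:
    """Deserialize a task message; raise ValueError on malformed input."""
    msg = json.loads(raw)
    if not isinstance(msg, dict) or "link" not in msg or "path" not in msg:
        raise ValueError(f"malformed task message: {raw!r}")
    return msg
```

Validating on the consumer side as well as the producer side keeps a single bad message from crashing a worker loop later on.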
Phase 2: Master Crawler for Semrush Country Pages
Implement the master that schedules downloads.
- Fetch Semrush country listing pages (HTML parsing only)
- Extract top 20 links per country
- Generate storage paths
Functional Output: Master gathers URLs and prepares enqueue-ready tasks.
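The link-extraction and path-generation steps can be sketched with the standard library alone. The `downloads/<country>/<host>.html` layout is an assumption for illustration, not a prescribed structure.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def top_links(html: str, limit: int = 20) -> list:
    """Return the first `limit` links found in a listing page."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links[:limit]

def storage_path(country: str, url: str, root: str = "downloads") -> str:
    """Derive a disk path like downloads/<country>/<host>.html (layout is an assumption)."""
    host = urlparse(url).netloc or "unknown"
    return f"{root}/{country}/{host}.html"
```

A real crawler would likely filter the extracted hrefs (absolute URLs only, dedupe) before taking the top 20.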
Phase 3: Task Enqueueing by Master
Push download jobs into the queue.
- Serialize messages as JSON
- Push jobs reliably with retry logic
- Log enqueue successes/failures
Functional Output: Queue contains all download tasks with correct metadata.
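The reliable-push step above can be sketched queue-agnostically: the actual push operation (e.g. a Redis `LPUSH` or a RabbitMQ `basic_publish` wrapped in a lambda) is injected, so the retry/backoff logic is independent of the broker chosen in Phase 1.

```python
import json
import time

def enqueue_with_retry(push, message: dict, retries: int = 3, backoff: float = 0.1) -> bool:
    """Push a JSON-serialized message, retrying with linear backoff on failure.

    `push` is the queue client's send operation, injected by the caller.
    Returns True on success, False after exhausting all retries.
    """
    payload = json.dumps(message)
    for attempt in range(1, retries + 1):
        try:
            push(payload)
            return True
        except Exception as exc:
            print(f"enqueue attempt {attempt}/{retries} failed: {exc}")
            time.sleep(backoff * attempt)
    return False
```

Logging each failed attempt (and the final give-up) satisfies the "log enqueue successes/failures" bullet with no extra machinery.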
Phase 4: Worker Implementation for Page Downloads
Implement workers that pull tasks and download content.
- Consume tasks from queue
- Download HTML content with error handling
- Save page to specified disk location
Functional Output: Workers download and persist pages to the correct folders.
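A single worker iteration can be sketched as below. The fetcher is injected so the parse/download/persist flow can be exercised without a network; the default uses `urllib` from the standard library, though a production worker would likely add headers, size limits, and richer error classification.

```python
import json
import os
import urllib.request

def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Default fetcher: download raw page bytes with a timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def process_task(raw: str, fetch=fetch) -> bool:
    """Parse one queue message, download the page, and persist it to disk."""
    try:
        task = json.loads(raw)
        content = fetch(task["link"])
    except Exception as exc:
        print(f"task failed: {exc}")
        return False
    os.makedirs(os.path.dirname(task["path"]) or ".", exist_ok=True)
    with open(task["path"], "wb") as fh:
        fh.write(content)
    return True
```

Creating the target directory on demand means workers never depend on the master having pre-built the folder structure.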
Phase 5: Parallelism, Scaling & Failure Recovery
Ensure distributed operation and robust behavior.
- Support multiple worker instances
- Retry failed downloads
- Detect malformed messages or network issues
Functional Output: System handles parallel load, retries failures, and avoids message loss.
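The retry and malformed-message bullets can be combined into one routing function. Carrying an `attempts` counter inside the message (an assumption, not part of the Phase 1 schema) lets any worker enforce the retry cap; `requeue` and `dead_letter` are injected queue operations, and the dead-letter destination itself is a hypothetical addition for illustration.

```python
import json

MAX_ATTEMPTS = 3  # illustrative retry cap

def handle_result(raw: str, ok: bool, requeue, dead_letter) -> str:
    """Route a finished task: requeue transient failures, shelve poison messages.

    Returns "done", "requeued", or "dead" so callers can update counters.
    """
    if ok:
        return "done"
    try:
        task = json.loads(raw)
        task["attempts"] = task.get("attempts", 0) + 1
    except (json.JSONDecodeError, TypeError, AttributeError):
        dead_letter(raw)  # malformed: never retry
        return "dead"
    if task["attempts"] >= MAX_ATTEMPTS:
        dead_letter(json.dumps(task))  # exhausted: park for inspection
        return "dead"
    requeue(json.dumps(task))
    return "requeued"
```

Parking bad or exhausted messages instead of dropping them is what gives the "avoids message loss" guarantee.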
Phase 6: Logging, Monitoring & System Polishing
Refine logs, observability, and final output behavior.
- Master and workers produce structured logs
- Summaries of processed tasks
- Optional monitoring dashboard or counters
Functional Output: Full logging available; system runs stably and produces all downloaded files.
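The structured-logging and counter bullets can be sketched with the standard `logging` module: a formatter that emits one JSON object per record, plus a `Counter` that doubles as the optional monitoring counters. The field names are illustrative.

```python
import json
import logging
from collections import Counter

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record so logs are machine-parseable."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "component": record.name,   # e.g. "master" or "worker-3"
            "event": record.getMessage(),
        })

counters = Counter()

def record_event(name: str):
    """Bump a named counter (e.g. "downloaded", "failed", "requeued")."""
    counters[name] += 1

def summary() -> dict:
    """Snapshot of processed-task counts for the end-of-run summary."""
    return dict(counters)
```

Attaching `JsonFormatter` to a `StreamHandler` on both master and workers yields uniform logs that a dashboard or `grep`-style tooling can consume directly.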