WorkerPool
This project is a distributed system composed of a master scheduler and multiple workers. The master enqueues download tasks for the top sites per country (as listed by Semrush), and workers download the pages in parallel through a shared message queue.
Phase 1: Queue Setup, CLI Parsing & System Layout
Define the architecture and the communication layer (Redis or RabbitMQ).
- Validate queue configuration
- Define JSON message format (link, disk path)
- Prepare folder structure and logging
Functional Output: System correctly connects to the queue and accepts enqueue/dequeue operations.
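The JSON message format above can be sketched as a pair of serialize/validate helpers. The field names `link` and `path` are illustrative, taken from the "(link, disk path)" note; the actual schema is whatever the master and workers agree on.

```python
import json

def make_task(link: str, disk_path: str) -> str:
    """Serialize one download task as a JSON queue message (field names are illustrative)."""
    return json.dumps({"link": link, "path": disk_path})

def parse_task(raw: str) -> dict:
    """Deserialize a task message; raise ValueError on malformed input."""
    msg = json.loads(raw)
    if not isinstance(msg, dict) or "link" not in msg or "path" not in msg:
        raise ValueError(f"malformed task message: {raw!r}")
    return msg
```

Validating on the consumer side as well as the producer side keeps a single bad message from crashing a worker loop later on.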
Phase 2: Master Crawler for Semrush Country Pages
Implement the master that schedules downloads.
- Fetch Semrush country listing pages (HTML parsing only)
- Extract top 20 links per country
- Generate storage paths
Functional Output: Master gathers URLs and prepares enqueue-ready tasks.
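The link-extraction and path-generation steps can be sketched with the standard library alone. The `downloads/<country>/<host>.html` layout is an assumption for illustration, not a prescribed structure.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def top_links(html: str, limit: int = 20) -> list:
    """Return the first `limit` links found in a listing page."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links[:limit]

def storage_path(country: str, url: str, root: str = "downloads") -> str:
    """Derive a disk path like downloads/<country>/<host>.html (layout is an assumption)."""
    host = urlparse(url).netloc or "unknown"
    return f"{root}/{country}/{host}.html"
```

A real crawler would likely filter the extracted hrefs (absolute URLs only, dedupe) before taking the top 20.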
Phase 3: Task Enqueueing by Master
Push download jobs into the queue.
- Serialize messages as JSON
- Push jobs reliably with retry logic
- Log enqueue successes/failures
Functional Output: Queue contains all download tasks with correct metadata.
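The reliable-push step above can be sketched queue-agnostically: the actual push operation (e.g. a Redis `LPUSH` or a RabbitMQ `basic_publish` wrapped in a lambda) is injected, so the retry/backoff logic is independent of the broker chosen in Phase 1.

```python
import json
import time

def enqueue_with_retry(push, message: dict, retries: int = 3, backoff: float = 0.1) -> bool:
    """Push a JSON-serialized message, retrying with linear backoff on failure.

    `push` is the queue client's send operation, injected by the caller.
    Returns True on success, False after exhausting all retries.
    """
    payload = json.dumps(message)
    for attempt in range(1, retries + 1):
        try:
            push(payload)
            return True
        except Exception as exc:
            print(f"enqueue attempt {attempt}/{retries} failed: {exc}")
            time.sleep(backoff * attempt)
    return False
```

Logging each failed attempt (and the final give-up) satisfies the "log enqueue successes/failures" bullet with no extra machinery.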
Phase 4: Worker Implementation for Page Downloads
Implement workers that pull tasks and download content.
- Consume tasks from queue
- Download HTML content with error handling
- Save page to specified disk location
Functional Output: Workers download and persist pages to the correct folders.
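A single worker iteration can be sketched as below. The fetcher is injected so the parse/download/persist flow can be exercised without a network; the default uses `urllib` from the standard library, though a production worker would likely add headers, size limits, and richer error classification.

```python
import json
import os
import urllib.request

def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Default fetcher: download raw page bytes with a timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def process_task(raw: str, fetch=fetch) -> bool:
    """Parse one queue message, download the page, and persist it to disk."""
    try:
        task = json.loads(raw)
        content = fetch(task["link"])
    except Exception as exc:
        print(f"task failed: {exc}")
        return False
    os.makedirs(os.path.dirname(task["path"]) or ".", exist_ok=True)
    with open(task["path"], "wb") as fh:
        fh.write(content)
    return True
```

Creating the target directory on demand means workers never depend on the master having pre-built the folder structure.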
Phase 5: Parallelism, Scaling & Failure Recovery
Ensure distributed operation and robust behavior.
- Support multiple worker instances
- Retry failed downloads
- Detect malformed messages or network issues
Functional Output: System handles parallel load, retries failures, and avoids message loss.
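The retry and malformed-message bullets can be combined into one routing function. Carrying an `attempts` counter inside the message (an assumption, not part of the Phase 1 schema) lets any worker enforce the retry cap; `requeue` and `dead_letter` are injected queue operations, and the dead-letter destination itself is a hypothetical addition for illustration.

```python
import json

MAX_ATTEMPTS = 3  # illustrative retry cap

def handle_result(raw: str, ok: bool, requeue, dead_letter) -> str:
    """Route a finished task: requeue transient failures, shelve poison messages.

    Returns "done", "requeued", or "dead" so callers can update counters.
    """
    if ok:
        return "done"
    try:
        task = json.loads(raw)
        task["attempts"] = task.get("attempts", 0) + 1
    except (json.JSONDecodeError, TypeError, AttributeError):
        dead_letter(raw)  # malformed: never retry
        return "dead"
    if task["attempts"] >= MAX_ATTEMPTS:
        dead_letter(json.dumps(task))  # exhausted: park for inspection
        return "dead"
    requeue(json.dumps(task))
    return "requeued"
```

Parking bad or exhausted messages instead of dropping them is what gives the "avoids message loss" guarantee.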
Phase 6: Logging, Monitoring & System Polishing
Refine logs, observability, and final output behavior.
- Master and workers produce structured logs
- Summaries of processed tasks
- Optional monitoring dashboard or counters
Functional Output: Full logging available; system runs stably and produces all downloaded files.
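The structured-logging and counter bullets can be sketched with the standard `logging` module: a formatter that emits one JSON object per record, plus a `Counter` that doubles as the optional monitoring counters. The field names are illustrative.

```python
import json
import logging
from collections import Counter

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record so logs are machine-parseable."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "component": record.name,   # e.g. "master" or "worker-3"
            "event": record.getMessage(),
        })

counters = Counter()

def record_event(name: str):
    """Bump a named counter (e.g. "downloaded", "failed", "requeued")."""
    counters[name] += 1

def summary() -> dict:
    """Snapshot of processed-task counts for the end-of-run summary."""
    return dict(counters)
```

Attaching `JsonFormatter` to a `StreamHandler` on both master and workers yields uniform logs that a dashboard or `grep`-style tooling can consume directly.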