WorkerPool

This project builds a distributed system made up of one master scheduler and multiple workers. The master enqueues the top sites per country, as listed by Semrush, and the workers download those pages in parallel through a shared message queue.

Phase 1: Queue Setup, CLI Parsing & System Layout

Define the system architecture and the master-worker communication over a Redis or RabbitMQ queue.

  • Validate queue configuration
  • Define JSON message format (link, disk path)
  • Prepare folder structure and logging

Functional Output: System correctly connects to the queue and accepts enqueue/dequeue operations.
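
As an illustration of the JSON message format, here is a minimal sketch of one possible encoding. The class name DownloadTask and the field names link and disk_path are assumptions derived from the "(link, disk path)" bullet, not a fixed schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DownloadTask:
    """One queue message: the page to fetch and where to store it.

    Field names are assumptions based on the plan's "(link, disk path)".
    """
    link: str       # URL of the page to download
    disk_path: str  # destination path for the saved page


def encode_task(task: DownloadTask) -> str:
    """Serialize a task to a JSON string with stable key order."""
    return json.dumps(asdict(task), sort_keys=True)


def decode_task(raw: str) -> DownloadTask:
    """Parse a JSON string back into a task; raises KeyError on missing fields."""
    data = json.loads(raw)
    return DownloadTask(link=data["link"], disk_path=data["disk_path"])
```

Because the payload is a plain JSON string, the same message can be pushed to either a Redis list or a RabbitMQ queue without changes.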


Phase 2: Master Crawler for Semrush Country Pages

Implement the master that schedules downloads.

  • Fetch Semrush country listing pages (HTML parsing only)
  • Extract top 20 links per country
  • Generate storage paths

Functional Output: Master gathers URLs and prepares enqueue-ready tasks.
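
The extraction and path-generation steps might look like the sketch below. It assumes the listing page exposes the sites as absolute http(s) links in anchor tags, which has not been checked against the real Semrush markup. The helper names and the data/<country>/<host>.html layout are also hypothetical.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects unique absolute http(s) hrefs from <a> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip relative links, other schemes, and duplicates.
        if href and urlparse(href).scheme in ("http", "https") and href not in self.links:
            self.links.append(href)


def top_links(html: str, limit: int = 20) -> list[str]:
    """Return the first `limit` links found in the page."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links[:limit]


def storage_path(country: str, link: str, root: str = "data") -> str:
    """Build a path like data/us/example.com.html (assumed layout)."""
    host = urlparse(link).netloc.replace(":", "_")
    return str(PurePosixPath(root) / country.lower() / f"{host}.html")
```

A real page will probably need a narrower selector than "every anchor", for example restricting the search to the ranking table, once the actual HTML has been inspected.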


Phase 3: Task Enqueueing by Master

Push download jobs into the queue.

  • Serialize messages as JSON
  • Push jobs reliably with retry logic
  • Log enqueue successes/failures

Functional Output: Queue contains all download tasks with correct metadata.
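
One way to sketch the retry logic is a small wrapper around whatever call actually pushes to the queue. The function enqueue_with_retry, its parameters, and the exponential backoff policy are assumptions for illustration. The push callable stands in for the Redis or RabbitMQ client call.

```python
import json
import logging
import time
from typing import Callable

log = logging.getLogger("master")


def enqueue_with_retry(push: Callable[[str], None], task: dict,
                       attempts: int = 3, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Serialize `task` as JSON and push it, retrying with exponential backoff.

    Returns True on success and False once all attempts have failed.
    """
    payload = json.dumps(task)
    for attempt in range(1, attempts + 1):
        try:
            push(payload)
            log.info("enqueued %s (attempt %d)", task.get("link"), attempt)
            return True
        except Exception as exc:  # broad on purpose: client errors vary by backend
            log.warning("enqueue failed for %s (attempt %d/%d): %s",
                        task.get("link"), attempt, attempts, exc)
            if attempt < attempts:
                # Wait 0.5s, 1s, 2s, ... before the next attempt.
                sleep(base_delay * 2 ** (attempt - 1))
    log.error("giving up on %s after %d attempts", task.get("link"), attempts)
    return False
```

With redis-py, push could be something like lambda payload: client.rpush("downloads", payload), where the queue name "downloads" is a placeholder. Passing sleep as a parameter keeps the function easy to test without real delays.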


Phase 4: Worker Implementation for Page Downloads

Implement workers that pull tasks and download content.

  • Consume tasks from queue
  • Download HTML content with error handling
  • Save page to specified disk location

Functional Output: Workers download and persist pages to the correct folders.
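
The per-task work of a worker can be sketched as follows, using only the standard library. The function names and the link/disk_path field names are assumptions. The queue-consumption loop itself is left out because it depends on the chosen backend.

```python
import json
import logging
import urllib.request
from pathlib import Path

log = logging.getLogger("worker")


def fetch_html(link: str, timeout: float = 10.0) -> bytes:
    """Download the raw page body; network failures surface as OSError subclasses."""
    with urllib.request.urlopen(link, timeout=timeout) as resp:
        return resp.read()


def handle_task(raw: str, fetch=fetch_html) -> bool:
    """Download one task's page and save it; return True on success."""
    task = json.loads(raw)
    try:
        body = fetch(task["link"])
    except OSError as exc:  # URLError, HTTPError and timeouts all derive from OSError
        log.warning("download failed for %s: %s", task["link"], exc)
        return False
    dest = Path(task["disk_path"])
    dest.parent.mkdir(parents=True, exist_ok=True)  # create country folders on demand
    dest.write_bytes(body)
    return True
```

The fetch parameter exists so the download step can be replaced in tests. A worker loop would call handle_task for each message it pulls from the queue and use the return value to decide whether to acknowledge or retry.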


Phase 5: Parallelism, Scaling & Failure Recovery

Ensure distributed operation and robust behavior.

  • Support multiple worker instances
  • Retry failed downloads
  • Detect malformed messages and network issues

Functional Output: System handles parallel load, retries failures, and avoids message loss.
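
The failure-handling rules above could be sketched as below. Everything here is an assumption rather than a settled design: the "attempts" counter stored in the message, the dead-letter callback, and the limit of three attempts. The download, requeue, and dead_letter callables stand in for the real worker and queue operations.

```python
import json

REQUIRED_FIELDS = ("link", "disk_path")  # assumed message schema


def parse_message(raw: str) -> dict:
    """Return the task dict, or raise ValueError if the message is malformed."""
    try:
        task = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(task, dict):
        raise ValueError("message must be a JSON object")
    missing = [f for f in REQUIRED_FIELDS
               if not isinstance(task.get(f), str) or not task[f]]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return task


def process(raw, download, requeue, dead_letter, max_attempts=3):
    """Handle one message; returns "done", "retried", or "dead"."""
    try:
        task = parse_message(raw)
    except ValueError:
        dead_letter(raw)  # retrying cannot fix a malformed message
        return "dead"
    try:
        download(task)
        return "done"
    except OSError:  # network-level failure: retry up to max_attempts
        prior = task.get("attempts", 0)
        attempts = (prior if isinstance(prior, int) else 0) + 1
        if attempts >= max_attempts:
            dead_letter(raw)
            return "dead"
        requeue(json.dumps({**task, "attempts": attempts}))
        return "retried"
```

Every message ends in exactly one of three states: completed, requeued, or parked in a dead-letter location. That is one way to avoid silent message loss, provided the queue message is only acknowledged after process returns.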


Phase 6: Logging, Monitoring & System Polishing

Refine logs, observability, and final output behavior.

  • Emit structured logs from master and workers
  • Produce summaries of processed tasks
  • Optionally add a monitoring dashboard or counters

Functional Output: Full logging available; system runs stably and produces all downloaded files.
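
Structured logs and task summaries can be sketched with the standard logging module. The JSON line layout, the "fields" convention for extra data, and the outcome labels are assumptions chosen for illustration.

```python
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Formats each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra key/value pairs passed as extra={"fields": {...}} (assumed convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def summarize(outcomes):
    """Count task outcomes, e.g. ["done", "dead"] -> totals per label."""
    counts = Counter(outcomes)
    return {"total": sum(counts.values()), **counts}
```

A worker might call log.info("saved", extra={"fields": {"link": link}}) after each download and log summarize(outcomes) once at shutdown. The same counters could later feed an optional dashboard.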