
Building Scalable Systems: Lessons from Processing 165K URLs per Hour

Nitin Sharma Β· November 20, 2025 Β· 10 min read

The Challenge at G2

When I joined G2, our web crawler was processing 58,000 URLs per hour. We needed to scale to handle millions of domains efficiently.

Initial Architecture Problems

  • Monolithic Design: Single process handling everything
  • No Rate Limiting: Getting blocked by websites
  • Sequential Processing: Not utilizing available resources
  • Poor Error Handling: Failures cascaded

The Redesign

1. URL Frontier Service

We built a queue management service that keeps a separate queue per domain and consults the rate limiter before handing out work:

class URLFrontier:
    def __init__(self):
        self.domain_queues = {}  # separate FIFO queue (list) per domain
        self.rate_limiter = SlidingWindowRateLimiter()
    
    def get_next_batch(self, batch_size=1000, per_domain=10):
        urls = []
        for domain, queue in self.domain_queues.items():
            if not self.rate_limiter.can_crawl(domain):
                continue
            # Take up to per_domain URLs and remove them from the queue,
            # so the same URLs aren't handed out twice
            urls.extend(queue[:per_domain])
            del queue[:per_domain]
            if len(urls) >= batch_size:
                break
        return urls[:batch_size]
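
For completeness, here's one way new URLs could be routed into the per-domain queues. This is a minimal sketch; the add_url helper is illustrative, not our production interface:

from urllib.parse import urlparse

def add_url(frontier, url):
    # Group URLs by domain so politeness limits apply per site
    domain = urlparse(url).netloc
    frontier.domain_queues.setdefault(domain, []).append(url)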

2. AWS Batch for Parallel Processing

We leveraged AWS Batch to process URLs in parallel (a sketch of job submission follows the list):

  • Job Queues: Separate queues for different priorities
  • Compute Environments: Auto-scaling based on queue depth
  • Docker Containers: Isolated, reproducible crawling environments
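
Here's roughly how a batch of URLs gets handed to AWS Batch with boto3. The queue and job-definition names below are hypothetical; the real ones live in our infrastructure config:

import boto3

batch = boto3.client("batch")

def submit_crawl_job(urls, priority="standard"):
    # Hypothetical names: one job queue per priority, one Docker-based job definition
    response = batch.submit_job(
        jobName="crawl-batch",
        jobQueue=f"crawler-{priority}",
        jobDefinition="crawler-job-def",
        containerOverrides={
            "environment": [{"name": "URLS", "value": ",".join(urls)}],
        },
    )
    return response["jobId"]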

3. BFS Algorithm for Crawling

We implemented breadth-first search (BFS) for site traversal, so pages closest to the entry point are crawled first:

from collections import deque

def crawl_site(start_url):
    visited = set()
    queue = deque([start_url])
    
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
            
        visited.add(url)
        links = extract_links(url)  # fetch the page and parse its outbound links
        queue.extend(l for l in links if l not in visited)
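
The extract_links helper above is where the actual fetching happens. A simplified version using BeautifulSoup from our stack might look like this (error handling, robots.txt checks, and URL filtering omitted):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links(url):
    # Download the page and resolve every anchor href to an absolute URL
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]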

4. Sliding Window Rate Limiter

To avoid getting blocked, we cap how many requests each domain receives within a sliding time window:

import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    def __init__(self, max_requests=10, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = defaultdict(deque)  # domain -> timestamps of recent requests
    
    def can_crawl(self, domain):
        now = time.time()
        # Evict timestamps that have slid out of the window
        while (self.requests[domain] and 
               self.requests[domain][0] < now - self.window):
            self.requests[domain].popleft()
        
        if len(self.requests[domain]) < self.max_requests:
            self.requests[domain].append(now)  # record the request we're about to make
            return True
        return False
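
A quick usage sketch; batch_of_urls, fetch_page, and requeue are hypothetical stand-ins for the real crawl loop:

from urllib.parse import urlparse

limiter = SlidingWindowRateLimiter(max_requests=10, window_seconds=60)

for url in batch_of_urls:          # hypothetical iterable of URLs
    domain = urlparse(url).netloc
    if limiter.can_crawl(domain):  # records the request when it returns True
        fetch_page(url)            # hypothetical fetch helper
    else:
        requeue(url)               # hypothetical: defer to a later batch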

Results

  • πŸ“ˆ ~3x Throughput: 58K β†’ 165K URLs/hour
  • πŸš€ 16.5K Domains/hour: Domain ingestion rate
  • πŸ’° Cost Efficient: Pay only for what we use
  • πŸ›‘οΈ Resilient: Failures don't cascade

Architecture Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Domain    │────▢│ URL Frontier │────▢│  AWS Batch  β”‚
β”‚  Ingestion  β”‚     β”‚   Service    β”‚     β”‚   Workers   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚                     β”‚
                            β–Ό                     β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  PostgreSQL  β”‚     β”‚     S3      β”‚
                    β”‚   Metadata   β”‚     β”‚   Content   β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Tech Stack

  • Backend: Python, Ruby on Rails
  • Queue: AWS Kinesis, SQS
  • Processing: AWS Batch, Docker
  • Storage: PostgreSQL, S3
  • Crawling: Playwright, BeautifulSoup

Key Takeaways

  • Parallelize Everything: Use available resources
  • Rate Limiting is Critical: Respect website limits
  • Idempotency Matters: Handle retries gracefully
  • Monitor Queue Depth: Auto-scale based on backlog (see the sketch below)
  • Separate Concerns: Different services for different tasks
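
To illustrate the queue-depth point, here's roughly how a backlog check against SQS could drive scaling with boto3. The CRAWLER_QUEUE_URL constant and the one-worker-per-1,000-URLs rule are hypothetical:

import boto3

sqs = boto3.client("sqs")

def backlog_depth(queue_url):
    # ApproximateNumberOfMessages = URLs waiting to be processed
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# Hypothetical scaling rule: one worker per 1,000 queued URLs
desired_workers = max(1, backlog_depth(CRAWLER_QUEUE_URL) // 1000)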

Future Improvements

  • Machine learning for intelligent crawl scheduling
  • Distributed crawling across multiple regions
  • Real-time priority adjustments
  • Content change detection

Building scalable systems is about making smart architectural choices and leveraging the right tools. AWS Batch gave us the elasticity we needed, while proper rate limiting ensured we were good citizens of the web.

Tags: scalability, aws, python, web scraping, system design