The Challenge at G2
When I joined G2, our web crawler was processing 58,000 URLs per hour. We needed to scale to handle millions of domains efficiently.
Initial Architecture Problems
- Monolithic Design: Single process handling everything
- No Rate Limiting: Getting blocked by websites
- Sequential Processing: Not utilizing available resources
- Poor Error Handling: Failures cascaded
The Redesign
1. URL Frontier Service
We built a sophisticated queue management system:
```python
from collections import defaultdict, deque

class URLFrontier:
    def __init__(self):
        self.domain_queues = defaultdict(deque)  # separate queue per domain
        self.rate_limiter = SlidingWindowRateLimiter()

    def get_next_batch(self, batch_size=1000, per_domain=10):
        # Interleave domains so no single site dominates a batch.
        urls = []
        for domain, queue in self.domain_queues.items():
            if not self.rate_limiter.can_crawl(domain):
                continue
            # Consume URLs from the queue so they aren't handed out twice.
            for _ in range(min(per_domain, len(queue))):
                urls.append(queue.popleft())
            if len(urls) >= batch_size:
                break
        return urls[:batch_size]
```
2. AWS Batch for Parallel Processing
We leveraged AWS Batch to process URLs in parallel:
- Job Queues: Separate queues for different priorities
- Compute Environments: Auto-scaling based on queue depth
- Docker Containers: Isolated, reproducible crawling environments
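AWS Batch scales a managed compute environment between a minimum and maximum vCPU count; the service's exact scheduling policy is internal to AWS, but the "scale on queue depth" idea can be sketched as a simple target calculation. The function name, throughput constant, and bounds below are illustrative assumptions, not values from our production config:

```python
def desired_vcpus(queue_depth, urls_per_vcpu_hour=2000, min_vcpus=0, max_vcpus=256):
    """Map URL backlog size to a vCPU target, clamped to the environment's bounds."""
    if queue_depth == 0:
        return min_vcpus
    needed = -(-queue_depth // urls_per_vcpu_hour)  # ceiling division
    return max(min_vcpus, min(needed, max_vcpus))
```

Scaling to zero when the queue drains is what makes the pay-for-what-you-use economics work.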
3. BFS Algorithm for Crawling
We implemented breadth-first search for efficient site traversal:
```python
from collections import deque

def crawl_site(start_url):
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = extract_links(url)  # fetch the page and parse outbound links
        queue.extend(l for l in links if l not in visited)
    return visited
```
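To see the traversal order BFS produces, here is a self-contained sketch that swaps the network fetch for an in-memory link map (the `LINKS` dict and `get_links` parameter are assumptions for illustration; the real crawler resolves links over HTTP):

```python
from collections import deque

# In-memory link map standing in for real HTTP fetches.
LINKS = {
    "/": ["/pricing", "/about"],
    "/pricing": ["/", "/plans"],
    "/about": ["/team"],
    "/plans": [],
    "/team": ["/about"],
}

def crawl_site(start_url, get_links):
    visited = set()
    order = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        queue.extend(l for l in get_links(url) if l not in visited)
    return order

crawl_site("/", LINKS.__getitem__)
# → ['/', '/pricing', '/about', '/plans', '/team']
```

Note the breadth-first property: both pages one hop from the root are visited before any page two hops away.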
4. Sliding Window Rate Limiter
To avoid getting blocked:
```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    def __init__(self, max_requests=10, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = defaultdict(deque)  # domain -> request timestamps

    def can_crawl(self, domain):
        now = time.time()
        # Evict timestamps that have slid out of the window
        while (self.requests[domain] and
               self.requests[domain][0] <= now - self.window):
            self.requests[domain].popleft()
        return len(self.requests[domain]) < self.max_requests

    def record_request(self, domain):
        # Callers must record each crawl, or the window never fills
        self.requests[domain].append(time.time())
```
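To make the window behavior concrete, here is a self-contained variant of the limiter with an injectable clock. The `now_fn` parameter is an assumption added purely so the example runs deterministically; the production class reads `time.time()` directly:

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    def __init__(self, max_requests=10, window_seconds=60, now_fn=time.time):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = defaultdict(deque)
        self.now_fn = now_fn  # injectable clock, for deterministic examples

    def can_crawl(self, domain):
        now = self.now_fn()
        q = self.requests[domain]
        while q and q[0] <= now - self.window:  # evict expired timestamps
            q.popleft()
        return len(q) < self.max_requests

    def record_request(self, domain):
        self.requests[domain].append(self.now_fn())

clock = [0.0]
limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=60,
                                   now_fn=lambda: clock[0])
limiter.record_request("example.com")
limiter.record_request("example.com")
assert not limiter.can_crawl("example.com")  # window is full at t=0
clock[0] = 61.0
assert limiter.can_crawl("example.com")      # both timestamps expired
```

Unlike a fixed-bucket limiter, the sliding window never allows a burst of 2× the limit across a bucket boundary.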
Results
- 🚀 3x Performance: 58K → 165K URLs/hour
- 📈 16.5K Domains/hour: Domain ingestion rate
- 💰 Cost Efficient: Pay only for what we use
- 🛡️ Resilient: Failures don't cascade
Architecture Diagram
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Domain    │────▶│ URL Frontier │────▶│  AWS Batch   │
│  Ingestion   │     │   Service    │     │   Workers    │
└──────────────┘     └──────────────┘     └──────────────┘
                            │                     │
                            ▼                     ▼
                     ┌──────────────┐      ┌──────────────┐
                     │  PostgreSQL  │      │      S3      │
                     │   Metadata   │      │   Content    │
                     └──────────────┘      └──────────────┘
```
Tech Stack
- Backend: Python, Ruby on Rails
- Queue: AWS Kinesis, SQS
- Processing: AWS Batch, Docker
- Storage: PostgreSQL, S3
- Crawling: Playwright, BeautifulSoup
Key Takeaways
Future Improvements
- Machine learning for intelligent crawl scheduling
- Distributed crawling across multiple regions
- Real-time priority adjustments
- Content change detection