The Challenge at G2
When I joined G2, our web crawler was processing 58,000 URLs per hour. We needed to scale to handle millions of domains efficiently.
Initial Architecture Problems
- Monolithic Design: Single process handling everything
- No Rate Limiting: Getting blocked by websites
- Sequential Processing: Not utilizing available resources
- Poor Error Handling: Failures cascaded
The Redesign
1. URL Frontier Service
We built a sophisticated queue management system:
```python
from collections import defaultdict, deque

class URLFrontier:
    def __init__(self):
        self.domain_queues = defaultdict(deque)  # separate queue per domain
        self.rate_limiter = SlidingWindowRateLimiter()

    def get_next_batch(self, batch_size=1000, per_domain=10):
        # Interleave domains so no single site dominates a batch.
        urls = []
        for domain, queue in self.domain_queues.items():
            if not self.rate_limiter.can_crawl(domain):
                continue
            # Consume URLs from the queue so they aren't handed out twice.
            for _ in range(min(per_domain, len(queue))):
                urls.append(queue.popleft())
            if len(urls) >= batch_size:
                break
        return urls[:batch_size]
```
2. AWS Batch for Parallel Processing
We leveraged AWS Batch to process URLs in parallel:
- Job Queues: Separate queues for different priorities
- Compute Environments: Auto-scaling based on queue depth
- Docker Containers: Isolated, reproducible crawling environments
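AWS Batch scales a managed compute environment between a minimum and maximum vCPU count; the service's exact scheduling policy is internal to AWS, but the "scale on queue depth" idea can be sketched as a simple target calculation. The function name, throughput constant, and bounds below are illustrative assumptions, not values from our production config:

```python
def desired_vcpus(queue_depth, urls_per_vcpu_hour=2000, min_vcpus=0, max_vcpus=256):
    """Map URL backlog size to a vCPU target, clamped to the environment's bounds."""
    if queue_depth == 0:
        return min_vcpus
    needed = -(-queue_depth // urls_per_vcpu_hour)  # ceiling division
    return max(min_vcpus, min(needed, max_vcpus))
```

Scaling to zero when the queue drains is what makes the pay-for-what-you-use economics work.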
3. BFS Algorithm for Crawling
We implemented breadth-first search for efficient site traversal:
```python
from collections import deque

def crawl_site(start_url):
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = extract_links(url)  # fetch the page and parse outbound links
        queue.extend(l for l in links if l not in visited)
    return visited
```
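To see the traversal order BFS produces, here is a self-contained sketch that swaps the network fetch for an in-memory link map (the `LINKS` dict and `get_links` parameter are assumptions for illustration; the real crawler resolves links over HTTP):

```python
from collections import deque

# In-memory link map standing in for real HTTP fetches.
LINKS = {
    "/": ["/pricing", "/about"],
    "/pricing": ["/", "/plans"],
    "/about": ["/team"],
    "/plans": [],
    "/team": ["/about"],
}

def crawl_site(start_url, get_links):
    visited = set()
    order = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        queue.extend(l for l in get_links(url) if l not in visited)
    return order

crawl_site("/", LINKS.__getitem__)
# → ['/', '/pricing', '/about', '/plans', '/team']
```

Note the breadth-first property: both pages one hop from the root are visited before any page two hops away.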
4. Sliding Window Rate Limiter
To avoid getting blocked:
```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    def __init__(self, max_requests=10, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = defaultdict(deque)  # domain -> request timestamps

    def can_crawl(self, domain):
        now = time.time()
        # Evict timestamps that have slid out of the window
        while (self.requests[domain] and
               self.requests[domain][0] <= now - self.window):
            self.requests[domain].popleft()
        return len(self.requests[domain]) < self.max_requests

    def record_request(self, domain):
        # Callers must record each crawl, or the window never fills
        self.requests[domain].append(time.time())
```
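To make the window behavior concrete, here is a self-contained variant of the limiter with an injectable clock. The `now_fn` parameter is an assumption added purely so the example runs deterministically; the production class reads `time.time()` directly:

```python
import time
from collections import defaultdict, deque

class SlidingWindowRateLimiter:
    def __init__(self, max_requests=10, window_seconds=60, now_fn=time.time):
        self.max_requests = max_requests
        self.window = window_seconds
        self.requests = defaultdict(deque)
        self.now_fn = now_fn  # injectable clock, for deterministic examples

    def can_crawl(self, domain):
        now = self.now_fn()
        q = self.requests[domain]
        while q and q[0] <= now - self.window:  # evict expired timestamps
            q.popleft()
        return len(q) < self.max_requests

    def record_request(self, domain):
        self.requests[domain].append(self.now_fn())

clock = [0.0]
limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=60,
                                   now_fn=lambda: clock[0])
limiter.record_request("example.com")
limiter.record_request("example.com")
assert not limiter.can_crawl("example.com")  # window is full at t=0
clock[0] = 61.0
assert limiter.can_crawl("example.com")      # both timestamps expired
```

Unlike a fixed-bucket limiter, the sliding window never allows a burst of 2× the limit across a bucket boundary.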
Results
- 🚀 3x Performance: 58K → 165K URLs/hour
- 📈 16.5K Domains/hour: Domain ingestion rate
- 💰 Cost Efficient: Pay only for what we use
- 🛡️ Resilient: Failures don't cascade
Architecture Diagram
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Domain    │────▶│ URL Frontier │────▶│  AWS Batch   │
│  Ingestion   │     │   Service    │     │   Workers    │
└──────────────┘     └──────────────┘     └──────────────┘
                            │                     │
                            ▼                     ▼
                     ┌──────────────┐      ┌──────────────┐
                     │  PostgreSQL  │      │      S3      │
                     │   Metadata   │      │   Content    │
                     └──────────────┘      └──────────────┘
```
Tech Stack
- Backend: Python, Ruby on Rails
- Queue: AWS Kinesis, SQS
- Processing: AWS Batch, Docker
- Storage: PostgreSQL, S3
- Crawling: Playwright, BeautifulSoup
Key Takeaways
Future Improvements
- Machine learning for intelligent crawl scheduling
- Distributed crawling across multiple regions
- Real-time priority adjustments
- Content change detection