1. Hook #
Google processes 8.5 billion searches per day (roughly 98 000 queries per second on average, with peaks above 100 000) and returns results in under 200 ms. Behind that sub-second response is a pipeline that never fully stops: a web crawler perpetually downloading ~20 billion pages, a MapReduce-scale indexing system converting raw HTML into a compressed inverted index, a multi-stage ranking pipeline that scores hundreds of signals in milliseconds, and a serving layer that shards the index across thousands of machines so no single query touches more than a fraction of the corpus. Building a search engine from scratch is perhaps the canonical “design a distributed system” problem because it combines almost every hard problem in the field: distributed crawling, large-scale data processing, near-real-time index updates, low-latency high-throughput query serving, and machine learning (ML)-based ranking. Even a simplified version at 1/1000th of Google’s scale teaches you more about distributed systems than almost any other exercise.
2. Problem Statement #
Functional Requirements #
- Users can submit a text query and receive a ranked list of relevant web pages within 200 ms.
- The crawler continuously discovers and fetches new and updated web pages.
- The index reflects new content within minutes for breaking news, within hours for general content.
- Results include a title, snippet (summary excerpt), and URL for each page.
- Basic query operators are supported: phrase search (`"exact phrase"`), exclusion (`-term`), site filter (`site:example.com`).
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Query latency (p99) | < 200 ms end-to-end |
| Index freshness (news tier) | < 15 min from publish to indexed |
| Index freshness (general web) | < 24 hours |
| Crawler politeness | Respect robots.txt; max 1 req/s per domain by default |
| Corpus size | 200 B web pages indexed |
| Query throughput | 100 000 QPS (Queries Per Second) peak |
| Availability | 99.99% |
Out of Scope #
- Image, video, or news search (different index schemas and crawlers)
- Personalised search (user history, account signals)
- Ads auction and placement
- Knowledge Graph / entity extraction
- Voice search and NLU (Natural Language Understanding)
3. Scale Estimation #
Assumptions:
- 200 B indexed pages; average page size 100 KB raw HTML → 20 PB raw HTML.
- Compressed inverted index: ~10% of raw HTML → ~2 PB index.
- Crawl rate needed to refresh 200 B pages every 30 days: 200 B / (30 × 86 400 s) ≈ 77 000 pages/s.
- Average query: 4 terms; each term hits ~500 M index postings before ranking cuts it to top 10.
- Index serving: 2 PB ÷ 500 shards ≈ 4 TB/shard (roughly 500 B postings × 8 bytes each); hot posting lists are pinned in RAM, the rest served from SSD (see the tiered-storage note in the scaling path).
- Result snippets: pre-computed and stored separately (~200 bytes/page → 40 TB).
| Metric | Calculation | Result |
|---|---|---|
| Crawl write QPS | 77 000 pages/s × 100 KB | ~7.7 GB/s raw ingest |
| Index storage | 2 PB compressed inverted index | 500 shards × ~4 TB each |
| Query QPS | 8.5 B/day / 86 400 s | ~98 000 QPS |
| Index reads per query | 4 terms × 500 shards (fan-out) | 2 000 parallel shard reads |
| Snippet store | 200 B pages × 200 bytes | ~40 TB |
| PageRank (PR) recomputation | Full graph: 200 B nodes, ~1 T edges | Hours (offline MapReduce) |
| Cache size (hot queries, 5% of QPS) | 5 000 QPS × 500 bytes/result | ~2.5 MB/s → 100 GB LRU |
The key insight: the query fan-out (2 000 parallel shard reads per query) means latency is dominated by the slowest shard, not average shard latency. This drives the hedge-and-cancel (backup request) pattern.
4. High-Level Design #
A search engine has four distinct pipelines: crawl, process/index, rank, and serve. Crawl and index are offline/near-real-time bulk pipelines; serve is a latency-critical online system.
```mermaid
flowchart TD
subgraph Crawl["Crawl Pipeline"]
URL[URL Frontier\npriority queue]
FE[Fetcher Fleet\ndistributed HTTP crawlers]
DS[Duplicate Store\nSimhash dedup]
RS[Raw Store\nGCS / HDFS]
end
subgraph Index["Index Pipeline (batch + streaming)"]
PA[HTML Parser\nlink extractor]
TK[Tokeniser &\nText Analyser]
II[Inverted Index Builder\nMapReduce / Dataflow]
IS[(Index Shards\nBigtable / custom)]
SN[(Snippet Store\npre-computed summaries)]
end
subgraph Rank["Ranking Pipeline (offline)"]
LG[Link Graph Builder]
PR[PageRank / TrustRank\noffline MapReduce]
RS2[(Rank Store\ndoc → score)]
end
subgraph Serve["Query Serving"]
QP[Query Parser\ntokenise, expand, operators]
FN[Fan-out Layer\nscatter to index shards]
IS
MG[Merge & Score\nBM25 + PR + ML ranker]
SN
QC[Query Cache\nRedis LRU]
API[Search API\nJSON response]
end
URL --> FE
FE -->|raw HTML| DS
DS -->|unique| RS
RS --> PA
PA -->|tokens| TK
TK --> II
II --> IS
PA -->|links| LG
LG --> PR
PR --> RS2
II --> SN
API --> QC
QC -->|miss| QP
QP --> FN
FN -->|parallel reads| IS
IS -->|top-K posting lists| MG
MG -->|doc IDs| SN
RS2 -->|PR scores| MG
MG --> API
```
Component Roles #
| Component | Responsibility | Key Choice |
|---|---|---|
| URL Frontier | Priority queue of URLs to crawl; enforces per-domain politeness | Bloom filter for visited URLs; priority by PageRank × freshness score |
| Fetcher Fleet | Download pages; respect robots.txt and crawl-delay | Async I/O (Netty / async-http-client); distributed across regions |
| Duplicate Store | Near-duplicate detection before storing raw HTML | Simhash fingerprint; 64-bit hash with Hamming distance ≤ 3 = duplicate |
| Inverted Index | Maps every token → sorted posting list of (doc_id, tf, positions) | Sharded by term hash; built via MapReduce; served from RAM |
| PageRank Store | Pre-computed authority score per doc; updated offline daily | Iterative graph algorithm (Pregel / Spark GraphX); ~50 iterations to converge |
| Merge & Score | Intersect/union posting lists; apply BM25 + PR + ML signals; top-K selection | Two-phase: coarse BM25 (top 1 000) → ML re-ranker (top 10) |
5. Deep Dive — Critical Components #
5a. Web Crawler #
The crawler is a distributed system in its own right. The URL Frontier is a priority queue with two orthogonal constraints:
- Politeness: Never send more than 1 request per second to any single domain. Implemented as per-domain back-off queues: a URL for `example.com` is not dequeued until 1 s has elapsed since the last `example.com` fetch.
- Priority: Higher-PageRank domains get crawled more frequently. A blog updated daily needs a fresh crawl daily; a static Wikipedia article needs one only weekly.
Near-duplicate detection uses Simhash: the page’s text is split into shingles (overlapping N-grams), each hashed, and the hashes combined into a single 64-bit fingerprint. Two pages with Hamming distance ≤ 3 are considered near-duplicates and only the canonical version is indexed. This eliminates ~30% of the web corpus.
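To make the fingerprinting step concrete, here is a minimal Simhash sketch in Java. The shingling, the FNV-1a hash, and the fixed 4-word shingle size are illustrative assumptions rather than a specific production implementation; any stable 64-bit hash works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal Simhash sketch: fold the hashes of all shingles into one 64-bit fingerprint.
final class Simhash {

    // Overlapping word shingles (4-word windows here; the window size is an assumption).
    static List<String> shingles(String text, int size) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return out;
    }

    // For each of the 64 bit positions, count how many shingle hashes set that bit;
    // the fingerprint bit is 1 if more shingles set it than clear it.
    static long fingerprint(String text) {
        int[] counts = new int[64];
        for (String shingle : shingles(text, 4)) {
            long h = fnv1a64(shingle);
            for (int bit = 0; bit < 64; bit++) {
                counts[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fp = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (counts[bit] > 0) fp |= 1L << bit;
        }
        return fp;
    }

    // Two pages are near-duplicates if their fingerprints differ in at most 3 bit positions.
    static boolean nearDuplicate(long a, long b) {
        return Long.bitCount(a ^ b) <= 3;
    }

    // FNV-1a: a simple, stable 64-bit string hash.
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```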
Returning to the frontier: a simplified sketch with one back-off queue per domain (politeness only; prioritisation across domains and persistence are omitted).

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;
import java.util.PriorityQueue;
import java.util.concurrent.ConcurrentHashMap;

record CrawlTask(String url, int priority, Instant notBefore) {}

// Simplified frontier: one back-off queue per domain, highest-priority URL first.
// Note: the per-domain PriorityQueues are not thread-safe; a production frontier
// needs locking or sharded worker ownership of domains.
class URLFrontier {
    private final Map<String, PriorityQueue<CrawlTask>> domainQueues = new ConcurrentHashMap<>();
    private final Map<String, Instant> lastFetchTime = new ConcurrentHashMap<>();

    public void add(String domain, CrawlTask task) {
        domainQueues
            .computeIfAbsent(domain, d -> new PriorityQueue<>(
                Comparator.comparingInt(CrawlTask::priority).reversed()))
            .add(task);
    }

    // Return a task from any domain whose 1 s politeness window has elapsed.
    public Optional<CrawlTask> next() {
        Instant now = Instant.now();
        for (var entry : domainQueues.entrySet()) {
            String domain = entry.getKey();
            Instant ready = lastFetchTime.getOrDefault(domain, Instant.EPOCH).plusSeconds(1);
            var q = entry.getValue();
            if (now.isAfter(ready) && !q.isEmpty() && !now.isBefore(q.peek().notBefore())) {
                lastFetchTime.put(domain, now);   // start the next 1 s politeness window
                return Optional.of(q.poll());
            }
        }
        return Optional.empty();   // nothing ready yet; caller retries shortly
    }
}
```

5b. Inverted Index #
The inverted index maps each token to a posting list: a sorted array of (doc_id, term_frequency, [positions]) entries. For the query “distributed systems”, the engine fetches the posting lists for both terms and intersects them (AND query) or unions them (OR query).
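Because posting lists are sorted by doc_id, an AND query reduces to a linear-time merge intersection of sorted arrays. A minimal sketch follows; the `Posting` record and in-memory lists are illustrative, since real postings are compressed and streamed.

```java
import java.util.ArrayList;
import java.util.List;

record Posting(long docId, int termFrequency) {}

final class PostingListOps {
    // Linear-time intersection of two doc_id-sorted posting lists (AND query).
    static List<Posting> intersect(List<Posting> a, List<Posting> b) {
        List<Posting> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            long da = a.get(i).docId(), db = b.get(j).docId();
            if (da == db) {
                // Document contains both terms; keep it (tf values feed the BM25 scorer later).
                out.add(a.get(i));
                i++; j++;
            } else if (da < db) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```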
Index construction uses a MapReduce pipeline:
- Map phase: For each (doc_id, raw_text) pair, emit (token, (doc_id, tf, positions)).
- Reduce phase: For each token, collect all (doc_id, tf, positions) entries, sort by doc_id, and compress with delta encoding: store differences between consecutive doc_ids rather than absolute IDs. The deltas are small integers, compressible with a variable-length encoding such as VByte.
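A minimal sketch of the delta + VByte step (encoder only; decoding reverses the two steps). The class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;

final class PostingCompression {
    // Delta-encode a doc_id-sorted posting list, then VByte-encode each gap.
    static byte[] compressDocIds(long[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long docId : sortedDocIds) {
            writeVByte(docId - previous, out);  // gaps are small, so most take 1-2 bytes
            previous = docId;
        }
        return out.toByteArray();
    }

    // VByte: 7 payload bits per byte; the high bit marks the final byte of a value.
    private static void writeVByte(long value, ByteArrayOutputStream out) {
        while (value >= 128) {
            out.write((int) (value & 0x7F));
            value >>>= 7;
        }
        out.write((int) (value | 0x80));  // terminator byte has the high bit set
    }
}
```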
A 200 B page corpus generates roughly 2 PB of compressed index data, sharded across 500 machines. Each shard holds the posting lists for a disjoint subset of tokens (sharded by hash(token) % 500).
Index updates: Batch MapReduce runs nightly for the general web tier. For news/freshness tier, a streaming pipeline (Kafka + Dataflow) produces incremental index segments merged into the live index every 15 minutes using a log-structured merge (LSM) approach.
5c. Query Processing and Fan-out #
When a query arrives:
- Parse: Tokenise, apply stemming/lemmatisation, expand synonyms, detect operators (`site:`, `-`, `""`).
- Fan-out (scatter): Send each query term to its responsible index shard in parallel. With 4-term queries across 500 shards, each query generates ~2 000 parallel RPCs (Remote Procedure Calls). The fan-out layer uses the hedge request pattern (sketched after this list): if a shard’s response is not received within 50 ms (the p95 shard latency), issue a second request to a replica and cancel the slower one when either responds. This bounds tail latency at the cost of ~5% extra load.
- Merge: Each shard returns its top-K (top 1 000) doc_ids with BM25 (Best Match 25) scores. The merge layer intersects/unions these lists and applies the two-phase ranker.
- Rank (Phase 1 — BM25): A statistical relevance score using term frequency (TF) and inverse document frequency (IDF). Eliminates most docs; retains top 1 000.
- Rank (Phase 2 — ML re-ranker): A learned model (historically LambdaMART, now a Transformer-based model) applies hundreds of signals: PageRank, anchor text quality, freshness, user engagement signals, page speed. Produces final top 10.
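A minimal sketch of the hedge-and-cancel pattern using `CompletableFuture`. The `ShardReplica`/`ShardRequest`/`ShardResponse` types, the `query` call, and the fixed trigger are illustrative assumptions; in practice the hedge delay tracks the live p95 shard latency rather than a constant.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class HedgedShardClient {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Query the primary replica; if it has not answered within hedgeAfterMs,
    // fire the same request at a backup replica and take whichever finishes first.
    CompletableFuture<ShardResponse> hedgedQuery(ShardRequest request,
                                                 ShardReplica primary,
                                                 ShardReplica backup,
                                                 long hedgeAfterMs) {
        CompletableFuture<ShardResponse> first = primary.query(request);
        CompletableFuture<ShardResponse> result = new CompletableFuture<>();

        first.whenComplete((resp, err) -> {
            if (err == null) result.complete(resp); else result.completeExceptionally(err);
        });

        timer.schedule(() -> {
            if (!result.isDone()) {
                CompletableFuture<ShardResponse> second = backup.query(request);
                second.whenComplete((resp, err) -> {
                    if (err == null) result.complete(resp);
                });
                // Best-effort cancellation of whichever request is still outstanding
                // once a winner has completed (the RPC layer must observe the cancel).
                result.whenComplete((r, e) -> { first.cancel(true); second.cancel(true); });
            }
        }, hedgeAfterMs, TimeUnit.MILLISECONDS);

        return result;
    }

    // Placeholder types; in a real system these wrap the shard RPC stubs.
    interface ShardReplica { CompletableFuture<ShardResponse> query(ShardRequest r); }
    record ShardRequest(String term) {}
    record ShardResponse(long[] topDocIds, float[] scores) {}
}
```

The same wrapper applies to every shard RPC in the scatter step, which is what keeps the 2 000-way fan-out inside the 200 ms budget.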
5d. PageRank at Scale #
PageRank models the web as a directed graph where each page’s score is the weighted sum of the scores of pages linking to it: PR(p) = (1 − d)/N + d × Σ PR(q)/outdegree(q), summed over every page q that links to p, with damping factor d ≈ 0.85 and N the total page count. The iteration typically converges in ~50 rounds. At 200 B nodes and ~1 T edges, this requires a distributed graph-processing framework.
Google’s original implementation was a MapReduce job that ran for hours. Modern implementations use Pregel (Google’s vertex-centric graph processing model, now open-sourced as Apache Giraph / Spark GraphX): each vertex holds its current rank and sends messages to its neighbours. Convergence is detected when the change in all vertex scores drops below a threshold.
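A minimal sketch of one vertex-centric superstep in the Pregel style, written as plain Java over an in-memory adjacency list. The real computation runs distributed with message passing between workers; the class name and the treatment of dangling pages are illustrative simplifications.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PageRankStep {
    // One synchronous superstep: every page sends rank/outdegree to its out-neighbours,
    // then each page folds the received mass into the damped PageRank formula.
    // Dangling pages (no outlinks) are ignored here; production code redistributes their mass.
    static Map<Long, Double> superstep(Map<Long, List<Long>> outLinks,
                                       Map<Long, Double> rank,
                                       double damping) {
        int n = rank.size();
        Map<Long, Double> incoming = new HashMap<>();
        for (var entry : outLinks.entrySet()) {
            double share = rank.getOrDefault(entry.getKey(), 0.0)
                           / Math.max(1, entry.getValue().size());
            for (long target : entry.getValue()) {
                incoming.merge(target, share, Double::sum);
            }
        }
        Map<Long, Double> next = new HashMap<>();
        for (long page : rank.keySet()) {
            next.put(page, (1.0 - damping) / n + damping * incoming.getOrDefault(page, 0.0));
        }
        return next;  // repeat until the total change across pages drops below a threshold
    }
}
```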
PageRank is recomputed daily. Between recomputations, new pages receive a provisional score based on the average score of pages linking to them (a heuristic, not exact PR).
6. Data Model #
`documents` (Raw Store — GCS / HDFS) #
Object key: `raw/{crawl_date}/{url_sha256}.gz`
| Field | Type | Notes |
|---|---|---|
| `url` | STRING | Canonical URL after redirect chain |
| `fetched_at` | TIMESTAMP | |
| `http_status` | INT | 200 / 301 / 404 etc. |
| `raw_html` | BYTES (compressed) | Full page HTML |
| `simhash` | INT64 | 64-bit near-dup fingerprint |
| `outlinks` | STRING[] | Extracted outbound URLs |
`index_postings` (per shard — in-memory / Bigtable) #
Row key: `{token}` (within a shard that owns this token’s hash range)
| Column | Type | Notes |
|---|---|---|
| `posting_list` | BYTES | Delta-encoded, VByte-compressed array of (doc_id, tf, positions[]) |
| `doc_freq` | INT64 | Number of documents containing this term (for IDF calculation) |
| `last_updated` | TIMESTAMP | Time of last incremental merge |
`doc_metadata` (Bigtable — keyed by doc_id) #
| Column | Type | Notes |
|---|---|---|
| `url` | STRING | Canonical URL |
| `title` | STRING | Extracted `<title>` |
| `snippet` | STRING | Pre-computed 160-char summary |
| `pagerank` | FLOAT | Latest PR score |
| `language` | STRING | ISO 639-1 language code |
| `indexed_at` | TIMESTAMP | Last index update |
| `content_hash` | STRING | SHA-256 of page body; change detection |
`link_graph` (Bigtable — for PageRank) #
Row key: `{source_doc_id}`
| Column | Type | Notes |
|---|---|---|
| `out_edges` | INT64[] | Target doc_ids (compressed) |
| `in_degree` | INT64 | Number of inbound links |
7. Trade-offs #
Index Sharding: by Term vs by Document #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Term-partitioned (shard by token hash) | All postings for a term on one machine; no cross-shard intersection | Hot terms (common words) create hot shards; fan-out to many shards per query | Google’s original model; good when corpus fits in RAM per shard |
| Document-partitioned (shard by doc_id) | Even load distribution; no hot shards | Must intersect posting lists across all shards for every query term | Large corpora where term-partitioned shards become too large |
Conclusion: Google uses a document-partitioned index at scale. Each shard holds all terms for a subset of documents; a query fans out to all shards, and each shard returns its local top-K. This balances load better at 200 B page scale.
Query Cache: Full Result Cache vs Posting List Cache #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Full result cache (Redis, key = normalised query) | Zero fan-out on cache hit | Low hit rate for long-tail queries; stale results | Head queries (top 5% account for ~80% of traffic) |
| Posting list cache (cache hot posting lists in RAM on each shard) | Helps all queries containing hot terms | Still requires merge and rank on hit | Universal — always worth doing |
Conclusion: Both layers. Cache full results for top-1000 queries (very high hit rate). Cache hot posting lists in RAM on each index shard to speed up the fan-out step for the long tail.
Freshness vs Throughput: Streaming vs Batch Index Updates #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Batch MapReduce (nightly) | High throughput; simple operational model | 24-hour freshness lag | General web tier |
| Streaming (Kafka + Dataflow, 15 min segments) | Near-real-time freshness | Operational complexity; segment merging overhead | News / freshness tier |
| Real-time (per-page update on crawl) | Immediate index update | Very high write amplification; hard to maintain index quality | Breaking news only — maintained as a separate “freshness index” |
Conclusion: Tiered freshness. A small “freshness index” (top ~1 B pages, updated continuously) is merged with the main index at query time. The main index is rebuilt nightly via batch pipeline.
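A minimal sketch of the query-time merge between the freshness tier and the main index. The result types are illustrative, and the de-duplication rule (the freshness copy wins for the same doc_id) is one reasonable policy, not the only one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TieredIndexMerger {
    record ScoredDoc(long docId, double score) {}

    // Query both tiers, prefer the freshness tier's copy of a document, and
    // return a single score-ordered list for the ranker.
    static List<ScoredDoc> merge(List<ScoredDoc> freshnessHits, List<ScoredDoc> mainHits, int k) {
        Set<Long> seen = new HashSet<>();
        List<ScoredDoc> combined = new ArrayList<>();
        for (ScoredDoc d : freshnessHits) {
            if (seen.add(d.docId())) combined.add(d);   // freshness copy wins on conflict
        }
        for (ScoredDoc d : mainHits) {
            if (seen.add(d.docId())) combined.add(d);   // main-index copy only if not superseded
        }
        combined.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
        return combined.subList(0, Math.min(k, combined.size()));
    }
}
```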
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Index shard crash | Queries missing postings for that shard's token range | Degraded result quality; missing documents | Each shard has ≥ 2 replicas; fan-out load-balances across replicas; hedge requests detect slow shards |
| Hot term (e.g. "breaking news keyword") | Single shard responsible for that term overwhelmed | Latency spike for all queries containing that term | Replicate hot-term postings to additional shard replicas; route queries round-robin across replicas; cache full posting list in RAM |
| Crawler overloads a domain | Domain rate-limits or blocks crawler IP range | Stale index for that domain | Strict per-domain politeness (1 req/s); honour robots.txt Crawl-delay; exponential back-off on 429/503 responses |
| Index pipeline failure (batch job) | Nightly index rebuild fails partway through | Stale index served; no freshness degradation if old index kept live | Blue-green index deployment: new index built offline, atomically swapped into serving when build completes and passes quality checks |
| Query fan-out slow shard (tail latency) | One of 500 fan-out RPCs takes 500 ms | Entire query latency blown | Hedge request after 50 ms; cancel slower twin; serves result from whichever replica responds first |
| Spam / link farm injection | Low-quality pages rank highly via artificial link schemes | Degraded result quality | TrustRank (seed from known-good domains); spam classifiers on crawled content; manual quality rater feedback loop |
9. Security & Compliance #
Bot Detection & Crawler Identity:
- The crawler identifies itself via a known User-Agent string (`Googlebot`) and an IP range verifiable through DNS (Domain Name System): websites can confirm crawler legitimacy with a reverse-DNS lookup on the requesting IP.
- The crawler strictly honours `robots.txt`, fetching it first on every domain before any other page. Violations are logged and treated as bugs.
Query AuthN/AuthZ (Authentication / Authorization):
- Anonymous queries are allowed for the public web search product. Rate limiting is applied per IP with a token bucket (100 queries/min for anonymous users; higher quotas for authenticated API users); a minimal sketch follows this list.
- Search API (Programmable Search Engine) requires OAuth 2.0 API keys with per-key QPS quotas.
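A minimal token-bucket sketch for the per-IP limit; the capacity and refill rate map to the 100 queries/min figure above, and the class name is illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PerIpRateLimiter {
    private static final double CAPACITY = 100.0;                  // burst size: 100 queries
    private static final double REFILL_PER_MS = 100.0 / 60_000.0;  // 100 tokens per minute

    private static final class Bucket {
        double tokens = CAPACITY;
        long lastRefillMs = System.currentTimeMillis();
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    // Returns true if this IP may issue one more query right now.
    boolean allow(String ip) {
        Bucket b = buckets.computeIfAbsent(ip, k -> new Bucket());
        synchronized (b) {
            long now = System.currentTimeMillis();
            b.tokens = Math.min(CAPACITY, b.tokens + (now - b.lastRefillMs) * REFILL_PER_MS);
            b.lastRefillMs = now;
            if (b.tokens >= 1.0) {
                b.tokens -= 1.0;
                return true;
            }
            return false;   // over quota: reject or queue the request
        }
    }
}
```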
Data Privacy:
- Query logs are retained for 18 months (reduced from permanent after EU regulatory pressure); anonymised after 9 months (cookie/IP association removed).
- Right to be Forgotten (EU GDPR Article 17): search results linking to specific pages can be de-indexed via a verified removal request. Removal is applied to the serving layer as a blocklist — the raw index is not rebuilt.
Content Safety:
- SafeSearch filters are applied at the ranking layer: a classifier scores pages for adult/violent content; filtered pages are suppressed for SafeSearch-on queries.
- CSAM (Child Sexual Abuse Material) detection: PhotoDNA-style perceptual hashing on crawled images; automatic de-indexing and reporting to NCMEC (National Center for Missing and Exploited Children).
Encryption:
- All internal RPC (Remote Procedure Call) traffic between crawler, indexer, and serving is over mTLS (mutual TLS). Query results served over TLS 1.3.
- Raw HTML store (GCS/HDFS) encrypted at rest with AES-256.
10. Observability #
RED Metrics #
| Signal | Metric | Alert Threshold |
|---|---|---|
| Query rate | `search_queries_total` | < 70% of 7-day baseline (index outage signal) |
| Query error rate | `search_errors_total / total` | > 0.01% |
| Query latency (p99) | `query_duration_seconds` | > 200 ms |
| Fan-out latency (p99 per shard) | `shard_rtt_seconds` | > 100 ms |
| Crawl rate | `pages_fetched_per_second` | < 50% of target (crawler issue) |
| Index freshness lag | `index_lag_seconds` (time since page published) | > 30 min for news tier |
Saturation Metrics #
| Resource | Metric | Alert Threshold |
|---|---|---|
| Index shard RAM | `shard_memory_utilisation` | > 85% (eviction risk) |
| Crawl queue depth | `url_frontier_size` | > 7-day rolling avg × 2 (backlog growing) |
| Merge layer CPU | `ranker_cpu_utilisation` | > 75% |
Business Metrics #
- Click-through rate (CTR) on top-3 results: proxy for ranking quality; significant drop signals ranking regression.
- Zero-result rate: fraction of queries returning no results; should be < 0.1%.
- Index coverage: fraction of known live URLs that are indexed; target > 99%.
Tracing #
Each query carries a search_trace_id propagated across fan-out RPCs. Slow-query traces (p99+ latency) are automatically sampled to Cloud Trace / Jaeger. Correlate with shard-level metrics to identify which shard caused tail latency.
11. Scaling Path #
Phase 1 — MVP (< 1M indexed pages) #
Single machine. SQLite full-text search (FTS5) or Elasticsearch single-node. Crawl with a simple Python script. No deduplication. Ranking: TF-IDF only. Serve queries directly from the index on the same machine.
What breaks first: Elasticsearch single-node tops out around 50M documents before search latency exceeds 200 ms. RAM becomes the bottleneck for posting list hot data.
Phase 2 — Growth (1M → 100M pages) #
Elasticsearch cluster (5–10 nodes, document-partitioned). Dedicated crawler fleet (10–50 machines). Crawl queue in Redis. Simhash dedup. Nightly index rebuild. No ML ranker — pure BM25 + PageRank. Add Redis result cache.
What breaks first: PageRank computation. At 100M pages, a single-machine PageRank job takes hours. Migrate to Spark GraphX / Pregel.
Phase 3 — Scale (100M → 10B pages) #
Custom sharded inverted index serving from RAM. Streaming index updates (Kafka + Flink) for the freshness tier. ML re-ranker (LambdaMART) added as the second ranking phase. Fan-out gateway with hedge requests. Multi-region crawler (crawl from the region nearest to the target domain). Blue-green index deployment.
What breaks first: Snippet generation at scale. Pre-compute and store snippets per (doc_id, query_cluster) rather than per raw query — cluster queries by topic and generate representative snippets offline.
Phase 4 — Hyperscale (10B → 200B+ pages) #
Tiered index: freshness shard (top 1 B frequently-updated pages) + main shard (remaining 200 B). Tiered storage: hot posting lists in DRAM, warm in NVMe SSD, cold in HDD. Neural ranking model (BERT-based) for top-10 re-rank. Knowledge Graph overlay for entity queries. Distributed link graph with Pregel at 1 T edge scale. Per-query ML feature computation (real-time signals: query freshness intent, user geography).
12. Enterprise Considerations #
Brownfield Integration:
- Enterprise search products (Google Workspace Search, Elastic Enterprise Search, Microsoft SharePoint Search) crawl internal document stores (Confluence, SharePoint, Jira, GDrive) rather than the public web. The same inverted index and BM25 ranking apply; the crawl component is replaced with API-based connectors per data source.
Build vs Buy:
- For a startup / internal search engine: Elasticsearch or OpenSearch — mature, operationally well-understood, pluggable ranking.
- For 1 B+ docs: consider Apache Solr with SolrCloud for sharding, or a custom serving layer built on Apache Lucene (the index library that underpins both Elasticsearch and Solr).
- Web crawler: Apache Nutch (open source, production-grade) or Scrapy (Python, simpler). Never build a production crawler from scratch — robots.txt compliance, dedup, and politeness are deceptively hard.
- ML ranker: LightGBM / XGBoost for LTR (Learning to Rank) on small corpora; Transformer-based models (BERT, T5) for re-ranking at scale.
- PageRank / link analysis: Apache Spark GraphX or Google’s Pregel (via Dataproc).
Multi-Tenancy:
- SaaS enterprise search: each tenant gets a logically isolated index namespace. Shared serving infrastructure but strict per-tenant data isolation (no cross-tenant posting list access).
- Noisy neighbour: a single tenant running expensive analytical queries (site: operator across millions of docs) can starve other tenants. Enforce per-tenant QPS and query complexity limits.
TCO Ballpark (at 10 B page scale):
- Index storage (10 TB compressed): ~$200/month on NVMe-backed Bigtable.
- Serving fleet (500 shards × 2 replicas × 8-core machines): ~$150K/month.
- Crawl fleet (1 000 fetcher instances): ~$30K/month.
- PageRank MapReduce (daily, 100-node Dataproc cluster, 4 hours): ~$2K/month.
- Elasticsearch alternative at 10 B docs: similar cost but higher operational complexity.
Conway’s Law Implication: Search engines almost always split into separate teams along the crawl/index/rank/serve boundary. The index format is the contract between the index team and the serving team — treat it as a versioned API and never break backward compatibility mid-deployment.
13. Interview Tips #
- Sketch the four-stage pipeline first. Crawl → index → rank → serve. Interviewers want to see that you understand the system has both an offline pipeline (building the index) and an online system (serving queries). Conflating them is the most common mistake.
- Explain the inverted index before any other data structure. Every search system worth designing is built on an inverted index. Know what a posting list is, how it is compressed (delta encoding + VByte), and why it is sorted by doc_id (enables merge intersection in linear time).
- Don’t forget politeness. Candidates often design a crawler that would DDoS every website it visits. Know about robots.txt, Crawl-delay, per-domain rate limiting, and exponential back-off. Interviewers at companies with real crawlers will probe this.
- Nail the fan-out latency problem. “How do you keep query latency under 200 ms when you fan out to 500 shards?” The answer is hedge requests (backup requests to slow shards) + serving the top-K from each shard rather than full posting lists. This demonstrates operational sophistication.
- Vocabulary that signals fluency: inverted index, posting list, TF-IDF (Term Frequency–Inverse Document Frequency), BM25, PageRank, Simhash, Bloom filter, delta encoding, VByte compression, URL Frontier, politeness budget, hedge request, blue-green index swap, LTR (Learning to Rank), freshness tier, document-partitioned vs term-partitioned index.
14. Further Reading #
- “The Anatomy of a Large-Scale Hypertextual Web Search Engine” (Brin & Page, 1998): The original Google paper. Explains PageRank, the inverted index design, and the end-to-end crawl-index-serve architecture. Still required reading; most of the ideas hold up 25 years later.
- “MapReduce: Simplified Data Processing on Large Clusters” (Dean & Ghemawat, OSDI 2004): The batch processing primitive that powers the index build pipeline. The paper is short and concrete.
- “Pregel: A System for Large-Scale Graph Processing” (Malewicz et al., SIGMOD 2010): Google’s distributed PageRank computation framework. Covers the vertex-centric model and barrier synchronisation.
- “Challenges in Building Large-Scale Information Retrieval Systems” (Dean, WSDM 2009): Google’s own retrospective on scaling from the original paper to 2009. Covers the transition from term-partitioned to document-partitioned index.
- Elasticsearch documentation — “Inside a Shard”: A well-written walkthrough of how Lucene segments work, how merges happen, and how Elasticsearch distributes them. Useful as a concrete reference when discussing index architecture.