
Search Engine — Google-Scale Crawl, Index, Rank, and Serve

Lakshay Jawa

1. Hook

Google processes 8.5 billion searches per day — roughly 98 000 queries per second on average — and returns results in under 200 ms. Behind that sub-second response is a pipeline that never fully stops: a web crawler perpetually downloading ~20 billion pages, a MapReduce-scale indexing system converting raw HTML into a compressed inverted index, a multi-stage ranking pipeline that scores hundreds of signals in milliseconds, and a serving layer that shards the index across thousands of machines so no single query touches more than a fraction of the corpus. Building a search engine from scratch is perhaps the canonical “design a distributed system” problem because it combines almost every hard problem in the field: distributed crawling, large-scale data processing, near-real-time index updates, low-latency high-throughput query serving, and machine learning (ML)-based ranking. Even a simplified version at 1/1000th of Google’s scale teaches you more about distributed systems than almost any other exercise.


2. Problem Statement

Functional Requirements

  1. Users can submit a text query and receive a ranked list of relevant web pages within 200 ms.
  2. The crawler continuously discovers and fetches new and updated web pages.
  3. The index reflects new content within minutes for breaking news, within hours for general content.
  4. Results include a title, snippet (summary excerpt), and URL for each page.
  5. Basic query operators are supported: phrase search ("exact phrase"), exclusion (-term), site filter (site:example.com).

Non-Functional Requirements

| Attribute | Target |
| --- | --- |
| Query latency (p99) | < 200 ms end-to-end |
| Index freshness (news tier) | < 15 min from publish to indexed |
| Index freshness (general web) | < 24 hours |
| Crawler politeness | Respect robots.txt; max 1 req/s per domain by default |
| Corpus size | 200 B web pages indexed |
| Query throughput | 100 000 QPS (Queries Per Second) peak |
| Availability | 99.99% |

Out of Scope

  • Image, video, or news search (different index schemas and crawlers)
  • Personalised search (user history, account signals)
  • Ads auction and placement
  • Knowledge Graph / entity extraction
  • Voice search and NLU (Natural Language Understanding)

3. Scale Estimation

Assumptions:

  • 200 B indexed pages; average page size 100 KB raw HTML → 20 PB raw HTML.
  • Compressed inverted index: ~10% of raw HTML → ~2 PB index.
  • Crawl rate needed to refresh 200 B pages every 30 days: 200 B / (30 × 86 400 s) ≈ 77 000 pages/s.
  • Average query: 4 terms; each term hits ~500 M index postings before ranking cuts it to top 10.
  • Index sharding: 2 PB / 500 shards ≈ 4 TB per shard (~500 B postings × 8 bytes each); each shard keeps its hottest posting lists in RAM and the remainder on local SSD.
  • Result snippets: pre-computed and stored separately (~200 bytes/page → 40 TB).
| Metric | Calculation | Result |
| --- | --- | --- |
| Crawl write QPS | 77 000 pages/s × 100 KB | ~7.7 GB/s raw ingest |
| Index storage | 2 PB compressed inverted index | 500 shards × 4 TB each |
| Query QPS | 8.5 B/day / 86 400 s | ~98 000 QPS |
| Index reads per query | 4 terms × 500 shards (fan-out) | 2 000 parallel shard reads |
| Snippet store | 200 B pages × 200 bytes | ~40 TB |
| PageRank (PR) recomputation | Full graph: 200 B nodes, ~1 T edges | Hours (offline MapReduce) |
| Cache size (hot queries, 5% of QPS) | 5 000 QPS × 500 bytes/result | ~2.5 MB/s → 100 GB LRU |

The key insight: the query fan-out (2 000 parallel shard reads per query) means latency is dominated by the slowest shard, not average shard latency. This drives the hedge-and-cancel (backup request) pattern.
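As a quick sanity check, the headline numbers above can be reproduced in a few lines; every input below is an assumption already stated in this section:

```java
public class ScaleEstimate {
    public static void main(String[] args) {
        long pages = 200_000_000_000L;                 // 200 B indexed pages
        double crawlQps = pages / (30.0 * 86_400);     // full refresh every 30 days
        double ingestGBs = crawlQps * 100_000 / 1e9;   // 100 KB average raw page
        double queryQps = 8_500_000_000L / 86_400.0;   // 8.5 B searches/day
        System.out.printf("crawl ≈ %.0f pages/s, ingest ≈ %.1f GB/s, query ≈ %.0f QPS%n",
                crawlQps, ingestGBs, queryQps);
        // → crawl ≈ 77160 pages/s, ingest ≈ 7.7 GB/s, query ≈ 98380 QPS
    }
}
```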


4. High-Level Design

A search engine has four distinct pipelines: crawl, process/index, rank, and serve. Crawl and index are offline/near-real-time bulk pipelines; serve is a latency-critical online system.

flowchart TD
    subgraph Crawl["Crawl Pipeline"]
        URL[URL Frontier\npriority queue]
        FE[Fetcher Fleet\ndistributed HTTP crawlers]
        DS[Duplicate Store\nSimhash dedup]
        RS[Raw Store\nGCS / HDFS]
    end

    subgraph Index["Index Pipeline (batch + streaming)"]
        PA[HTML Parser\nlink extractor]
        TK[Tokeniser &\nText Analyser]
        II[Inverted Index Builder\nMapReduce / Dataflow]
        IS[(Index Shards\nBigtable / custom)]
        SN[(Snippet Store\npre-computed summaries)]
    end

    subgraph Rank["Ranking Pipeline (offline)"]
        LG[Link Graph Builder]
        PR[PageRank / TrustRank\noffline MapReduce]
        RS2[(Rank Store\ndoc → score)]
    end

    subgraph Serve["Query Serving"]
        QP[Query Parser\ntokenise, expand, operators]
        FN[Fan-out Layer\nscatter to index shards]
        IS
        MG[Merge & Score\nBM25 + PR + ML ranker]
        SN
        QC[Query Cache\nRedis LRU]
        API[Search API\nJSON response]
    end

    URL --> FE
    FE -->|raw HTML| DS
    DS -->|unique| RS
    RS --> PA
    PA -->|tokens| TK
    TK --> II
    II --> IS
    PA -->|links| LG
    LG --> PR
    PR --> RS2
    II --> SN
    API --> QC
    QC -->|miss| QP
    QP --> FN
    FN -->|parallel reads| IS
    IS -->|top-K posting lists| MG
    MG -->|doc IDs| SN
    RS2 -->|PR scores| MG
    MG --> API

Component Roles

| Component | Responsibility | Key Choice |
| --- | --- | --- |
| URL Frontier | Priority queue of URLs to crawl; enforces per-domain politeness | Bloom filter for visited URLs; priority by PageRank × freshness score |
| Fetcher Fleet | Download pages; respect robots.txt and crawl-delay | Async I/O (Netty / async-http-client); distributed across regions |
| Duplicate Store | Near-duplicate detection before storing raw HTML | Simhash fingerprint; 64-bit hash with Hamming distance ≤ 3 = duplicate |
| Inverted Index | Maps every token → sorted posting list of (doc_id, tf, positions) | Sharded by term hash; built via MapReduce; served from RAM |
| PageRank Store | Pre-computed authority score per doc; updated offline daily | Iterative graph algorithm (Pregel / Spark GraphX); ~50 iterations to converge |
| Merge & Score | Intersect/union posting lists; apply BM25 + PR + ML signals; top-K selection | Two-phase: coarse BM25 (top 1 000) → ML re-ranker (top 10) |

5. Deep Dive — Critical Components

5a. Web Crawler

The crawler is a distributed system in its own right. The URL Frontier is a priority queue with two orthogonal constraints:

  1. Politeness: Never send more than 1 request per second to any single domain. Implemented as per-domain back-off queues: a URL for example.com is not dequeued until 1 s has elapsed since the last example.com fetch.
  2. Priority: Higher-PageRank domains get crawled more frequently. A blog updated daily needs a fresh crawl daily; a static Wikipedia article needs one only weekly.

Near-duplicate detection uses Simhash: the page’s text is split into shingles (overlapping N-grams), each hashed, and the hashes combined into a single 64-bit fingerprint. Two pages with Hamming distance ≤ 3 are considered near-duplicates and only the canonical version is indexed. This eliminates ~30% of the web corpus.
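A toy version of this scheme, using word-pair shingles and a stand-in 64-bit string hash (production crawlers use a stronger hash such as Murmur3):

```java
import java.util.List;

public class Simhash {
    // Combine per-shingle 64-bit hashes into one fingerprint: each bit of the
    // result is the majority vote of that bit position across all shingles.
    static long fingerprint(List<String> shingles) {
        int[] votes = new int[64];
        for (String s : shingles) {
            long h = hash64(s);
            for (int b = 0; b < 64; b++)
                votes[b] += ((h >>> b) & 1) == 1 ? 1 : -1;
        }
        long fp = 0;
        for (int b = 0; b < 64; b++)
            if (votes[b] > 0) fp |= 1L << b;
        return fp;
    }

    // Placeholder 64-bit hash — any well-distributed 64-bit hash works here
    static long hash64(String s) {
        long h = 1125899906842597L;
        for (char c : s.toCharArray()) h = 31 * h + c;
        return h;
    }

    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = fingerprint(List.of("the quick", "quick brown", "brown fox"));
        long b = fingerprint(List.of("the quick", "quick brown", "brown dog"));
        System.out.println(hammingDistance(a, b) <= 3 ? "near-duplicate" : "distinct");
    }
}
```

Because similar shingle sets produce mostly the same bit votes, similar pages land at small Hamming distance — which is what makes the ≤ 3 threshold a cheap near-duplicate test.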

record CrawlTask(String url, String domain, int priority, Instant notBefore) {}

// Simplified frontier: one back-off queue per domain, highest-priority task first.
// (Single-threaded sketch: the check-then-poll in next() is not atomic across threads.)
class URLFrontier {
    private final Map<String, Queue<CrawlTask>> domainQueues = new ConcurrentHashMap<>();
    private final Map<String, Instant> lastFetchTime = new ConcurrentHashMap<>();

    public void add(CrawlTask task) {
        // CrawlTask is not Comparable, so the priority queue needs an explicit comparator
        domainQueues.computeIfAbsent(task.domain(),
                d -> new PriorityQueue<>(Comparator.comparingInt(CrawlTask::priority).reversed()))
            .add(task);
    }

    public Optional<CrawlTask> next() {
        Instant now = Instant.now();
        for (var entry : domainQueues.entrySet()) {
            String domain = entry.getKey();
            // Politeness: at most one fetch per domain per second
            Instant ready = lastFetchTime.getOrDefault(domain, Instant.EPOCH).plusSeconds(1);
            Queue<CrawlTask> q = entry.getValue();
            CrawlTask head = q.peek();
            if (head != null && now.isAfter(ready) && now.isAfter(head.notBefore())) {
                lastFetchTime.put(domain, now);
                return Optional.of(q.poll());
            }
        }
        return Optional.empty(); // nothing ready yet — caller should back off briefly
    }
}

5b. Inverted Index

The inverted index maps each token to a posting list: a sorted array of (doc_id, term_frequency, [positions]) entries. For the query “distributed systems”, the engine fetches the posting lists for both terms and intersects them (AND query) or unions them (OR query).
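Because posting lists are kept sorted by doc_id, the AND case reduces to a linear-time merge. A minimal sketch over bare doc_id arrays (real postings also carry tf and positions):

```java
import java.util.ArrayList;
import java.util.List;

public class PostingIntersect {
    // Intersect two doc_id lists sorted ascending — O(m + n), one pass each
    static List<Long> and(long[] a, long[] b) {
        List<Long> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;   // advance the list with the smaller head
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] distributed = {3, 7, 12, 31, 55};  // illustrative posting lists
        long[] systems     = {7, 9, 31, 40};
        System.out.println(and(distributed, systems)); // → [7, 31]
    }
}
```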

Index construction uses a MapReduce pipeline:

  • Map phase: For each (doc_id, raw_text) pair, emit (token, (doc_id, tf, positions)).
  • Reduce phase: For each token, collect all (doc_id, tf, positions) entries, sort by doc_id, compress with delta encoding (store differences between consecutive doc_ids rather than absolute IDs — the deltas are small integers, compressible with variable-length encoding like VByte).
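The compression step above can be sketched as follows — delta-encode the sorted doc_ids, then VByte-encode each gap (here using the classic convention that the high bit marks the final byte of each integer):

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class VByteDeltas {
    // Delta-encode sorted doc_ids, then VByte-encode each (small) gap.
    static byte[] compress(long[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long id : sortedDocIds) {
            long gap = id - prev;            // gaps stay small for dense terms
            prev = id;
            Deque<Byte> bytes = new ArrayDeque<>();
            bytes.push((byte) (gap % 128 | 0x80));   // last byte: high bit set
            gap /= 128;
            while (gap > 0) { bytes.push((byte) (gap % 128)); gap /= 128; }
            for (byte b : bytes) out.write(b);
        }
        return out.toByteArray();
    }

    static long[] decompress(byte[] data) {
        List<Long> ids = new ArrayList<>();
        long prev = 0, n = 0;
        for (byte b : data) {
            n = n * 128 + (b & 0x7F);                // accumulate 7 bits per byte
            if ((b & 0x80) != 0) { prev += n; ids.add(prev); n = 0; }
        }
        return ids.stream().mapToLong(Long::longValue).toArray();
    }
}
```

Three doc_ids such as 824, 829, 215406 become the gaps 824, 5, 214577 — the first two fit in two bytes and one byte respectively, instead of eight bytes each.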

A 200 B page corpus generates roughly 2 PB of compressed index data, sharded across 500 machines. Each shard holds the posting lists for a disjoint subset of tokens (sharded by hash(token) % 500).

Index updates: Batch MapReduce runs nightly for the general web tier. For news/freshness tier, a streaming pipeline (Kafka + Dataflow) produces incremental index segments merged into the live index every 15 minutes using a log-structured merge (LSM) approach.

5c. Query Processing and Fan-out

When a query arrives:

  1. Parse: Tokenise, apply stemming/lemmatisation, expand synonyms, detect operators (site:, -, "").
  2. Fan-out (scatter): Send each query term to its responsible index shard in parallel. With 4-term queries across 500 shards, each query generates ~2 000 parallel RPCs (Remote Procedure Calls). The fan-out layer uses the hedge request pattern: if a shard’s response is not received within 50 ms (p95 latency), issue a second request to a replica. Cancel the slower one when either responds. This bounds tail latency at the cost of ~5% extra load.
  3. Merge: Each shard returns its top-K (top 1 000) doc_ids with BM25 (Best Match 25) scores. The merge layer intersects/unions these lists and applies the two-phase ranker.
  4. Rank (Phase 1 — BM25): A statistical relevance score using term frequency (TF) and inverse document frequency (IDF). Eliminates most docs; retains top 1 000.
  5. Rank (Phase 2 — ML re-ranker): A learned model (historically LambdaMART, now a Transformer-based model) applies hundreds of signals: PageRank, anchor text quality, freshness, user engagement signals, page speed. Produces final top 10.
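Step 2’s hedge pattern can be sketched with CompletableFuture; the shard calls, latencies, and the 50 ms trigger below are stand-ins, and a production fan-out layer would also cancel the losing RPC to shed load:

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

public class HedgedRequest {
    // Daemon timer thread so the JVM can exit without an explicit shutdown
    static final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r); t.setDaemon(true); return t;
        });

    // Fire the primary immediately; if it has not answered within hedgeAfterMs,
    // fire the replica too. Whichever completes first wins.
    static <T> CompletableFuture<T> hedged(Supplier<T> primary, Supplier<T> replica,
                                           long hedgeAfterMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        CompletableFuture.supplyAsync(primary).thenAccept(result::complete);
        timer.schedule(() -> {
            if (!result.isDone())
                CompletableFuture.supplyAsync(replica).thenAccept(result::complete);
        }, hedgeAfterMs, TimeUnit.MILLISECONDS);
        return result;
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        String winner = hedged(
            () -> { sleep(500); return "slow shard"; },  // simulated straggler
            () -> { sleep(10);  return "replica"; },
            50).join();
        System.out.println(winner); // → replica (hedge fired at 50 ms)
    }
}
```

Because the hedge only fires past the p95 point, it adds roughly 5% extra load while cutting the p99 from the straggler’s latency to hedge-trigger plus one replica round trip.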

5d. PageRank at Scale

PageRank models the web as a directed graph where each page’s score is the weighted sum of the scores of pages linking to it. The iterative formula converges in ~50 iterations. At 200 B nodes and ~1 T edges, this requires a distributed graph-processing framework.

Google’s original implementation was a MapReduce job that ran for hours. Modern implementations use Pregel (Google’s vertex-centric graph processing model, now open-sourced as Apache Giraph / Spark GraphX): each vertex holds its current rank and sends messages to its neighbours. Convergence is detected when the change in all vertex scores drops below a threshold.

PageRank is recomputed daily. Between recomputations, new pages receive a provisional score based on the average score of pages linking to them (a heuristic, not exact PR).
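On a toy three-page graph, the vertex update that a Pregel job distributes looks like this (damping factor 0.85; dangling nodes are ignored for brevity):

```java
import java.util.Arrays;

public class PageRankDemo {
    // out[i] = the pages that page i links to
    static double[] rank(int[][] out, int iterations) {
        int n = out.length;
        double d = 0.85;                        // damping factor
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);               // start uniform
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);     // "teleport" term
            for (int i = 0; i < n; i++)
                for (int target : out[i])
                    next[target] += d * pr[i] / out[i].length; // share rank along out-links
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // 0 → 1, 0 → 2, 1 → 2, 2 → 0 : page 2 has the most inbound weight
        double[] pr = rank(new int[][]{{1, 2}, {2}, {0}}, 50);
        System.out.println(Arrays.toString(pr));
    }
}
```

A distributed run replaces the inner loop with per-vertex messages and a barrier per iteration; the arithmetic is identical.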


6. Data Model

documents (Raw Store — GCS / HDFS)

Object key: raw/{crawl_date}/{url_sha256}.gz

| Field | Type | Notes |
| --- | --- | --- |
| url | STRING | Canonical URL after redirect chain |
| fetched_at | TIMESTAMP | |
| http_status | INT | 200 / 301 / 404 etc. |
| raw_html | BYTES (compressed) | Full page HTML |
| simhash | INT64 | 64-bit near-dup fingerprint |
| outlinks | STRING[] | Extracted outbound URLs |

index_postings (per shard — in-memory / Bigtable)

Row key: {token} (within a shard that owns this token’s hash range)

| Column | Type | Notes |
| --- | --- | --- |
| posting_list | BYTES | Delta-encoded, VByte-compressed array of (doc_id, tf, positions[]) |
| doc_freq | INT64 | Number of documents containing this term (for IDF calculation) |
| last_updated | TIMESTAMP | Time of last incremental merge |

doc_metadata (Bigtable — keyed by doc_id)

| Column | Type | Notes |
| --- | --- | --- |
| url | STRING | Canonical URL |
| title | STRING | Extracted <title> |
| snippet | STRING | Pre-computed 160-char summary |
| pagerank | FLOAT | Latest PR score |
| language | STRING | ISO 639-1 language code |
| indexed_at | TIMESTAMP | Last index update |
| content_hash | STRING | SHA-256 of page body; change detection |

link_graph (Bigtable — for PageRank)

Row key: {source_doc_id}

| Column | Type | Notes |
| --- | --- | --- |
| out_edges | INT64[] | Target doc_ids (compressed) |
| in_degree | INT64 | Number of inbound links |

7. Trade-offs

Index Sharding: by Term vs by Document

| Option | Pros | Cons | When to choose |
| --- | --- | --- | --- |
| Term-partitioned (shard by token hash) | All postings for a term on one machine; no cross-shard intersection | Hot terms (common words) create hot shards; fan-out to many shards per query | Google’s original model; good when corpus fits in RAM per shard |
| Document-partitioned (shard by doc_id) | Even load distribution; no hot shards | Must intersect posting lists across all shards for every query term | Large corpora where term-partitioned shards become too large |

Conclusion: Google uses document-partitioned index at scale. Each shard holds all terms for a subset of documents; a query fans out to all shards, and each shard returns its local top-K. This balances load better at 200 B page scale.

Query Cache: Full Result Cache vs Posting List Cache

| Option | Pros | Cons | When to choose |
| --- | --- | --- | --- |
| Full result cache (Redis, key = normalised query) | Zero fan-out on cache hit | Low hit rate for long-tail queries; stale results | Head queries (top 5% account for ~80% of traffic) |
| Posting list cache (cache hot posting lists in RAM on each shard) | Helps all queries containing hot terms | Still requires merge and rank on hit | Universal — always worth doing |

Conclusion: Both layers. Cache full results for top-1000 queries (very high hit rate). Cache hot posting lists in RAM on each index shard to speed up the fan-out step for the long tail.
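A minimal in-process stand-in for the full-result cache, using LinkedHashMap’s access-order mode for LRU eviction (the design above uses Redis for the shared tier; this shows only the eviction policy):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryResultCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public QueryResultCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder=true → iteration order is LRU
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict least-recently-used beyond capacity
    }

    public static void main(String[] args) {
        var cache = new QueryResultCache<String, String>(2);
        cache.put("java streams", "results-A");
        cache.put("system design", "results-B");
        cache.get("java streams");               // touch → now most recent
        cache.put("vector search", "results-C"); // evicts "system design"
        System.out.println(cache.keySet());      // → [java streams, vector search]
    }
}
```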

Freshness vs Throughput: Streaming vs Batch Index Updates

| Option | Pros | Cons | When to choose |
| --- | --- | --- | --- |
| Batch MapReduce (nightly) | High throughput; simple operational model | 24-hour freshness lag | General web tier |
| Streaming (Kafka + Dataflow, 15 min segments) | Near-real-time freshness | Operational complexity; segment merging overhead | News / freshness tier |
| Real-time (per-page update on crawl) | Immediate index update | Very high write amplification; hard to maintain index quality | Breaking news only — maintained as a separate “freshness index” |

Conclusion: Tiered freshness. A small “freshness index” (top ~1 B pages, updated continuously) is merged with the main index at query time. The main index is rebuilt nightly via batch pipeline.


8. Failure Modes

| Failure | Symptom | Impact | Mitigation |
| --- | --- | --- | --- |
| Index shard crash | Queries missing postings for that shard's token range | Degraded result quality; missing documents | Each shard has ≥ 2 replicas; fan-out load-balances across replicas; hedge requests detect slow shards |
| Hot term (e.g. "breaking news keyword") | Single shard responsible for that term overwhelmed | Latency spike for all queries containing that term | Replicate hot-term postings to additional shard replicas; route queries round-robin across replicas; cache full posting list in RAM |
| Crawler overloads a domain | Domain rate-limits or blocks crawler IP range | Stale index for that domain | Strict per-domain politeness (1 req/s); honour robots.txt Crawl-delay; exponential back-off on 429/503 responses |
| Index pipeline failure (batch job) | Nightly index rebuild fails partway through | Stale index served; no freshness degradation if old index kept live | Blue-green index deployment: new index built offline, atomically swapped into serving when build completes and passes quality checks |
| Slow shard in query fan-out (tail latency) | One of 500 fan-out RPCs takes 500 ms | Entire query latency blown | Hedge request after 50 ms; cancel slower twin; serve result from whichever replica responds first |
| Spam / link farm injection | Low-quality pages rank highly via artificial link schemes | Degraded result quality | TrustRank (seed from known-good domains); spam classifiers on crawled content; manual quality rater feedback loop |

9. Security & Compliance

Bot Detection & Crawler Identity:

  • The crawler identifies itself via a known User-Agent string (Googlebot) and a verifiable IP range published in DNS (Domain Name System). Websites can verify crawler legitimacy via reverse-DNS lookup.
  • The crawler strictly honours robots.txt — fetching it first on every domain before any other page. Violations are logged and treated as bugs.

Query AuthN/AuthZ (Authentication / Authorization):

  • Anonymous queries are allowed for the public web search product. Rate limiting by IP (token bucket: 100 queries/min for anonymous; higher for authenticated API users).
  • Search API (Programmable Search Engine) requires OAuth 2.0 API keys with per-key QPS quotas.
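A minimal per-client token bucket for the anonymous tier; the capacity and refill numbers come from the bullet above, everything else is illustrative:

```java
public class TokenBucket {
    private final double capacity, refillPerMs;
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(double capacity, double refillPerMinute) {
        this.capacity = capacity;                       // max burst size
        this.refillPerMs = refillPerMinute / 60_000.0;  // steady-state rate
        this.tokens = capacity;                         // start full
        this.lastRefillMs = System.currentTimeMillis();
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Lazily credit tokens for the time elapsed since the last call
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) * refillPerMs);
        lastRefillMs = now;
        if (tokens >= 1) { tokens -= 1; return true; }
        return false;  // caller responds with HTTP 429
    }

    public static void main(String[] args) {
        TokenBucket perIp = new TokenBucket(100, 100);  // 100-query burst, 100/min refill
        int allowed = 0;
        for (int i = 0; i < 150; i++) if (perIp.tryAcquire()) allowed++;
        System.out.println(allowed + " of 150 rapid-fire queries allowed");
    }
}
```

The serving layer would keep one bucket per client IP (or per API key), typically in a bounded map with idle-entry expiry.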

Data Privacy:

  • Query logs are retained for 18 months (reduced from permanent after EU regulatory pressure); anonymised after 9 months (cookie/IP association removed).
  • Right to be Forgotten (EU GDPR Article 17): search results linking to specific pages can be de-indexed via a verified removal request. Removal is applied to the serving layer as a blocklist — the raw index is not rebuilt.

Content Safety:

  • SafeSearch filters are applied at the ranking layer: a classifier scores pages for adult/violent content; filtered pages are suppressed for SafeSearch-on queries.
  • CSAM (Child Sexual Abuse Material) detection: PhotoDNA-style perceptual hashing on crawled images; automatic de-indexing and reporting to NCMEC (National Center for Missing and Exploited Children).

Encryption:

  • All internal RPC (Remote Procedure Call) traffic between crawler, indexer, and serving is over mTLS (mutual TLS). Query results served over TLS 1.3.
  • Raw HTML store (GCS/HDFS) encrypted at rest with AES-256.

10. Observability

RED Metrics

| Signal | Metric | Alert Threshold |
| --- | --- | --- |
| Query rate | search_queries_total | < 70% of 7-day baseline (index outage signal) |
| Query error rate | search_errors_total / total | > 0.01% |
| Query latency (p99) | query_duration_seconds | > 200 ms |
| Fan-out latency (p99 per shard) | shard_rtt_seconds | > 100 ms |
| Crawl rate | pages_fetched_per_second | < 50% of target (crawler issue) |
| Index freshness lag | index_lag_seconds (time since page published) | > 30 min for news tier |

Saturation Metrics

| Resource | Metric | Alert Threshold |
| --- | --- | --- |
| Index shard RAM | shard_memory_utilisation | > 85% (eviction risk) |
| Crawl queue depth | url_frontier_size | > 7-day rolling avg × 2 (backlog growing) |
| Merge layer CPU | ranker_cpu_utilisation | > 75% |

Business Metrics

  • Click-through rate (CTR) on top-3 results: proxy for ranking quality; significant drop signals ranking regression.
  • Zero-result rate: fraction of queries returning no results; should be < 0.1%.
  • Index coverage: fraction of known live URLs that are indexed; target > 99%.

Tracing

Each query carries a search_trace_id propagated across fan-out RPCs. Slow-query traces (p99+ latency) are automatically sampled to Cloud Trace / Jaeger. Correlate with shard-level metrics to identify which shard caused tail latency.


11. Scaling Path

Phase 1 — MVP (< 1M indexed pages)

Single machine. SQLite full-text search (FTS5) or Elasticsearch single-node. Crawl with a simple Python script. No deduplication. Ranking: TF-IDF only. Serve queries directly from the index on the same machine.

What breaks first: Elasticsearch single-node tops out around 50M documents before search latency exceeds 200 ms. RAM becomes the bottleneck for posting list hot data.

Phase 2 — Growth (1M → 100M pages)

Elasticsearch cluster (5–10 nodes, document-partitioned). Dedicated crawler fleet (10–50 machines). Crawl queue in Redis. Simhash dedup. Nightly index rebuild. No ML ranker — pure BM25 + PageRank. Add Redis result cache.

What breaks first: PageRank computation. At 100M pages, a single-machine PageRank job takes hours. Migrate to Spark GraphX / Pregel.

Phase 3 — Scale (100M → 10B pages)

Custom sharded inverted index serving from RAM. Streaming index updates (Kafka + Flink) for freshness tier. ML re-ranker (LambdaMART) added as the second-phase ranking stage. Fan-out gateway with hedge requests. Multi-region crawler (crawl from region nearest to the target domain). Blue-green index deployment.

What breaks first: Snippet generation at scale. Pre-compute and store snippets per (doc_id, query_cluster) rather than per raw query — cluster queries by topic and generate representative snippets offline.

Phase 4 — Hyperscale (10B → 200B+ pages)

Tiered index: freshness shard (top 1 B frequently-updated pages) + main shard (remaining 200 B). Tiered storage: hot posting lists in DRAM, warm in NVMe SSD, cold in HDD. Neural ranking model (BERT-based) for top-10 re-rank. Knowledge Graph overlay for entity queries. Distributed link graph with Pregel at 1 T edge scale. Per-query ML feature computation (real-time signals: query freshness intent, user geography).


12. Enterprise Considerations

Brownfield Integration:

  • Enterprise search products (Google Workspace Search, Elastic Enterprise Search, Microsoft SharePoint Search) crawl internal document stores (Confluence, SharePoint, Jira, GDrive) rather than the public web. The same inverted index and BM25 ranking apply; the crawl component is replaced with API-based connectors per data source.

Build vs Buy:

  • For a startup / internal search engine: Elasticsearch or OpenSearch — mature, operationally well-understood, pluggable ranking.
  • For 1 B+ docs: consider Apache Solr with SolrCloud for sharding, or a custom serving layer built on Apache Lucene (the index library that underpins both Elasticsearch and Solr).
  • Web crawler: Apache Nutch (open source, production-grade) or Scrapy (Python, simpler). Never build a production crawler from scratch — robots.txt compliance, dedup, and politeness are deceptively hard.
  • ML ranker: LightGBM / XGBoost for LTR (Learning to Rank) on small corpora; Transformer-based models (BERT, T5) for re-ranking at scale.
  • PageRank / link analysis: Apache Spark GraphX or Google’s Pregel (via Dataproc).

Multi-Tenancy:

  • SaaS enterprise search: each tenant gets a logically isolated index namespace. Shared serving infrastructure but strict per-tenant data isolation (no cross-tenant posting list access).
  • Noisy neighbour: a single tenant running expensive analytical queries (site: operator across millions of docs) can starve other tenants. Enforce per-tenant QPS and query complexity limits.

TCO Ballpark (at 10 B page scale):

  • Index storage (~100 TB compressed, applying the 10% index-to-HTML ratio above): ~$2K/month on NVMe-backed Bigtable.
  • Serving fleet (500 shards × 2 replicas × 8-core machines): ~$150K/month.
  • Crawl fleet (1 000 fetcher instances): ~$30K/month.
  • PageRank MapReduce (daily, 100-node Dataproc cluster, 4 hours): ~$2K/month.
  • Elasticsearch alternative at 10 B docs: similar cost but higher operational complexity.

Conway’s Law Implication: Search engines almost always split into separate teams along the crawl/index/rank/serve boundary. The index format is the contract between the index team and the serving team — treat it as a versioned API and never break backward compatibility mid-deployment.


13. Interview Tips

  • Sketch the four-stage pipeline first. Crawl → index → rank → serve. Interviewers want to see that you understand the system has both an offline pipeline (building the index) and an online system (serving queries). Conflating them is the most common mistake.
  • Explain the inverted index before any other data structure. Every search system worth designing is built on an inverted index. Know what a posting list is, how it is compressed (delta encoding + VByte), and why it is sorted by doc_id (enables merge intersection in linear time).
  • Don’t forget politeness. Candidates often design a crawler that would DDoS every website it visits. Know about robots.txt, Crawl-delay, per-domain rate limiting, and exponential back-off. Interviewers at companies with real crawlers will probe this.
  • Nail the fan-out latency problem. “How do you keep query latency under 200 ms when you fan out to 500 shards?” The answer is hedge requests (backup requests to slow shards) + serving the top-K from each shard rather than full posting lists. This demonstrates operational sophistication.
  • Vocabulary that signals fluency: inverted index, posting list, TF-IDF (Term Frequency–Inverse Document Frequency), BM25, PageRank, Simhash, Bloom filter, delta encoding, VByte compression, URL Frontier, politeness budget, hedge request, blue-green index swap, LTR (Learning to Rank), freshness tier, document-partitioned vs term-partitioned index.

14. Further Reading

  • “The Anatomy of a Large-Scale Hypertextual Web Search Engine” (Brin & Page, 1998): The original Google paper. Explains PageRank, the inverted index design, and the two-server architecture. Still required reading — most of the ideas hold up 25 years later.
  • “MapReduce: Simplified Data Processing on Large Clusters” (Dean & Ghemawat, OSDI 2004): The batch processing primitive that powers the index build pipeline. The paper is short and concrete.
  • “Pregel: A System for Large-Scale Graph Processing” (Malewicz et al., SIGMOD 2010): Google’s distributed PageRank computation framework. Covers the vertex-centric model and barrier synchronisation.
  • “Challenges in Building Large-Scale Information Retrieval Systems” (Dean, WSDM 2009): Google’s own retrospective on scaling from the original paper to 2009. Covers the transition from term-partitioned to document-partitioned index.
  • Elasticsearch documentation — “Inside a Shard”: A well-written walkthrough of how Lucene segments work, how merges happen, and how Elasticsearch distributes them. Useful as a concrete reference when discussing index architecture.
