1. Hook #
In 2006, Google acquired Writely and within two years turned it into Google Docs — the first mainstream product that let multiple people type in the same document at the same time without locking or “check-out” workflows. The core problem sounds deceptively simple: if Alice deletes character 5 while Bob inserts a character at position 4, whose version wins? The naïve answer (“last write wins”) produces corrupted documents. The real answer — Operational Transformation (OT) — is the algorithm that makes collaborative editing feel like magic, and it is one of the most subtle distributed-systems problems you will encounter in an interview. Every major collaborative editor (Google Docs, Notion, Figma, Microsoft 365) is built on either OT or its younger sibling CRDT (Conflict-free Replicated Data Type). Understanding which to use, and why, separates candidates who have thought deeply about consistency from those who have memorised buzzwords.
2. Problem Statement #
Functional Requirements #
- Multiple users can edit the same document concurrently; changes from all users appear in near real-time.
- The document converges to the same state on all clients regardless of network delays or operation ordering.
- Cursor positions and selection ranges of collaborators are visible in real-time.
- Full revision history is maintained; any previous state can be restored.
- Documents can be shared with configurable permissions (viewer / commenter / editor / owner).
- Offline editing is supported; changes sync when connectivity is restored.
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Operation propagation latency (p95) | < 200 ms for users in the same region |
| Convergence guarantee | All clients reach identical state eventually |
| Document availability | 99.99% (reads/writes must not block on collaborator failures) |
| Revision history retention | Indefinite (all-time, compressed) |
| Concurrent editors per document | Up to ~100 simultaneous editors |
| Document size | Up to ~1 M characters (soft cap) |
Out of Scope #
- Real-time voice/video (Google Meet integration is a separate service)
- Spreadsheet formula evaluation (Sheets-specific computation engine)
- Presentation rendering (Slides-specific layout engine)
- Mobile-specific offline-first sync protocol details
3. Scale Estimation #
Assumptions:
- 3B registered Google accounts; ~500M DAU (Daily Active Users) touching Google Workspace.
- 1B documents in existence; ~50M documents actively edited per day.
- Average concurrent editors per active document: 2–3; peak for viral/shared docs: ~100.
- Average operation size: 20 bytes (insert/delete + position + metadata).
- Operations per active user per minute: ~60 (one keystroke per second).
- Revision snapshots: full checkpoint every 100 operations.
| Metric | Calculation | Result |
|---|---|---|
| Active editing sessions | 50M docs/day × avg 3 editors = 150M sessions/day | ~1 740 session starts/s (daily average) |
| Concurrent WebSocket connections | 150M sessions/day × ~17 min avg session ÷ 86 400 s | ~1.74M persistent connections |
| Operations/second | ~1.74M connected editors × 1 op/s × ~10% actively typing | ~150 000 ops/s |
| Operation payload/s | 150 000 × 20 bytes | ~3 MB/s (tiny) |
| Revision log storage/day | 150 000 ops/s × 86 400 s × 20 bytes | ~260 GB/day |
| Snapshot storage | 1B docs × avg 10 KB (compressed) | ~10 TB total |
The bottleneck is not storage or CPU — it is maintaining millions of long-lived WebSocket connections and ordering concurrent operations per document without a global lock.
4. High-Level Design #
The architecture separates three concerns: the real-time collaboration session (OT engine + WebSocket gateway), the persistent document store (operation log + snapshots), and the metadata layer (sharing, permissions, file tree).
```mermaid
flowchart TD
subgraph Clients["Clients (Browser / Mobile)"]
C1[Alice — Chrome]
C2[Bob — Chrome]
C3[Carol — Mobile]
end
subgraph Gateway["WebSocket Gateway\n(regional, sticky sessions)"]
WS[WebSocket Server\nper-document session]
end
subgraph Collab["Collaboration Service"]
OT[OT Engine\noperation transform + apply]
SQ[Operation Sequencer\nper-document mutex]
end
subgraph Storage["Storage Layer"]
OL[(Operation Log\nBigtable / Spanner)]
SN[(Snapshot Store\nGCS — full doc every 100 ops)]
MC[(Metadata DB\nSpanner — sharing, perms)]
end
subgraph Presence["Presence Service"]
PR[Cursor / Selection\nfan-out]
end
C1 <-->|WebSocket| WS
C2 <-->|WebSocket| WS
C3 <-->|WebSocket| WS
WS -->|raw op + client revision| OT
OT <-->|lock + sequence| SQ
OT -->|transformed op| WS
OT -->|append| OL
OT -->|periodic| SN
WS <-->|presence events| PR
MC -->|ACL check| OT
```
Write path: Client sends an operation tagged with the revision number it was based on → OT Engine transforms the op against all concurrent ops since that revision → assigns a global sequence number → broadcasts the transformed op to all other clients in the session → appends to the Operation Log.
Read path (document load): Fetch the nearest snapshot ≤ target revision from GCS (Google Cloud Storage) → replay operations from the Operation Log since that snapshot → reconstruct current document state.
Component Roles #
| Component | Responsibility | Key Choice |
|---|---|---|
| WebSocket Gateway | Maintain persistent connections; route ops to correct document session | Sticky sessions per document — all editors of doc X land on same server shard |
| OT Engine | Transform and apply concurrent operations; maintain server-authoritative document state | Jupiter OT algorithm (used by Google); single server state simplifies transform functions |
| Operation Sequencer | Per-document serialisation point; assigns monotonic revision numbers | In-memory mutex per document; single leader per document shard |
| Operation Log | Append-only log of every transformed operation; source of truth for history | Bigtable keyed by (doc_id, revision); ordered scan for replay |
| Snapshot Store | Full document state checkpoint every N operations; avoids replaying the entire log on load | GCS blob; JSON or Protobuf serialised; snapshot every 100 ops |
| Presence Service | Broadcast cursor positions, selections, and user avatars to collaborators | Ephemeral; stored in Redis with 10 s TTL (Time-To-Live); not persisted |
5. Deep Dive — Critical Components #
5a. Operational Transformation #
OT is built on two properties:
- Convergence: All clients that receive the same set of operations (in any order) must reach the same document state.
- Intention preservation: The meaning of an operation must be honoured even after transformation.
The simplest example: a document contains "ab".
- Alice sends `Insert('c', position=1)` → intended: `"acb"`.
- Bob sends `Delete(position=0)` → intended: `"b"`.
- Server receives Alice’s op first (revision 1), then Bob’s (revision 1, concurrent).

Bob’s `Delete(0)` was formed when the doc was `"ab"`. After Alice’s insert, the doc is `"acb"`. Bob intended to delete `'a'` — still at position 0, so no transformation is needed here. But if Bob had sent `Delete(position=1)` (delete `'b'`), that position must be shifted to 2 after Alice’s insert. The transform function `T(op_b, op_a)` produces the adjusted operation.
Google Docs uses the Jupiter protocol (Nichols et al., 1995): a client-server model where the server is the single serialisation point. This eliminates the need for peer-to-peer transform functions (which are notoriously hard to prove correct for complex operations). Each client maintains:
- A local document state.
- A queue of unacknowledged operations.
- The server revision it last saw.
When an op arrives from the server that was concurrent with an unacknowledged local op, the client transforms the server op against its local queue before applying it.
```java
sealed interface Op permits Insert, Delete {}
record Insert(int position, String text) implements Op {}
record Delete(int position, int length) implements Op {}

final class Transform {
    // Transform op2 as if op1 had already been applied.
    static Op transform(Op op2, Op op1) {
        return switch (op1) {
            case Insert i -> transformAgainstInsert(op2, i);
            case Delete d -> transformAgainstDelete(op2, d);
        };
    }

    private static Op transformAgainstInsert(Op op2, Insert i) {
        return switch (op2) {
            case Insert ins -> ins.position() <= i.position()
                ? ins
                : new Insert(ins.position() + i.text().length(), ins.text());
            case Delete del -> {
                if (del.position() + del.length() <= i.position())
                    yield del;                       // delete lies entirely before the insert
                if (del.position() >= i.position())  // delete lies entirely after the insert
                    yield new Delete(del.position() + i.text().length(), del.length());
                // Delete spans the insert point: extend it to cover the inserted
                // text. This converges; a production engine would instead split
                // the delete in two to preserve the inserted characters.
                yield new Delete(del.position(), del.length() + i.text().length());
            }
        };
    }

    private static Op transformAgainstDelete(Op op2, Delete d) {
        return switch (op2) {
            case Insert ins -> ins.position() <= d.position()
                ? ins
                : new Insert(Math.max(d.position(), ins.position() - d.length()), ins.text());
            case Delete del -> {
                if (del.position() + del.length() <= d.position())
                    yield del;                       // entirely before d's range: unchanged
                if (del.position() >= d.position() + d.length())
                    yield new Delete(del.position() - d.length(), del.length());
                // Overlapping deletes: remove only the portion d has not already
                // removed, starting at the earlier of the two positions.
                int overlap = Math.min(del.position() + del.length(), d.position() + d.length())
                            - Math.max(del.position(), d.position());
                yield new Delete(Math.min(del.position(), d.position()), del.length() - overlap);
            }
        };
    }
}
```

5b. Operation Sequencer and Per-Document Locking #
A document’s operations must be totally ordered. The sequencer is a single in-process lock per document on the Collaboration Service instance responsible for that document. Operations arrive concurrently from multiple WebSocket connections; the sequencer serialises them, assigns a monotonically increasing revision number, transforms each incoming op against all ops since its base revision, and broadcasts the result.
Because the sequencer is in-memory, a crash loses in-flight ops. The mitigation: clients buffer sent ops and resend them if they do not receive an acknowledgement within 5 s. The sequencer is idempotent for ops with the same client-generated UUID (Universally Unique Identifier).
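A minimal sketch of that idempotent sequencer, assuming the hypothetical names `Sequencer` and `SequencedOp` (this is illustrative, not Google's actual API): a single lock serialises ops, assigns monotonic revisions, and dedupes retries by the client-generated op id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Sequencer {
    record SequencedOp(long revision, String clientOpId, String payload) {}

    private final Object lock = new Object();
    private final Map<String, SequencedOp> byClientOpId = new HashMap<>();
    private final List<SequencedOp> log = new ArrayList<>();
    private long revision = 0;

    /** Commits an op; a retry with the same clientOpId returns the original result. */
    SequencedOp commit(String clientOpId, String payload) {
        synchronized (lock) {
            SequencedOp existing = byClientOpId.get(clientOpId);
            if (existing != null) return existing;   // idempotent re-delivery
            SequencedOp op = new SequencedOp(++revision, clientOpId, payload);
            byClientOpId.put(clientOpId, op);
            log.add(op);                             // in production: append to Bigtable before ACK
            return op;
        }
    }

    long currentRevision() { synchronized (lock) { return revision; } }
}
```

A client that times out and resends `op-1` receives the same revision back, so the op is never applied twice.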
5c. Document Load and Snapshot Replay #
On document open:
1. Fetch metadata (permissions, title, current revision number) from Spanner — fast, < 10 ms.
2. Fetch the nearest snapshot ≤ current revision from GCS.
3. Fetch operations in the range `(snapshot_revision, current_revision]` from Bigtable.
4. Apply the operations to the snapshot state to reconstruct the current document.
With a snapshot every 100 ops, step 3 fetches at most 100 rows — typically < 5 ms. For documents with millions of operations but regular snapshots, this pattern keeps load times bounded regardless of document age.
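The snapshot-and-replay load path can be sketched as follows, modelling ops as simple string transforms keyed by revision (the `DocumentLoader` name and in-memory maps are illustrative stand-ins for GCS and a Bigtable range scan):

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

final class DocumentLoader {
    record Snapshot(long revision, String content) {}

    /** Replays ops in (snapshot.revision, targetRevision] on top of the nearest snapshot. */
    static String load(NavigableMap<Long, Snapshot> snapshots,
                       NavigableMap<Long, UnaryOperator<String>> opLog,
                       long targetRevision) {
        // Nearest checkpoint at or below the target revision.
        Snapshot base = snapshots.floorEntry(targetRevision).getValue();
        String doc = base.content();
        // Ordered range scan, mirroring the Bigtable (doc_id, revision) key layout.
        for (UnaryOperator<String> op :
                opLog.subMap(base.revision(), false, targetRevision, true).values()) {
            doc = op.apply(doc);
        }
        return doc;
    }
}
```

Because `floorEntry` picks the closest checkpoint, replay work is bounded by the snapshot interval, not the document's age — the property the section above relies on.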
5d. Presence and Cursor Awareness #
Cursor positions are ephemeral and high-frequency (every mouse move, every keystroke selection). They are not stored in the Operation Log. Instead:
- Clients send cursor-update messages over the same WebSocket connection at most every 50 ms.
- The WebSocket Gateway fans these out directly to all other clients in the session without writing to any database.
- Redis holds the last-known cursor for each user in a session with a 10 s TTL; used only when a new collaborator joins mid-session and needs to hydrate the initial presence state.
Cursor positions must also be transformed as remote operations arrive — an insert before Alice’s cursor shifts her position forward. The client-side OT engine handles this identically to document operations.
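The cursor-shift rule can be sketched with two pure helpers (hypothetical names; a real engine applies the same logic to both endpoints of a selection range):

```java
final class CursorTransform {
    /** An insert at or before the cursor pushes it right by the inserted length. */
    static int afterInsert(int cursor, int insertPos, int insertLen) {
        return insertPos <= cursor ? cursor + insertLen : cursor;
    }

    /** A delete before the cursor pulls it left, clamped to the start of the deleted range. */
    static int afterDelete(int cursor, int deletePos, int deleteLen) {
        if (cursor <= deletePos) return cursor;          // delete entirely after the cursor
        return Math.max(deletePos, cursor - deleteLen);  // shift left, clamp into the gap
    }
}
```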
6. Data Model #
`documents` table (Spanner) #
| Column | Type | Notes |
|---|---|---|
| `doc_id` | STRING(36) PK | UUID |
| `owner_user_id` | STRING(36) | |
| `title` | STRING(1024) | |
| `current_revision` | INT64 | Monotonically increasing |
| `created_at` | TIMESTAMP | |
| `last_modified_at` | TIMESTAMP | |
`operations` table (Bigtable) #
Row key: `{doc_id}#{revision:010d}` (zero-padded for lexicographic scan)
| Column | Type | Notes |
|---|---|---|
| `op_type` | STRING | insert / delete / format |
| `payload` | BYTES | Protobuf-serialised operation |
| `author_user_id` | STRING | |
| `client_op_id` | STRING | Idempotency key |
| `timestamp` | TIMESTAMP | Server commit time |
Why Bigtable here: Append-only, high-throughput writes, ordered range scans by (doc_id, revision) — exactly the Bigtable sweet spot. No updates, no deletes (the log is immutable).
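The zero-padding matters: lexicographic order of the string key must match numeric order of revisions so a prefix range scan replays ops in commit order. A one-method sketch (hypothetical helper, not a Bigtable client API):

```java
final class RowKeys {
    /** Builds the {doc_id}#{revision:010d} row key described above. */
    static String operationKey(String docId, long revision) {
        return String.format("%s#%010d", docId, revision);
    }
}
```

Without the padding, `doc-42#10` would sort before `doc-42#9` and the replay scan would return ops out of order.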
`snapshots` (GCS) #
Object key: `snapshots/{doc_id}/{revision}.pb.gz`
Contains: full document content (Protobuf), the revision number at snapshot time, and a SHA-256 checksum. Immutable once written.
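The load-time integrity check from the failure-modes section can be sketched as: recompute SHA-256 over the snapshot bytes and compare against the stored checksum before trusting the content (`SnapshotVerifier` is an illustrative name).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class SnapshotVerifier {
    static String sha256Hex(byte[] data) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in every JRE
        }
    }

    /** Returns true only if the snapshot bytes match their recorded checksum. */
    static boolean verify(byte[] snapshotBytes, String expectedSha256Hex) {
        return sha256Hex(snapshotBytes).equals(expectedSha256Hex);
    }
}
```

On a mismatch, the loader falls back to the previous checkpoint and replays more ops, as described under Failure Modes.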
`shares` table (Spanner) #
| Column | Type | Notes |
|---|---|---|
| `share_id` | STRING PK | |
| `doc_id` | STRING FK | |
| `grantee_user_id` | STRING | Nullable for link shares |
| `role` | STRING | viewer / commenter / editor / owner |
| `expires_at` | TIMESTAMP | Nullable |
7. Trade-offs #
OT vs CRDT #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| OT (Jupiter / Google Wave) | Proven at scale; intention-preserving; works well for rich text | Requires a central server for total ordering; transform functions are hard to write correctly for complex types | Central-server architectures; rich text with formatting |
| CRDT (e.g. Yjs, Automerge) | Fully peer-to-peer; no central sequencer needed; simpler convergence proofs | Higher memory overhead (tombstones for deleted chars); harder to implement rich formatting intentions | P2P / offline-first apps; local-first architectures |
Conclusion: Google Docs uses OT with a central server — it was the right choice in 2006 and remains so because it enables a single authoritative history. New entrants (Notion, Linear) often choose Yjs (a CRDT library) for its offline-first properties. Neither is universally superior.
Per-Document vs Global Sequencer #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Per-document in-memory sequencer | Zero coordination overhead between documents; horizontally scalable | Single point of failure per document; state lost on crash | Google Docs model — documents are independent |
| Global distributed sequencer (Zookeeper / Spanner) | Durable; survives sequencer crashes without client buffering | High latency for every operation; cross-document ordering not needed | Multi-entity transactional systems |
Conclusion: Per-document sequencer wins because documents are fully independent units. Crash recovery is handled by client-side op buffering and re-delivery, not by durable distributed consensus on the hot path.
Operation Log vs Full-State Storage #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Append-only operation log | Full revision history for free; easy audit; compact | Document load requires replay; replay time grows with doc age | Always use this for collaborative editors |
| Full-state snapshots only | Fast load | No history; large storage; no conflict resolution | Not suitable for collaborative docs |
Conclusion: Use both — log as source of truth, periodic snapshots to bound load time.
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Collaboration Service (sequencer crash) | In-flight ops lost | Clients briefly see stale state; acknowledged ops replayed | Clients buffer all unacknowledged ops; resend on reconnect to new sequencer; last committed revision from Bigtable is the recovery point |
| WebSocket Gateway crash | All sessions on that node disconnected | Clients auto-reconnect; 2–5 s visible disruption | Client reconnect with exponential backoff; session state is in the sequencer, not the gateway |
| Bigtable write failure | Op transformed and broadcast but not persisted | Data loss if sequencer also crashes before retry | Write to Bigtable synchronously before ACK-ing the client; do not broadcast until persisted |
| Network partition (client offline) | Client edits locally; diverges from server | Conflict on reconnect if others edited same region | Client queues ops with local revision; on reconnect, transforms queued ops against server ops since last ack'd revision — standard OT recovery |
| Hot document (100 concurrent editors) | Single sequencer becomes a CPU bottleneck | Operation latency spikes | Rate-limit op frequency per client (max 10 ops/s); debounce fast typists; soft cap of 100 simultaneous editors |
| Corrupted snapshot | Document load fails or produces wrong state | Document unreadable | Verify snapshot checksum on load; fall back to previous checkpoint and replay more ops; checksums validated on write |
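The per-client rate limit used against hot documents can be sketched as a token bucket, refilled continuously and checked before an op enters the sequencer (hypothetical helper; the 10 ops/s figure is this article's, not a published Google value):

```java
final class OpRateLimiter {
    private final double capacity;      // burst size in ops
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastNanos;

    OpRateLimiter(double opsPerSecond, double burst, long nowNanos) {
        this.capacity = burst;
        this.refillPerNano = opsPerSecond / 1_000_000_000.0;
        this.tokens = burst;
        this.lastNanos = nowNanos;
    }

    /** Returns true if the op may proceed; false means drop or delay it. */
    synchronized boolean tryAcquire(long nowNanos) {
        tokens = Math.min(capacity, tokens + (nowNanos - lastNanos) * refillPerNano);
        lastNanos = nowNanos;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```

Taking the clock as a parameter keeps the limiter deterministic and testable; in production the caller passes `System.nanoTime()`.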
9. Security & Compliance #
AuthN/AuthZ (Authentication / Authorization):
- Every WebSocket connection is authenticated via an OAuth 2.0 access token checked on handshake. Tokens are short-lived (1 hour); the WebSocket keeps a heartbeat to renew.
- Each operation is checked against the `shares` table ACL (Access Control List) before being accepted by the OT Engine. A viewer role causes the connection to be read-only — ops are silently dropped server-side and the client is notified.
Encryption:
- TLS (Transport Layer Security) 1.3 for all WebSocket traffic.
- At rest: Bigtable and GCS encrypted with AES-256 (Advanced Encryption Standard 256-bit); Google-managed keys by default, CMEK (Customer-Managed Encryption Keys) available for Workspace Enterprise.
Input Validation:
- Operations are validated for structural correctness (position within document bounds, non-negative length) before entering the OT Engine. Malformed ops are rejected with an error code; they never reach the sequencer.
- Document size is enforced: attempts to insert content beyond the 1M-character soft cap return a quota error.
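That validation gate can be sketched as a pair of pure checks (the `OpValidator` and `OpError` names are illustrative; the 1M-character cap mirrors the soft limit stated earlier):

```java
final class OpValidator {
    static final int MAX_DOC_CHARS = 1_000_000;

    enum OpError { NONE, POSITION_OUT_OF_BOUNDS, NEGATIVE_LENGTH, QUOTA_EXCEEDED }

    /** Structural checks for an insert: in-bounds position, document size cap. */
    static OpError validateInsert(int position, String text, int docLength) {
        if (position < 0 || position > docLength) return OpError.POSITION_OUT_OF_BOUNDS;
        if (docLength + text.length() > MAX_DOC_CHARS) return OpError.QUOTA_EXCEEDED;
        return OpError.NONE;
    }

    /** Structural checks for a delete: non-negative length, range within the document. */
    static OpError validateDelete(int position, int length, int docLength) {
        if (length < 0) return OpError.NEGATIVE_LENGTH;
        if (position < 0 || position + length > docLength) return OpError.POSITION_OUT_OF_BOUNDS;
        return OpError.NONE;
    }
}
```

Anything other than `NONE` is returned to the client as an error code; the op never reaches the sequencer.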
GDPR / Right to Erasure:
- Deleting a document triggers an async purge of all Bigtable rows for that `doc_id` and deletion of all GCS snapshots. Because the operation log is the only copy of the content (no cross-user deduplication, unlike file sync), deletion is complete.
- Shared link tokens are invalidated immediately on permission revocation.
Audit Log:
- Document access events (open, edit, download, share) streamed to Google Vault (an immutable audit log product). Required for Workspace Enterprise SOC 2 (System and Organization Controls 2) compliance.
10. Observability #
RED Metrics (Rate / Errors / Duration) #
| Signal | Metric | Alert Threshold |
|---|---|---|
| Op acceptance rate | `ops_committed_total` per doc_id | Sudden drop → sequencer issue |
| Op rejection rate | `ops_rejected_total / ops_received_total` | > 0.1% |
| Op round-trip latency (p99) | `op_rtt_seconds` (client sends → receives ACK) | > 500 ms |
| WebSocket connection drops | `ws_disconnect_total` rate | > 2× baseline |
| Document load time (p95) | `doc_load_duration_seconds` | > 3 s |
Saturation Metrics #
| Resource | Metric | Alert Threshold |
|---|---|---|
| Sequencer CPU | Per-document op queue depth | > 50 pending ops |
| Bigtable write throughput | Rows written/s vs tablet capacity | > 80% |
| WebSocket connections per gateway node | `active_connections` | > 50 000 |
Business Metrics #
- Collaboration session length: median time two or more users are simultaneously active in a document.
- Conflict rate: fraction of ops that required non-trivial transformation (position delta > 0) — a proxy for how often concurrent edits happen.
- Offline edit rate: fraction of sessions that submitted buffered ops on reconnect — informs offline sync investment.
Tracing #
Each operation carries a trace_id (OpenTelemetry). The trace spans: client SDK → WebSocket Gateway → OT Engine → Bigtable write → broadcast. P99 traces for slow operations are automatically sampled and sent to Cloud Trace for root-cause analysis.
11. Scaling Path #
Phase 1 — MVP (< 10K DAU) #
Single-region. One Collaboration Service instance handles all documents sequentially. PostgreSQL stores both operations and snapshots. WebSocket connections on the same process. Simple broadcast: iterate connected clients.
What breaks first: A single process cannot maintain tens of thousands of WebSocket connections and run the OT engine under load. Node / Netty-based async I/O helps, but the single-threaded sequencer becomes a bottleneck around 1 000 concurrent editing sessions.
Phase 2 — Growth (10K → 500K DAU) #
Shard documents across a fleet of Collaboration Service instances by doc_id hash. Introduce a load balancer that routes WebSocket connections for the same document to the same instance (consistent hashing). Migrate operation log to Bigtable. Add Redis for presence state. Add GCS snapshot pipeline.
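The doc_id-based routing above can be sketched as a consistent-hash ring with virtual nodes, so every editor of the same document lands on the same Collaboration Service instance and adding a node remaps only ~1/N of documents (`DocRouter` is an illustrative name, not a real load-balancer API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.NavigableMap;
import java.util.TreeMap;

final class DocRouter {
    private final NavigableMap<Long, String> ring = new TreeMap<>();

    DocRouter(Iterable<String> instances, int virtualNodes) {
        // Virtual nodes smooth the load distribution across instances.
        for (String inst : instances)
            for (int v = 0; v < virtualNodes; v++)
                ring.put(hash(inst + "#" + v), inst);
    }

    /** Deterministically maps a document to the instance owning its ring segment. */
    String route(String docId) {
        var entry = ring.ceilingEntry(hash(docId));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```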
What breaks first: Hot documents (team meeting notes, shared templates) concentrate load on one shard. Add a per-document rate limiter and soft cap on concurrent editors.
Phase 3 — Scale (500K → 10M DAU) #
Multi-region deployment with regional sequencers. Documents are “homed” to a region (the region where the first editor opened the document). Cross-region latency is accepted for collaborators in other regions (~100–150 ms extra RTT (Round-Trip Time)). Add a CDN-cached read path for document load (snapshot + last N ops cached at edge for 5 s).
What breaks first: Cross-region collaboration on the same document. For truly latency-sensitive use cases, explore CRDT-based replication between regional sequencers, accepting eventual (not immediate) convergence across regions.
Phase 4 — Hyperscale (10M → 500M DAU) #
Per-tenant regional affinity (EU data residency for GDPR). Automated snapshot frequency tuning (more frequent snapshots for hot documents; sparse for cold ones). ML-based prediction of document activity spikes (pre-warm sequencer instances before large meetings). Tiered operation log storage (Bigtable for recent; BigQuery for historical analytics).
12. Enterprise Considerations #
Brownfield Integration:
- Enterprises already have SharePoint / Confluence / Office 365. Google Workspace migration tooling imports .docx files into Docs format, converting the binary format into an initial snapshot — the operation log starts from version 1 at import time.
Build vs Buy:
- OT Engine: build (no general-purpose OT library handles rich text formatting correctly at Google scale; the transform functions are domain-specific).
- WebSocket infrastructure: build on top of Netty / gRPC streaming / Cloud Run WebSockets. Do not use Socket.IO at scale — its fallback mechanisms add complexity.
- Operation Log: Bigtable or DynamoDB Streams. Cassandra is viable but requires careful compaction tuning for append-heavy workloads.
- Presence: Redis Pub/Sub or Pusher for early stages; build in-house once connection counts exceed 100K.
Multi-Tenancy:
- Each enterprise customer’s documents are isolated in a separate Bigtable instance (or at minimum a separate key prefix with IAM (Identity and Access Management) boundaries). Noisy-neighbour risk: a large enterprise generating millions of ops/s must not degrade other tenants.
- Data residency: GDPR-regulated customers require EU-only Bigtable and GCS buckets. The document-homing-to-region model (Phase 3) handles this; the metadata service enforces region affinity on first open.
TCO (Total Cost of Ownership) Ballpark (at 50M DAU):
- Bigtable: ~$0.065/GB/month for storage + ~$0.026/1M reads. At 260 GB/day ingest: ~$600K/month.
- GCS snapshots: ~$0.02/GB. At 10 TB: ~$200/month (negligible).
- Collaboration Service compute: ~1.74M concurrent sessions / 5 000 sessions per 8-core node ≈ 350 nodes at steady state; provisioning ~3× for peaks and failover gives ~1 000 nodes (8 000 cores) → ~$100K/month on preemptible instances.
- Spanner (metadata): ~$0.9/node/hour, 10 nodes → ~$6 500/month.
Conway’s Law Implication: The clean split between the OT Engine team and the WebSocket Gateway team almost always produces an internal API boundary that mirrors the on-wire protocol. Keep that protocol versioned — clients in the wild run old versions for months.
13. Interview Tips #
- Start with the convergence problem. Don’t jump to architecture — first explain why concurrent edits are hard (Alice and Bob both think position 5 is the right place; after the other’s op, it isn’t). This shows you understand the core difficulty.
- Know both OT and CRDT at a high level. You don’t need to implement either from scratch, but you must be able to say: “OT needs a central server; CRDT doesn’t but costs more memory.” Know that Google Docs uses OT and that Figma / Notion lean toward CRDT (Yjs).
- Separate the operation log from the document state. Many candidates store the “current document” as a mutable blob. The correct answer is an immutable, append-only log of operations with periodic snapshots. This also gives revision history for free.
- Nail the failure scenario. “What happens when a client goes offline for 10 minutes and then reconnects with 500 buffered ops?” Walk through: client sends first buffered op with base revision R; server has advanced to R+200; server transforms the client op against all 200 server ops since R; client receives the transformed ops and replays locally. This is the heart of OT.
- Vocabulary that signals fluency: Operational Transformation, Jupiter protocol, intention preservation, convergence, CRDT (Conflict-free Replicated Data Type), tombstone, operational log, snapshot-and-replay, sticky WebSocket session, cursor transformation, presence fan-out, idempotent operation delivery.
14. Further Reading #
- “High-Latency, Low-Bandwidth Windowing in the Jupiter Collaboration System” (Nichols et al., UIST 1995): The original Jupiter OT paper. Short (8 pages) and the theoretical foundation for Google Docs. Read this before any interview.
- Yjs CRDT library (Kevin Jahns): The leading open-source CRDT for collaborative text editing. Its README explains why the author chose CRDT over OT and the memory trade-offs involved — essential reading for understanding the alternative.
- Google Wave “Federation Protocol” (2009): Google’s open-source attempt to federate collaborative editing across servers. The protocol whitepaper explains multi-server OT, which is significantly harder than single-server OT and explains why Wave was ultimately discontinued.
- “Logoot: A Scalable Optimistic Replication Algorithm for Collaborative Editing on P2P Networks” (Weiss et al., 2009): The original sequence CRDT paper; foundational for understanding how CRDTs solve the same problem OT solves, but with different trade-offs.