1. Hook #
In 2006, Google acquired Writely and within two years turned it into Google Docs — the first mainstream product that let multiple people type in the same document at the same time without locking or “check-out” workflows. The core problem sounds deceptively simple: if Alice deletes character 5 while Bob inserts a character at position 4, whose version wins? The naïve answer (“last write wins”) produces corrupted documents. The real answer — Operational Transformation (OT) — is the algorithm that makes collaborative editing feel like magic, and it is one of the most subtle distributed-systems problems you will encounter in an interview. Every major collaborative editor (Google Docs, Notion, Figma, Microsoft 365) is built on either OT or its younger sibling CRDT (Conflict-free Replicated Data Type). Understanding which to use, and why, separates candidates who have thought deeply about consistency from those who have memorised buzzwords.
2. Problem Statement #
Functional Requirements #
- Multiple users can edit the same document concurrently; changes from all users appear in near real-time.
- The document converges to the same state on all clients regardless of network delays or operation ordering.
- Cursor positions and selection ranges of collaborators are visible in real-time.
- Full revision history is maintained; any previous state can be restored.
- Documents can be shared with configurable permissions (viewer / commenter / editor / owner).
- Offline editing is supported; changes sync when connectivity is restored.
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Operation propagation latency (p95) | < 200 ms for users in the same region |
| Convergence guarantee | All clients reach identical state eventually |
| Document availability | 99.99% (reads/writes must not block on collaborator failures) |
| Revision history retention | Indefinite (all-time, compressed) |
| Concurrent editors per document | Up to ~100 simultaneous editors |
| Document size | Up to ~1 M characters (soft cap) |
Out of Scope #
- Real-time voice/video (Google Meet integration is a separate service)
- Spreadsheet formula evaluation (Sheets-specific computation engine)
- Presentation rendering (Slides-specific layout engine)
- Mobile-specific offline-first sync protocol details
3. Scale Estimation #
Assumptions:
- 3B registered Google accounts; ~500M DAU (Daily Active Users) touching Google Workspace.
- 1B documents in existence; ~50M documents actively edited per day.
- Average concurrent editors per active document: 2–3; peak for viral/shared docs: ~100.
- Average operation size: 20 bytes (insert/delete + position + metadata).
- Operations per active user per minute: ~60 (one keystroke per second).
- Revision snapshots: full checkpoint every 100 operations.
| Metric | Calculation | Result |
|---|---|---|
| Active editing sessions | 50M docs/day × avg 3 editors = 150M sessions/day | ~1 740 session starts/s (daily average) |
| Concurrent WebSocket connections | 150M sessions/day × ~17 min avg session ÷ 86 400 s | ~1.74M persistent connections |
| Operations/second | ~1.74M connected editors × 1 op/s × ~10% actively typing | ~150 000 ops/s |
| Operation payload/s | 150 000 × 20 bytes | ~3 MB/s (tiny) |
| Revision log storage/day | 150 000 ops/s × 86 400 s × 20 bytes | ~260 GB/day |
| Snapshot storage | 1B docs × avg 10 KB (compressed) | ~10 TB total |
The bottleneck is not storage or CPU — it is maintaining millions of long-lived WebSocket connections and ordering concurrent operations per document without a global lock.
4. High-Level Design #
The architecture separates three concerns: the real-time collaboration session (OT engine + WebSocket gateway), the persistent document store (operation log + snapshots), and the metadata layer (sharing, permissions, file tree).
```mermaid
flowchart TD
subgraph Clients["Clients (Browser / Mobile)"]
C1[Alice — Chrome]
C2[Bob — Chrome]
C3[Carol — Mobile]
end
subgraph Gateway["WebSocket Gateway\n(regional, sticky sessions)"]
WS[WebSocket Server\nper-document session]
end
subgraph Collab["Collaboration Service"]
OT[OT Engine\noperation transform + apply]
SQ[Operation Sequencer\nper-document mutex]
end
subgraph Storage["Storage Layer"]
OL[(Operation Log\nBigtable / Spanner)]
SN[(Snapshot Store\nGCS — full doc every 100 ops)]
MC[(Metadata DB\nSpanner — sharing, perms)]
end
subgraph Presence["Presence Service"]
PR[Cursor / Selection\nfan-out]
end
C1 <-->|WebSocket| WS
C2 <-->|WebSocket| WS
C3 <-->|WebSocket| WS
WS -->|raw op + client revision| OT
OT <-->|lock + sequence| SQ
OT -->|transformed op| WS
OT -->|append| OL
OT -->|periodic| SN
WS <-->|presence events| PR
MC -->|ACL check| OT
```
Write path: Client sends an operation tagged with the revision number it was based on → OT Engine transforms the op against all concurrent ops since that revision → assigns a global sequence number → broadcasts the transformed op to all other clients in the session → appends to the Operation Log.
Read path (document load): Fetch the nearest snapshot ≤ target revision from GCS (Google Cloud Storage) → replay operations from the Operation Log since that snapshot → reconstruct current document state.
Component Roles #
| Component | Responsibility | Key Choice |
|---|---|---|
| WebSocket Gateway | Maintain persistent connections; route ops to correct document session | Sticky sessions per document — all editors of doc X land on same server shard |
| OT Engine | Transform and apply concurrent operations; maintain server-authoritative document state | Jupiter OT algorithm (used by Google); single server state simplifies transform functions |
| Operation Sequencer | Per-document serialisation point; assigns monotonic revision numbers | In-memory mutex per document; single leader per document shard |
| Operation Log | Append-only log of every transformed operation; source of truth for history | Bigtable keyed by (doc_id, revision); ordered scan for replay |
| Snapshot Store | Full document state checkpoint every N operations; avoids replaying the entire log on load | GCS blob; JSON or Protobuf serialised; snapshot every 100 ops |
| Presence Service | Broadcast cursor positions, selections, and user avatars to collaborators | Ephemeral; stored in Redis with 10 s TTL (Time-To-Live); not persisted |
5. Deep Dive — Critical Components #
5a. Operational Transformation #
OT is built on two properties:
- Convergence: All clients that receive the same set of operations (in any order) must reach the same document state.
- Intention preservation: The meaning of an operation must be honoured even after transformation.
The simplest example: a document contains "ab".
- Alice sends `Insert('c', position=1)` → intended: `"acb"`.
- Bob sends `Delete(position=0)` → intended: `"b"`.
- Server receives Alice’s op first (revision 1), then Bob’s (revision 1, concurrent).

Bob’s `Delete(0)` was formed when the doc was `"ab"`. After Alice’s insert, the doc is `"acb"`. Bob intended to delete `'a'` — still at position 0, so no transformation is needed here. But if Bob had sent `Delete(position=1)` (delete `'b'`), that position must be shifted to 2 after Alice’s insert. The transform function `T(op_b, op_a)` produces the adjusted operation.
Google Docs uses the Jupiter protocol (Nichols et al., 1995): a client-server model where the server is the single serialisation point. This eliminates the need for peer-to-peer transform functions (which are notoriously hard to prove correct for complex operations). Each client maintains:
- A local document state.
- A queue of unacknowledged operations.
- The server revision it last saw.
When an op arrives from the server that was concurrent with an unacknowledged local op, the client transforms the server op against its local queue before applying it.
```java
sealed interface Op permits Insert, Delete {}
record Insert(int position, String text) implements Op {}
record Delete(int position, int length) implements Op {}

final class Transform {
    // Transform op2 as if op1 had already been applied.
    static Op transform(Op op2, Op op1) {
        return switch (op1) {
            case Insert i -> transformAgainstInsert(op2, i);
            case Delete d -> transformAgainstDelete(op2, d);
        };
    }

    private static Op transformAgainstInsert(Op op2, Insert i) {
        return switch (op2) {
            case Insert ins -> ins.position() <= i.position()
                ? ins
                : new Insert(ins.position() + i.text().length(), ins.text());
            case Delete del -> {
                if (del.position() + del.length() <= i.position())
                    yield del;                       // delete lies entirely before the insert
                if (del.position() >= i.position())  // delete lies entirely after the insert
                    yield new Delete(del.position() + i.text().length(), del.length());
                // Delete spans the insert point: extend it to cover the inserted
                // text. This converges; a production engine would instead split
                // the delete in two to preserve the inserted characters.
                yield new Delete(del.position(), del.length() + i.text().length());
            }
        };
    }

    private static Op transformAgainstDelete(Op op2, Delete d) {
        return switch (op2) {
            case Insert ins -> ins.position() <= d.position()
                ? ins
                : new Insert(Math.max(d.position(), ins.position() - d.length()), ins.text());
            case Delete del -> {
                if (del.position() + del.length() <= d.position())
                    yield del;                       // entirely before d's range: unchanged
                if (del.position() >= d.position() + d.length())
                    yield new Delete(del.position() - d.length(), del.length());
                // Overlapping deletes: remove only the portion d has not already
                // removed, starting at the earlier of the two positions.
                int overlap = Math.min(del.position() + del.length(), d.position() + d.length())
                            - Math.max(del.position(), d.position());
                yield new Delete(Math.min(del.position(), d.position()), del.length() - overlap);
            }
        };
    }
}
```

5b. Operation Sequencer and Per-Document Locking #
A document’s operations must be totally ordered. The sequencer is a single in-process lock per document on the Collaboration Service instance responsible for that document. Operations arrive concurrently from multiple WebSocket connections; the sequencer serialises them, assigns a monotonically increasing revision number, transforms each incoming op against all ops since its base revision, and broadcasts the result.
Because the sequencer is in-memory, a crash loses in-flight ops. The mitigation: clients buffer sent ops and resend them if they do not receive an acknowledgement within 5 s. The sequencer is idempotent for ops with the same client-generated UUID (Universally Unique Identifier).
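A minimal sketch of that idempotent sequencer, assuming the hypothetical names `Sequencer` and `SequencedOp` (this is illustrative, not Google's actual API): a single lock serialises ops, assigns monotonic revisions, and dedupes retries by the client-generated op id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Sequencer {
    record SequencedOp(long revision, String clientOpId, String payload) {}

    private final Object lock = new Object();
    private final Map<String, SequencedOp> byClientOpId = new HashMap<>();
    private final List<SequencedOp> log = new ArrayList<>();
    private long revision = 0;

    /** Commits an op; a retry with the same clientOpId returns the original result. */
    SequencedOp commit(String clientOpId, String payload) {
        synchronized (lock) {
            SequencedOp existing = byClientOpId.get(clientOpId);
            if (existing != null) return existing;   // idempotent re-delivery
            SequencedOp op = new SequencedOp(++revision, clientOpId, payload);
            byClientOpId.put(clientOpId, op);
            log.add(op);                             // in production: append to Bigtable before ACK
            return op;
        }
    }

    long currentRevision() { synchronized (lock) { return revision; } }
}
```

A client that times out and resends `op-1` receives the same revision back, so the op is never applied twice.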
5c. Document Load and Snapshot Replay #
On document open:
1. Fetch metadata (permissions, title, current revision number) from Spanner — fast, < 10 ms.
2. Fetch the nearest snapshot ≤ current revision from GCS.
3. Fetch operations in the range `(snapshot_revision, current_revision]` from Bigtable.
4. Apply the operations to the snapshot state to reconstruct the current document.
With a snapshot every 100 ops, step 3 fetches at most 100 rows — typically < 5 ms. For documents with millions of operations but regular snapshots, this pattern keeps load times bounded regardless of document age.
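The snapshot-and-replay load path can be sketched as follows, modelling ops as simple string transforms keyed by revision (the `DocumentLoader` name and in-memory maps are illustrative stand-ins for GCS and a Bigtable range scan):

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

final class DocumentLoader {
    record Snapshot(long revision, String content) {}

    /** Replays ops in (snapshot.revision, targetRevision] on top of the nearest snapshot. */
    static String load(NavigableMap<Long, Snapshot> snapshots,
                       NavigableMap<Long, UnaryOperator<String>> opLog,
                       long targetRevision) {
        // Nearest checkpoint at or below the target revision.
        Snapshot base = snapshots.floorEntry(targetRevision).getValue();
        String doc = base.content();
        // Ordered range scan, mirroring the Bigtable (doc_id, revision) key layout.
        for (UnaryOperator<String> op :
                opLog.subMap(base.revision(), false, targetRevision, true).values()) {
            doc = op.apply(doc);
        }
        return doc;
    }
}
```

Because `floorEntry` picks the closest checkpoint, replay work is bounded by the snapshot interval, not the document's age — the property the section above relies on.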
5d. Presence and Cursor Awareness #
Cursor positions are ephemeral and high-frequency (every mouse move, every keystroke selection). They are not stored in the Operation Log. Instead:
- Clients send cursor-update messages over the same WebSocket connection at most every 50 ms.
- The WebSocket Gateway fans these out directly to all other clients in the session without writing to any database.
- Redis holds the last-known cursor for each user in a session with a 10 s TTL; used only when a new collaborator joins mid-session and needs to hydrate the initial presence state.
Cursor positions must also be transformed as remote operations arrive — an insert before Alice’s cursor shifts her position forward. The client-side OT engine handles this identically to document operations.
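The cursor-shift rule can be sketched with two pure helpers (hypothetical names; a real engine applies the same logic to both endpoints of a selection range):

```java
final class CursorTransform {
    /** An insert at or before the cursor pushes it right by the inserted length. */
    static int afterInsert(int cursor, int insertPos, int insertLen) {
        return insertPos <= cursor ? cursor + insertLen : cursor;
    }

    /** A delete before the cursor pulls it left, clamped to the start of the deleted range. */
    static int afterDelete(int cursor, int deletePos, int deleteLen) {
        if (cursor <= deletePos) return cursor;          // delete entirely after the cursor
        return Math.max(deletePos, cursor - deleteLen);  // shift left, clamp into the gap
    }
}
```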
6. Data Model #
`documents` table (Spanner) #
| Column | Type | Notes |
|---|---|---|
| `doc_id` | STRING(36) PK | UUID |
| `owner_user_id` | STRING(36) | |
| `title` | STRING(1024) | |
| `current_revision` | INT64 | Monotonically increasing |
| `created_at` | TIMESTAMP | |
| `last_modified_at` | TIMESTAMP | |
`operations` table (Bigtable) #
Row key: `{doc_id}#{revision:010d}` (zero-padded for lexicographic scan)
| Column | Type | Notes |
|---|---|---|
| `op_type` | STRING | insert / delete / format |
| `payload` | BYTES | Protobuf-serialised operation |
| `author_user_id` | STRING | |
| `client_op_id` | STRING | Idempotency key |
| `timestamp` | TIMESTAMP | Server commit time |
Why Bigtable here: Append-only, high-throughput writes, ordered range scans by (doc_id, revision) — exactly the Bigtable sweet spot. No updates, no deletes (the log is immutable).
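The zero-padding matters: lexicographic order of the string key must match numeric order of revisions so a prefix range scan replays ops in commit order. A one-method sketch (hypothetical helper, not a Bigtable client API):

```java
final class RowKeys {
    /** Builds the {doc_id}#{revision:010d} row key described above. */
    static String operationKey(String docId, long revision) {
        return String.format("%s#%010d", docId, revision);
    }
}
```

Without the padding, `doc-42#10` would sort before `doc-42#9` and the replay scan would return ops out of order.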
`snapshots` (GCS) #
Object key: `snapshots/{doc_id}/{revision}.pb.gz`
Contains: full document content (Protobuf), the revision number at snapshot time, and a SHA-256 checksum. Immutable once written.
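The load-time integrity check from the failure-modes section can be sketched as: recompute SHA-256 over the snapshot bytes and compare against the stored checksum before trusting the content (`SnapshotVerifier` is an illustrative name).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class SnapshotVerifier {
    static String sha256Hex(byte[] data) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in every JRE
        }
    }

    /** Returns true only if the snapshot bytes match their recorded checksum. */
    static boolean verify(byte[] snapshotBytes, String expectedSha256Hex) {
        return sha256Hex(snapshotBytes).equals(expectedSha256Hex);
    }
}
```

On a mismatch, the loader falls back to the previous checkpoint and replays more ops, as described under Failure Modes.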
`shares` table (Spanner) #
| Column | Type | Notes |
|---|---|---|
| `share_id` | STRING PK | |
| `doc_id` | STRING FK | |
| `grantee_user_id` | STRING | Nullable for link shares |
| `role` | STRING | viewer / commenter / editor / owner |
| `expires_at` | TIMESTAMP | Nullable |
7. Trade-offs #
OT vs CRDT #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| OT (Jupiter / Google Wave) | Proven at scale; intention-preserving; works well for rich text | Requires a central server for total ordering; transform functions are hard to write correctly for complex types | Central-server architectures; rich text with formatting |
| CRDT (e.g. Yjs, Automerge) | Fully peer-to-peer; no central sequencer needed; simpler convergence proofs | Higher memory overhead (tombstones for deleted chars); harder to implement rich formatting intentions | P2P / offline-first apps; local-first architectures |
Conclusion: Google Docs uses OT with a central server — it was the right choice in 2006 and remains so because it enables a single authoritative history. New entrants (Notion, Linear) often choose Yjs (a CRDT library) for its offline-first properties. Neither is universally superior.
Per-Document vs Global Sequencer #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Per-document in-memory sequencer | Zero coordination overhead between documents; horizontally scalable | Single point of failure per document; state lost on crash | Google Docs model — documents are independent |
| Global distributed sequencer (Zookeeper / Spanner) | Durable; survives sequencer crashes without client buffering | High latency for every operation; cross-document ordering not needed | Multi-entity transactional systems |
Conclusion: Per-document sequencer wins because documents are fully independent units. Crash recovery is handled by client-side op buffering and re-delivery, not by durable distributed consensus on the hot path.
Operation Log vs Full-State Storage #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Append-only operation log | Full revision history for free; easy audit; compact | Document load requires replay; replay time grows with doc age | Always use this for collaborative editors |
| Full-state snapshots only | Fast load | No history; large storage; no conflict resolution | Not suitable for collaborative docs |
Conclusion: Use both — log as source of truth, periodic snapshots to bound load time.
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Collaboration Service (sequencer crash) | In-flight ops lost | Clients briefly see stale state; acknowledged ops replayed | Clients buffer all unacknowledged ops; resend on reconnect to new sequencer; last committed revision from Bigtable is the recovery point |
| WebSocket Gateway crash | All sessions on that node disconnected | Clients auto-reconnect; 2–5 s visible disruption | Client reconnect with exponential backoff; session state is in the sequencer, not the gateway |
| Bigtable write failure | Op transformed and broadcast but not persisted | Data loss if sequencer also crashes before retry | Write to Bigtable synchronously before ACK-ing the client; do not broadcast until persisted |
| Network partition (client offline) | Client edits locally; diverges from server | Conflict on reconnect if others edited same region | Client queues ops with local revision; on reconnect, transforms queued ops against server ops since last ack'd revision — standard OT recovery |
| Hot document (100 concurrent editors) | Single sequencer becomes a CPU bottleneck | Operation latency spikes | Rate-limit op frequency per client (max 10 ops/s); debounce fast typists; soft cap of 100 simultaneous editors |
| Corrupted snapshot | Document load fails or produces wrong state | Document unreadable | Verify snapshot checksum on load; fall back to previous checkpoint and replay more ops; checksums validated on write |
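The per-client rate limit used against hot documents can be sketched as a token bucket, refilled continuously and checked before an op enters the sequencer (hypothetical helper; the 10 ops/s figure is this article's, not a published Google value):

```java
final class OpRateLimiter {
    private final double capacity;      // burst size in ops
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastNanos;

    OpRateLimiter(double opsPerSecond, double burst, long nowNanos) {
        this.capacity = burst;
        this.refillPerNano = opsPerSecond / 1_000_000_000.0;
        this.tokens = burst;
        this.lastNanos = nowNanos;
    }

    /** Returns true if the op may proceed; false means drop or delay it. */
    synchronized boolean tryAcquire(long nowNanos) {
        tokens = Math.min(capacity, tokens + (nowNanos - lastNanos) * refillPerNano);
        lastNanos = nowNanos;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```

Taking the clock as a parameter keeps the limiter deterministic and testable; in production the caller passes `System.nanoTime()`.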
9. Security & Compliance #
AuthN/AuthZ (Authentication / Authorization):
- Every WebSocket connection is authenticated via an OAuth 2.0 access token checked on handshake. Tokens are short-lived (1 hour); the WebSocket keeps a heartbeat to renew.
- Each operation is checked against the `shares` table ACL (Access Control List) before being accepted by the OT Engine. A viewer role causes the connection to be read-only — ops are silently dropped server-side and the client is notified.
Encryption:
- TLS (Transport Layer Security) 1.3 for all WebSocket traffic.
- At rest: Bigtable and GCS encrypted with AES-256 (Advanced Encryption Standard 256-bit); Google-managed keys by default, CMEK (Customer-Managed Encryption Keys) available for Workspace Enterprise.
Input Validation:
- Operations are validated for structural correctness (position within document bounds, non-negative length) before entering the OT Engine. Malformed ops are rejected with an error code; they never reach the sequencer.
- Document size is enforced: attempts to insert content beyond the 1M-character soft cap return a quota error.
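That validation gate can be sketched as a pair of pure checks (the `OpValidator` and `OpError` names are illustrative; the 1M-character cap mirrors the soft limit stated earlier):

```java
final class OpValidator {
    static final int MAX_DOC_CHARS = 1_000_000;

    enum OpError { NONE, POSITION_OUT_OF_BOUNDS, NEGATIVE_LENGTH, QUOTA_EXCEEDED }

    /** Structural checks for an insert: in-bounds position, document size cap. */
    static OpError validateInsert(int position, String text, int docLength) {
        if (position < 0 || position > docLength) return OpError.POSITION_OUT_OF_BOUNDS;
        if (docLength + text.length() > MAX_DOC_CHARS) return OpError.QUOTA_EXCEEDED;
        return OpError.NONE;
    }

    /** Structural checks for a delete: non-negative length, range within the document. */
    static OpError validateDelete(int position, int length, int docLength) {
        if (length < 0) return OpError.NEGATIVE_LENGTH;
        if (position < 0 || position + length > docLength) return OpError.POSITION_OUT_OF_BOUNDS;
        return OpError.NONE;
    }
}
```

Anything other than `NONE` is returned to the client as an error code; the op never reaches the sequencer.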
GDPR / Right to Erasure:
- Deleting a document triggers an async purge of all Bigtable rows for that `doc_id` and deletion of all GCS snapshots. Because the operation log is the only copy of the content (no cross-user deduplication, unlike file sync), deletion is complete.
- Shared link tokens are invalidated immediately on permission revocation.
Audit Log:
- Document access events (open, edit, download, share) streamed to Google Vault (an immutable audit log product). Required for Workspace Enterprise SOC 2 (System and Organization Controls 2) compliance.
10. Observability #
RED Metrics (Rate / Errors / Duration) #
| Signal | Metric | Alert Threshold |
|---|---|---|
| Op acceptance rate | `ops_committed_total` per doc_id | Sudden drop → sequencer issue |
| Op rejection rate | `ops_rejected_total / ops_received_total` | > 0.1% |
| Op round-trip latency (p99) | `op_rtt_seconds` (client sends → receives ACK) | > 500 ms |
| WebSocket connection drops | `ws_disconnect_total` rate | > 2× baseline |
| Document load time (p95) | `doc_load_duration_seconds` | > 3 s |
Saturation Metrics #
| Resource | Metric | Alert Threshold |
|---|---|---|
| Sequencer CPU | Per-document op queue depth | > 50 pending ops |
| Bigtable write throughput | Rows written/s vs tablet capacity | > 80% |
| WebSocket connections per gateway node | `active_connections` | > 50 000 |
Business Metrics #
- Collaboration session length: median time two or more users are simultaneously active in a document.
- Conflict rate: fraction of ops that required non-trivial transformation (position delta > 0) — a proxy for how often concurrent edits happen.
- Offline edit rate: fraction of sessions that submitted buffered ops on reconnect — informs offline sync investment.
Tracing #
Each operation carries a trace_id (OpenTelemetry). The trace spans: client SDK → WebSocket Gateway → OT Engine → Bigtable write → broadcast. P99 traces for slow operations are automatically sampled and sent to Cloud Trace for root-cause analysis.
11. Scaling Path #
Phase 1 — MVP (< 10K DAU) #
Single-region. One Collaboration Service instance handles all documents sequentially. PostgreSQL stores both operations and snapshots. WebSocket connections on the same process. Simple broadcast: iterate connected clients.
What breaks first: A single process cannot maintain tens of thousands of WebSocket connections and run the OT engine under load. Node / Netty-based async I/O helps, but the single-threaded sequencer becomes a bottleneck around 1 000 concurrent editing sessions.
Phase 2 — Growth (10K → 500K DAU) #
Shard documents across a fleet of Collaboration Service instances by doc_id hash. Introduce a load balancer that routes WebSocket connections for the same document to the same instance (consistent hashing). Migrate operation log to Bigtable. Add Redis for presence state. Add GCS snapshot pipeline.
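The doc_id-based routing above can be sketched as a consistent-hash ring with virtual nodes, so every editor of the same document lands on the same Collaboration Service instance and adding a node remaps only ~1/N of documents (`DocRouter` is an illustrative name, not a real load-balancer API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.NavigableMap;
import java.util.TreeMap;

final class DocRouter {
    private final NavigableMap<Long, String> ring = new TreeMap<>();

    DocRouter(Iterable<String> instances, int virtualNodes) {
        // Virtual nodes smooth the load distribution across instances.
        for (String inst : instances)
            for (int v = 0; v < virtualNodes; v++)
                ring.put(hash(inst + "#" + v), inst);
    }

    /** Deterministically maps a document to the instance owning its ring segment. */
    String route(String docId) {
        var entry = ring.ceilingEntry(hash(docId));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```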
What breaks first: Hot documents (team meeting notes, shared templates) concentrate load on one shard. Add a per-document rate limiter and soft cap on concurrent editors.
Phase 3 — Scale (500K → 10M DAU) #
Multi-region deployment with regional sequencers. Documents are “homed” to a region (the region where the first editor opened the document). Cross-region latency is accepted for collaborators in other regions (~100–150 ms extra RTT (Round-Trip Time)). Add a CDN-cached read path for document load (snapshot + last N ops cached at edge for 5 s).
What breaks first: Cross-region collaboration on the same document. For truly latency-sensitive use cases, explore CRDT-based replication between regional sequencers, accepting eventual (not immediate) convergence across regions.
Phase 4 — Hyperscale (10M → 500M DAU) #
Per-tenant regional affinity (EU data residency for GDPR). Automated snapshot frequency tuning (more frequent snapshots for hot documents; sparse for cold ones). ML-based prediction of document activity spikes (pre-warm sequencer instances before large meetings). Tiered operation log storage (Bigtable for recent; BigQuery for historical analytics).
12. Enterprise Considerations #
Brownfield Integration:
- Enterprises already have SharePoint / Confluence / Office 365. Google Workspace migration tooling imports .docx files into Docs format, converting the binary format into an initial snapshot — the operation log starts from version 1 at import time.
Build vs Buy:
- OT Engine: build (no general-purpose OT library handles rich text formatting correctly at Google scale; the transform functions are domain-specific).
- WebSocket infrastructure: build on top of Netty / gRPC streaming / Cloud Run WebSockets. Do not use Socket.IO at scale — its fallback mechanisms add complexity.
- Operation Log: Bigtable or DynamoDB Streams. Cassandra is viable but requires careful compaction tuning for append-heavy workloads.
- Presence: Redis Pub/Sub or Pusher for early stages; build in-house once connection counts exceed 100K.
Multi-Tenancy:
- Each enterprise customer’s documents are isolated in a separate Bigtable instance (or at minimum a separate key prefix with IAM (Identity and Access Management) boundaries). Noisy-neighbour risk: a large enterprise generating millions of ops/s must not degrade other tenants.
- Data residency: GDPR-regulated customers require EU-only Bigtable and GCS buckets. The document-homing-to-region model (Phase 3) handles this; the metadata service enforces region affinity on first open.
TCO (Total Cost of Ownership) Ballpark (at 50M DAU):
- Bigtable: ~$0.065/GB/month for storage + ~$0.026/1M reads. At 260 GB/day ingest: ~$600K/month.
- GCS snapshots: ~$0.02/GB. At 10 TB: ~$200/month (negligible).
- Collaboration Service compute: ~1.74M concurrent sessions / 5 000 sessions per 8-core node ≈ 350 nodes at steady state; provisioning ~3× for peaks and failover gives ~1 000 nodes (8 000 cores) → ~$100K/month on preemptible instances.
- Spanner (metadata): ~$0.9/node/hour, 10 nodes → ~$6 500/month.
Conway’s Law Implication: The clean split between the OT Engine team and the WebSocket Gateway team almost always produces an internal API boundary that mirrors the on-wire protocol. Keep that protocol versioned — clients in the wild run old versions for months.
13. Interview Tips #
- Start with the convergence problem. Don’t jump to architecture — first explain why concurrent edits are hard (Alice and Bob both think position 5 is the right place; after the other’s op, it isn’t). This shows you understand the core difficulty.
- Know both OT and CRDT at a high level. You don’t need to implement either from scratch, but you must be able to say: “OT needs a central server; CRDT doesn’t but costs more memory.” Know that Google Docs uses OT and that Figma / Notion lean toward CRDT (Yjs).
- Separate the operation log from the document state. Many candidates store the “current document” as a mutable blob. The correct answer is an immutable, append-only log of operations with periodic snapshots. This also gives revision history for free.
- Nail the failure scenario. “What happens when a client goes offline for 10 minutes and then reconnects with 500 buffered ops?” Walk through: client sends first buffered op with base revision R; server has advanced to R+200; server transforms the client op against all 200 server ops since R; client receives the transformed ops and replays locally. This is the heart of OT.
- Vocabulary that signals fluency: Operational Transformation, Jupiter protocol, intention preservation, convergence, CRDT (Conflict-free Replicated Data Type), tombstone, operational log, snapshot-and-replay, sticky WebSocket session, cursor transformation, presence fan-out, idempotent operation delivery.
14. Further Reading #
- “High-Latency, Low-Bandwidth Windowing in the Jupiter Collaboration System” (Nichols et al., UIST 1995): The original Jupiter OT paper. Short (8 pages) and the theoretical foundation for Google Docs. Read this before any interview.
- Yjs CRDT library (Kevin Jahns): The leading open-source CRDT for collaborative text editing. Its README explains why the author chose CRDT over OT and the memory trade-offs involved — essential reading for understanding the alternative.
- Google Wave “Federation Protocol” (2009): Google’s open-source attempt to federate collaborative editing across servers. The protocol whitepaper explains multi-server OT, which is significantly harder than single-server OT and explains why Wave was ultimately discontinued.
- “Logoot: A Scalable Optimistic Replication Algorithm for Collaborative Editing on P2P Networks” (Weiss et al., 2009): The original sequence CRDT paper; foundational for understanding how CRDTs solve the same problem OT solves, but with different trade-offs.