Dropbox / Google Drive — Distributed File Sync at Scale

Lakshay Jawa

1. Hook
#

In 2011, Dropbox engineers discovered that roughly 70% of all uploaded data was already on their servers — users syncing the same PDFs, stock photos, and installer packages. Switching from file-level to block-level deduplication immediately cut bandwidth costs by more than two-thirds. That insight defines the whole discipline of cloud file sync: the hard problems are not storage capacity or even bandwidth, but delta detection, deduplication, conflict resolution, and consistency across an arbitrarily large fleet of devices. Google Drive went further, embedding a collaborative editing layer (Docs, Sheets, Slides) on top of the same blob store. Today both systems handle hundreds of millions of users, billions of files, and near-real-time sync across mobile, desktop, and web clients — often over flaky connections.


2. Problem Statement
#

Functional Requirements
#

  1. Users can upload, download, update, delete, rename, and move files/folders.
  2. Changes on one device are synced to all other devices of the same user within seconds.
  3. Files can be shared with other users with configurable permissions (viewer / editor / owner).
  4. A full version history is maintained; users can roll back to any prior version.
  5. Large files upload efficiently even when the connection drops mid-transfer.

Non-Functional Requirements
#

| Attribute | Target |
|---|---|
| Sync latency (small file, good connection) | < 5 s end-to-end |
| Upload resumability | Resume from last committed chunk after reconnect |
| Storage efficiency | Block-level deduplication across all users |
| Availability | 99.99% (< 53 min downtime/year) |
| Durability | 99.999999999% (11 nines) via multi-region replication |
| Concurrent editor conflict handling | Conflicted-copy fork for binary files; OT (Operational Transformation) for Google Docs |

Out of Scope
#

  • Real-time collaborative editing internals (Google Docs OT engine)
  • Mobile-specific delta-sync protocols (rsync-over-cellular optimisations)
  • Virus scanning and DLP (Data Loss Prevention) pipelines
  • Billing and quota enforcement

3. Scale Estimation
#

Assumptions:

  • 500M registered users; 50M Daily Active Users (DAU).
  • Average storage per user: 15 GB → total corpus ~7.5 EB.
  • Average file size: ~4 MB (≈ one block); average daily churn: 5 files updated per DAU.
  • Block size: 4 MB; deduplication hit rate: 60% (blocks already stored).
  • Metadata reads heavily outweigh writes (read:write ≈ 10:1 on metadata layer).

| Metric | Calculation | Result |
|---|---|---|
| Upload QPS (Queries Per Second) | 50M DAU × 5 files / 86 400 s | ~2 900 QPS |
| Unique blocks written/day | 2 900 QPS × 40% unique × 1 block/file × 86 400 s | ~100M blocks/day |
| New block data/day | 100M × 4 MB | ~400 TB/day raw ingest |
| Metadata reads | 2 900 × 10 | ~29 000 QPS |
| Storage corpus (with 3× replication) | 7.5 EB × 3 | ~22.5 EB |
| Bandwidth (uploads, post-dedup) | 400 TB / 86 400 s | ~37 Gbps sustained |
| Cache size (hot metadata, 20% of records) | 500M files × 20% × 2 KB/record | ~200 GB |

At a 60% dedup hit rate, this saves ~600 TB/day of storage and bandwidth compared to naïvely uploading whole files.
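
As a sanity check, the core arithmetic above can be reproduced in a few lines. This is pure back-of-envelope work: the inputs are the assumptions listed, not measurements.

// Back-of-envelope sizing sketch; inputs are the assumptions from the table above.
public class SyncScaleEstimate {
    public static void main(String[] args) {
        long dau = 50_000_000L;
        double changesPerDauPerDay = 5;
        double uniqueBlockRatio = 0.40;            // 60% dedup hit rate
        long blockSizeBytes = 4_000_000L;          // ~4 MB average block (decimal, to match the table)

        double uploadQps = dau * changesPerDauPerDay / 86_400;
        double uniqueBlocksPerDay = uploadQps * uniqueBlockRatio * 86_400;
        double newBytesPerDay = uniqueBlocksPerDay * blockSizeBytes;
        double sustainedGbps = newBytesPerDay * 8 / 86_400 / 1e9;

        System.out.printf("Upload QPS         ~%.0f%n", uploadQps);
        System.out.printf("Unique blocks/day  ~%.0fM%n", uniqueBlocksPerDay / 1e6);
        System.out.printf("New block data/day ~%.0f TB%n", newBytesPerDay / 1e12);
        System.out.printf("Sustained ingest   ~%.0f Gbps%n", sustainedGbps);
    }
}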


4. High-Level Design
#

The system separates metadata (file names, folder hierarchy, permissions, versions) from blob storage (the actual file bytes, stored as content-addressable blocks).

flowchart TD
    subgraph Client["Desktop / Mobile Client"]
        W[Watcher / File System Events]
        CH[Chunker & Hasher]
        SQ[Sync Queue]
    end

    subgraph Control["Control Plane"]
        MS[Metadata Service]
        NQ[Notification Service\nlong-poll / WebSocket]
    end

    subgraph Data["Data Plane"]
        BL[Block Upload API\nPre-signed URLs]
        BS[(Block Store\nS3 / GCS — content-addressed)]
        CD[CDN Edge\nfor downloads]
    end

    subgraph Meta["Metadata Store"]
        PG[(PostgreSQL / Spanner\nfile tree + versions)]
        RC[(Redis\nhot metadata cache)]
    end

    W -->|file changed| CH
    CH -->|block hashes| SQ
    SQ -->|check-then-commit| MS
    MS -->|blocks needed| BL
    BL -->|PUT blocks| BS
    MS -->|commit version| PG
    PG -->|cache warm| RC
    MS -->|push notification| NQ
    NQ -->|wake up peers| Client
    BS -->|download| CD
    CD -->|deliver blocks| Client

Write path: Client detects a file change → chunks and hashes → asks Metadata Service which blocks are missing → uploads only missing blocks to Block Store via pre-signed URL → commits the new file version atomically → Notification Service pushes a delta to all other devices of that user.

Read / sync path: Peer device receives notification → fetches updated metadata → downloads missing blocks from CDN edge (or Block Store on cache miss) → reassembles file locally.

Component Roles
#

| Component | Responsibility | Key Choice |
|---|---|---|
| Client Chunker | Split file into fixed-size (4 MB) or variable-size (CDC) blocks; compute SHA-256 per block | Content-Defined Chunking (CDC) gives better dedup on insertions |
| Metadata Service | File tree, version chain, sharing ACLs, block manifest per version | Strong consistency (Spanner / CockroachDB) for conflict-free commits |
| Block Store | Immutable, content-addressed blob store; never mutates a block | S3 / GCS with key = SHA-256 of block content |
| Notification Service | Fan-out change events to all connected devices of a user | Long-poll or Server-Sent Events (SSE); WebSocket for mobile |
| CDN Edge | Cache popular / recently-accessed blocks close to users | CloudFront / Fastly; cache key = block SHA-256 (immutable, infinite TTL) |

5. Deep Dive — Critical Components
#

5a. Chunking & Deduplication
#

The client splits each file into blocks. Fixed-size chunking (4 MB) is simple but fragile: inserting a byte near the start shifts all subsequent block boundaries, destroying the dedup hit. Content-Defined Chunking (CDC) using a rolling hash (Rabin fingerprint / Gear hash) finds natural split points, so an insertion only affects nearby blocks. Dropbox uses a custom CDC implementation; average chunk size ~4 MB, minimum 512 KB, maximum 8 MB.
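
To make the idea concrete, here is a minimal Gear-style CDC chunker. The seed, mask, and size bounds are illustrative choices, not Dropbox's actual parameters; real implementations also work on streams rather than whole byte arrays.

import java.util.Random;

// Minimal Content-Defined Chunking sketch using a Gear-style rolling hash.
public class GearChunker {
    private static final long[] GEAR = new long[256];
    static {
        // Fixed seed so the same content always produces the same boundaries.
        Random r = new Random(42);
        for (int i = 0; i < 256; i++) GEAR[i] = r.nextLong();
    }

    static final int MIN = 512 * 1024;        // 512 KB minimum chunk
    static final int MAX = 8 * 1024 * 1024;   // 8 MB maximum chunk
    static final long MASK = (1 << 22) - 1;   // boundary when the low 22 bits are zero → ~4 MB average

    /** Returns the length of the next chunk starting at `offset` in `data`. */
    static int nextChunkLength(byte[] data, int offset) {
        long hash = 0;
        int end = Math.min(data.length, offset + MAX);
        for (int i = offset; i < end; i++) {
            hash = (hash << 1) + GEAR[data[i] & 0xFF];
            // Only test for a boundary once the minimum chunk size is reached.
            if (i - offset + 1 >= MIN && (hash & MASK) == 0) {
                return i - offset + 1;
            }
        }
        return end - offset;                  // hit MAX (or end of file) without finding a boundary
    }
}

Because the split point depends only on the bytes near it, inserting data early in a file shifts at most a handful of chunk boundaries instead of all of them.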

Each block’s storage key is SHA-256(content). Before uploading, the client sends the full list of block hashes to the Metadata Service. The service responds with only the hashes it has not seen before — the client uploads only those blocks. This is server-side deduplication at the block level, and it works across all users (if two users upload the same ISO image, only one copy is stored).

5b. Block Upload with Pre-signed URLs
#

Routing large binary blobs through the Metadata Service would waste application-tier resources. Instead:

  1. Client POSTs the full list of the file's block hashes to POST /v1/blocks/check.
  2. Metadata Service returns a list of pre-signed PUT URLs (one per missing block) valid for 15 minutes.
  3. Client PUTs each block directly to S3/GCS — no application server in the hot path.
  4. Client POSTs a commit request: POST /v1/files/{id}/versions with the full block manifest.
  5. Metadata Service validates all blocks exist in the store, then atomically writes the new version row.
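
Step 2 could be implemented with the cloud SDK's URL-signing API. A minimal sketch using the AWS SDK for Java v2 follows; the bucket name and key prefix are illustrative assumptions, not Dropbox's actual layout.

import java.net.URL;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.PutObjectPresignRequest;

// Issue one pre-signed PUT URL per missing block, keyed by the block's SHA-256.
public class BlockUrlIssuer {
    private static final String BUCKET = "blocks";             // hypothetical bucket name
    private final S3Presigner presigner = S3Presigner.create();

    public Map<String, URL> presignMissingBlocks(List<String> missingSha256s) {
        Map<String, URL> urls = new LinkedHashMap<>();
        for (String sha256 : missingSha256s) {
            PutObjectRequest put = PutObjectRequest.builder()
                    .bucket(BUCKET)
                    .key("blocks/" + sha256)                    // content-addressed key
                    .build();
            PutObjectPresignRequest presign = PutObjectPresignRequest.builder()
                    .signatureDuration(Duration.ofMinutes(15))  // matches the 15-minute expiry above
                    .putObjectRequest(put)
                    .build();
            urls.put(sha256, presigner.presignPutObject(presign).url());
        }
        return urls;
    }
}

The commit payload itself (steps 4–5) is just the ordered block manifest: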
import java.time.Instant;
import java.util.List;
import org.springframework.transaction.annotation.Transactional;

// The manifest is the ordered list of block hashes that make up one version of a file.
record BlockManifest(String fileId, long parentVersionId, List<String> blockSha256s) {}

record CommitRequest(BlockManifest manifest, String clientDeviceId, Instant clientMtime) {}

// Pseudocode — Metadata Service commit logic
@Transactional
public FileVersion commit(CommitRequest req) {
    // 1. Every block referenced by the manifest must already exist in the block store.
    var missing = blockStore.findMissing(req.manifest().blockSha256s());
    if (!missing.isEmpty()) throw new BlocksNotUploadedException(missing);

    // 2. Reject commits that are not based on the current head — this is how concurrent
    //    offline edits are detected (see the conflict-resolution trade-off in section 7).
    long head = versionRepo.currentVersion(req.manifest().fileId());
    if (req.manifest().parentVersionId() != head) throw new VersionConflictException(head);

    // 3. Atomically append the new immutable version row.
    long newVersion = versionRepo.nextVersion(req.manifest().fileId());
    var version = new FileVersion(
        req.manifest().fileId(), newVersion,
        req.manifest().blockSha256s(),
        req.clientMtime(), Instant.now()
    );
    versionRepo.save(version);

    // 4. Wake up the user's other devices.
    notificationService.fanOut(req.manifest().fileId(), newVersion);
    return version;
}
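
On the client side, the same check → direct PUT → commit sequence looks roughly like the sketch below, using the JDK's built-in HTTP client. Endpoint payloads, JSON (de)serialisation, and the helper methods are placeholders for brevity.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

// Client-side sketch of the check → upload → commit flow.
public class BlockUploader {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI api;

    BlockUploader(URI api) { this.api = api; }

    // blocksBySha256 must preserve the file's block order (e.g. a LinkedHashMap).
    void sync(String fileId, long parentVersion, Map<String, Path> blocksBySha256) throws Exception {
        // 1. Ask which blocks the server is missing (POST /v1/blocks/check).
        Map<String, URI> missing = checkMissing(blocksBySha256.keySet());

        // 2. PUT each missing block straight to the store via its pre-signed URL.
        for (var e : missing.entrySet()) {
            HttpRequest put = HttpRequest.newBuilder(e.getValue())
                    .PUT(HttpRequest.BodyPublishers.ofFile(blocksBySha256.get(e.getKey())))
                    .build();
            HttpResponse<Void> resp = http.send(put, HttpResponse.BodyHandlers.discarding());
            if (resp.statusCode() / 100 != 2) throw new IllegalStateException("block upload failed");
        }

        // 3. Commit the full ordered manifest (POST /v1/files/{id}/versions).
        commit(fileId, parentVersion, List.copyOf(blocksBySha256.keySet()));
    }

    private Map<String, URI> checkMissing(Iterable<String> sha256s) { /* call /v1/blocks/check */ return Map.of(); }
    private void commit(String fileId, long parentVersion, List<String> manifest) { /* call the commit endpoint */ }
}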

5c. Sync Engine — Change Detection
#

The desktop client runs a file-system watcher (FSEvents on macOS, inotify on Linux, ReadDirectoryChangesW on Windows). On change:

  1. Compute SHA-256 of the changed file (stream-hash, never load whole file in RAM).
  2. Compare against locally cached hash. If equal, skip (spurious event).
  3. Chunk the file; compare block hashes against the cached manifest.
  4. Upload only changed blocks; commit new version.

For large files, the upload is effectively idempotent — block keys are content hashes, so re-uploading a block is harmless. If the transfer is interrupted, the commit never fires; on reconnect the client re-checks which blocks are already in the store and uploads only the remainder. Resumable upload falls out for free.
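
Steps 1 and 3 of the list above reduce to a constant-memory hash and a set difference. A sketch, with the cached manifest assumed to be available locally:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.HexFormat;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Change detection: stream-hash the file, then diff block hashes against the cached manifest.
public class ChangeDetector {

    /** Step 1 — streaming SHA-256 of the whole file (64 KB read buffer, constant memory). */
    static String fileSha256(Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = in.read(buf)) != -1; ) digest.update(buf, 0, n);
        }
        return HexFormat.of().formatHex(digest.digest());
    }

    /** Step 3 — block hashes present in the new version but absent from the cached manifest. */
    static Set<String> changedBlocks(List<String> cachedManifest, List<String> newBlockHashes) {
        Set<String> known = new HashSet<>(cachedManifest);
        Set<String> changed = new LinkedHashSet<>();
        for (String sha256 : newBlockHashes) {
            if (!known.contains(sha256)) changed.add(sha256);
        }
        return changed;
    }
}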

5d. Notification & Peer Sync
#

When a commit completes, the Metadata Service publishes an event to a Kafka topic keyed by userId. The Notification Service consumes this stream and pushes a lightweight delta to all devices of that user that are currently connected:

{ "fileId": "f123", "version": 42, "timestamp": "2026-04-28T10:00:00Z" }

The receiving device fetches the new block manifest, diffs it against its local manifest, downloads only missing blocks from the CDN, and reassembles the file.
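
On the producer side, a simple way to keep fan-out cheap is to coalesce commits per user before pushing; the 500 ms window echoes the mitigation in the failure-modes table below, and the push itself is just a "wake up" — the device then performs a cursor-based catch-up. A sketch, with the gateway hand-off stubbed out:

import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Per-user coalescing: rapid successive commits collapse into one push per window.
public class CoalescingNotifier {
    private final Map<String, Long> pendingCursor = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Duration window = Duration.ofMillis(500);   // illustrative window length

    /** Called for every commit event consumed from the Kafka topic (keyed by userId). */
    public void onCommit(String userId, long cursor) {
        // compute() is atomic per key: schedule a flush only when a new window opens.
        pendingCursor.compute(userId, (u, pending) -> {
            if (pending == null) {
                scheduler.schedule(() -> flush(u), window.toMillis(), TimeUnit.MILLISECONDS);
                return cursor;
            }
            return Math.max(pending, cursor);
        });
    }

    private void flush(String userId) {
        Long cursor = pendingCursor.remove(userId);
        if (cursor != null) pushToConnectedDevices(userId, cursor);
    }

    private void pushToConnectedDevices(String userId, long cursor) {
        // Hand off to the WebSocket / SSE gateway holding this user's connections.
    }
}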


6. Data Model
#

files table
#

| Column | Type | Notes |
|---|---|---|
| file_id | UUID PK | Immutable identifier |
| owner_user_id | UUID FK | Owner for quota accounting |
| parent_folder_id | UUID FK | Nullable (root files) |
| name | VARCHAR(1024) | Display name; not part of storage key |
| is_deleted | BOOL | Soft delete; purged after 30-day trash TTL |
| created_at | TIMESTAMPTZ | |

file_versions table
#

| Column | Type | Notes |
|---|---|---|
| version_id | BIGINT | Monotonically increasing per file; PK is (file_id, version_id) |
| file_id | UUID FK | |
| block_sha256s | TEXT[] | Ordered list of block hashes |
| size_bytes | BIGINT | Sum of all block sizes |
| client_mtime | TIMESTAMPTZ | Device-local mtime at commit time |
| committed_at | TIMESTAMPTZ | Server commit timestamp |
| device_id | UUID | Which device created this version |

Indexes:

  • (file_id, version_id DESC) — fetch latest version, range-scan history.
  • (owner_user_id, committed_at DESC) — “recent activity” feed.

Partitioning: Partition file_versions by committed_at month; old partitions are archived to cold storage after 12 months.

shares table
#

| Column | Type | Notes |
|---|---|---|
| share_id | UUID PK | |
| file_id | UUID FK | |
| grantee_user_id | UUID | Nullable; null = link share |
| permission | ENUM('viewer', 'editor', 'owner') | |
| expires_at | TIMESTAMPTZ | Nullable |

7. Trade-offs
#

Chunking Strategy: Fixed-size vs CDC
#

| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Fixed 4 MB | Simple, deterministic split points | Poor dedup after insertions | Append-only files (logs) |
| CDC (Rabin/Gear) | High dedup even after mid-file edits | More CPU on client; variable chunk sizes complicate caching | General-purpose file sync |

Conclusion: CDC wins for user files where edits happen in the middle (documents, code). The extra CPU cost (< 50 ms for a 100 MB file on modern hardware) is negligible compared to bandwidth saved.

Consistency Model: Strong vs Eventual
#

| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Strong (Spanner / CockroachDB) | No lost updates; clean conflict detection | Higher write latency (~10 ms cross-region) | Metadata commits |
| Eventual (Cassandra) | Lower latency, easier horizontal scale | Requires client-side conflict merge logic | Block existence checks |

Conclusion: Use strong consistency for the metadata commit (file version row) — losing an update is catastrophic for user trust. Use eventual consistency for read-path caches and the block existence index.

Conflict Resolution: Last-Write-Wins vs Fork
#

| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Last-Write-Wins (LWW) | Simple; no user friction | Silently discards offline edits | Only when silently losing the older edit is acceptable |
| Conflict Fork (Dropbox model) | No data loss; user sees "conflicted copy" | User must manually merge | Default for user files (binary and text) when offline edits collide |
| Operational Transformation (OT) | Seamless collaborative editing | Complex; requires operational log | Google Docs / real-time collab |

Conclusion: For binary files, create a “conflicted copy” on the loser’s device and commit both versions — no data loss. For collaborative documents, delegate to the OT engine (out of scope here).
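
A sketch of what the fork looks like in practice: the commit is rejected when its parent is no longer the head, and the losing edit is re-committed as a sibling file. The naming convention below is illustrative, not Dropbox's exact format.

import java.time.LocalDate;

// Conflicted-copy fork: detect the conflict, then rename the losing edit into a sibling file
// (persistence of the sibling then follows the normal commit path from section 5b).
public class ConflictPolicy {

    /** A commit conflicts when it was not based on the current head version. */
    static boolean isConflict(long parentVersionId, long currentHeadVersion) {
        return parentVersionId != currentHeadVersion;
    }

    /** e.g. "report.docx" → "report (Ana's laptop's conflicted copy 2026-04-28).docx" */
    static String conflictedCopyName(String fileName, String deviceName) {
        String stamp = " (" + deviceName + "'s conflicted copy " + LocalDate.now() + ")";
        int dot = fileName.lastIndexOf('.');
        return dot <= 0 ? fileName + stamp
                        : fileName.substring(0, dot) + stamp + fileName.substring(dot);
    }
}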


8. Failure Modes
#

| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Block Store (S3) | Regional outage | Downloads fail; uploads stall | Cross-region replication (S3 CRR); serve reads from replica; queue uploads locally |
| Metadata Service | DB primary failure | Commits blocked; reads may stall | Automatic failover (Spanner's multi-region leader election); read from replicas during commit downtime |
| Notification Service | Missed push (client offline) | Peer device not synced until it reconnects | On reconnect, client polls for versions newer than its local watermark (cursor-based catch-up) |
| Upload (client side) | Network drop mid-upload | Partial blocks in store; commit never fires | Blocks are immutable; on reconnect, re-check missing blocks and upload only those; no dangling state |
| Dedup index (block SHA-256 registry) | Cache corruption / false positive | Client skips upload; block missing at read time | Block store is source of truth; dedup cache is advisory only; validate block existence at commit time |
| Hot user (large team folder) | Thousands of notification fan-outs per second | Notification Service overload | Coalesce events per user/folder within a 500 ms window before fanning out; rate-limit per shared folder |

9. Security & Compliance
#

Authentication & Authorization (AuthN/AuthZ):

  • OAuth 2.0 with PKCE (Proof Key for Code Exchange) for third-party apps. First-party clients use short-lived JWTs (JSON Web Tokens) signed with rotating RSA (Rivest–Shamir–Adleman) keys.
  • Every API call is validated against an ACL (Access Control List) check in the Metadata Service before block URLs are issued. Sharing a folder grants read/write on all descendant files — evaluated lazily at request time, not materialised into every row.

Encryption:

  • At rest: blocks in S3/GCS encrypted with AES-256 (Advanced Encryption Standard 256-bit); metadata DB encrypted with TDE (Transparent Data Encryption). Encryption keys managed by the cloud KMS (Key Management Service) with per-customer CMKs (Customer-Managed Keys) for enterprise tiers.
  • In transit: TLS (Transport Layer Security) 1.3 everywhere; pre-signed block URLs expire in 15 minutes and are scoped to a single block hash.

Input Validation:

  • Block SHA-256 hashes are validated server-side before issuing pre-signed URLs — a client cannot request a URL for an arbitrary key pattern.
  • File names are Unicode-normalised and stripped of path traversal sequences (../) before storage.

GDPR / Right to Erasure:

  • Soft-delete moves files to trash; hard-delete after 30 days purges the file_versions rows. Because blocks are shared across users (dedup), a block is only physically deleted when its reference count drops to zero — tracked in a separate GC (Garbage Collection) job that runs nightly.
  • Crypto-shredding for enterprise: encrypt each user’s blocks with a per-user DEK (Data Encryption Key); erasure = delete the DEK.

Audit Log:

  • Immutable audit stream: every file operation (upload, download, share, delete, permission change) emitted to a WORM (Write Once Read Many) log (S3 Object Lock / BigQuery append-only table). Required for SOC 2 (System and Organization Controls 2) Type II and HIPAA (Health Insurance Portability and Accountability Act) enterprise customers.

Rate Limiting:

  • Per-user upload/download throughput capped at the Metadata Service layer (token bucket, 100 MB/s default, configurable per tier). Prevents a single power user from monopolising Block Store bandwidth.
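
A per-user token bucket is only a few lines. The sketch below is in-process only; a real deployment would back it with a shared store such as Redis so the cap holds across API servers, and the 100 MB/s figure is just the default mentioned above.

// Minimal token bucket for per-user throughput capping.
public class TokenBucket {
    private final double capacityBytes;       // burst size
    private final double refillBytesPerSec;   // sustained rate (e.g. 100 MB/s default)
    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    public TokenBucket(double capacityBytes, double refillBytesPerSec) {
        this.capacityBytes = capacityBytes;
        this.refillBytesPerSec = refillBytesPerSec;
        this.tokens = capacityBytes;
    }

    /** Returns true if the caller may transfer `bytes` now; false means throttle (HTTP 429). */
    public synchronized boolean tryConsume(long bytes) {
        long now = System.nanoTime();
        tokens = Math.min(capacityBytes, tokens + (now - lastRefillNanos) / 1e9 * refillBytesPerSec);
        lastRefillNanos = now;
        if (tokens < bytes) return false;
        tokens -= bytes;
        return true;
    }
}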

10. Observability
#

RED Metrics (Rate / Errors / Duration)
#

| Signal | Metric | Alert Threshold |
|---|---|---|
| Upload rate | uploads_committed_total (counter) | < 50% of 7-day baseline → PagerDuty |
| Upload error rate | uploads_failed_total / uploads_attempted_total | > 1% over 5 min |
| Commit latency (p99) | metadata_commit_duration_seconds | > 2 s |
| Download error rate | block_download_errors_total | > 0.1% over 5 min |
| Notification delivery lag | notification_lag_seconds (histogram) | p95 > 10 s |

Saturation Metrics
#

| Resource | Metric | Alert Threshold |
|---|---|---|
| Block Store | S3 request rate vs service quota | > 80% of quota |
| Metadata DB | Replication lag | > 5 s |
| Notification Service | Connection count | > 90% of max |

Business Metrics
#

  • Sync success rate: fraction of file changes that reach all devices within 30 s.
  • Dedup ratio: bytes_skipped / bytes_attempted — tracks storage efficiency.
  • p99 end-to-end sync latency: from client commit to peer reassembly; target < 30 s.

Tracing
#

Distributed traces (OpenTelemetry) span the full upload path: client SDK → Metadata Service → Block Store → Notification Service → peer client. The (file_id, version_id) pair is attached to every span and log line as a correlation attribute (version_id alone is only unique per file), enabling timeline reconstruction for any sync incident.
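
A sketch of how the Metadata Service could wrap a commit in a span and attach those correlation attributes with the OpenTelemetry Java API; the span and attribute names are illustrative, not a fixed convention.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Wrap the commit path in a span carrying the file_id / version_id correlation attributes.
public class CommitTracing {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("metadata-service");

    void recordCommit(String fileId, long versionId, Runnable persist) {
        Span span = tracer.spanBuilder("metadata.commit").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("file_id", fileId);       // correlation attributes on every span
            span.setAttribute("version_id", versionId);
            persist.run();                              // the actual commit work runs inside the span
        } finally {
            span.end();
        }
    }
}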


11. Scaling Path
#

Phase 1 — MVP (< 10K DAU)
#

Single-region deployment. PostgreSQL for metadata (single primary). S3 for blocks. No CDN — clients download directly from S3. Monolithic backend service. No notification push — clients poll every 30 s.

What breaks first: Polling at 10K DAU generates 333 QPS of metadata reads — manageable but noisy.

Phase 2 — Growth (10K → 500K DAU)
#

Replace polling with long-poll / SSE (Server-Sent Events) connections. Add Redis for hot metadata cache. Add CDN for block downloads. Introduce the check-then-commit block dedup flow. Split Metadata Service from Block Upload API.

What breaks first: Metadata DB write throughput. A single Postgres primary tops out around 5 000 write TPS (Transactions Per Second). At 500K DAU with 5 file changes/day, average writes are only ~30 TPS — fine on paper. But shared-folder fan-outs can spike this 100×.

Phase 3 — Scale (500K → 10M DAU)
#

Introduce Spanner / CockroachDB for globally distributed metadata with strong consistency. Shard notification connections across a fleet of WebSocket servers, using Kafka as the fan-out backbone. Add a dedup bloom filter in the client to skip the server check for blocks the client has previously confirmed exist. Introduce async version garbage collection.

What breaks first: Notification fan-out for large shared folders (teams with thousands of members). Move to a hierarchical fan-out: folder-level subscription with coalescing.

Phase 4 — Hyperscale (10M → 500M DAU)
#

Per-region block store with cross-region replication and intelligent routing (serve blocks from the region nearest to the device). Multi-cell metadata sharding by user_id range. Separate quota service. ML-driven prefetch: predict which files a user will open on their mobile device and pre-warm the CDN before they arrive.

What breaks first: Block store cold-start costs. Tiered storage (hot/warm/cold) with lifecycle policies moves infrequently-accessed blocks to cheaper tiers (S3 Glacier).


12. Enterprise Considerations
#

Brownfield Integration:

  • Large enterprises already run SharePoint, NFS (Network File System) shares, or on-prem NAS (Network-Attached Storage). Dropbox Business and Google Workspace offer on-prem sync agents that bridge the local file system to the cloud store without migrating all data at cutover.

Build vs Buy:

  • Block Store: always buy (S3/GCS/Azure Blob). Building a durable, globally-replicated object store from scratch takes years.
  • Metadata DB: Spanner for Google, CockroachDB or Aurora Global for others. Viable open-source option: PostgreSQL with Citus for sharding.
  • CDN: CloudFront, Fastly, or Akamai. The immutable block cache key (SHA-256) means CDN hit rates can exceed 90% for popular content.
  • Notification: build in-house on top of Kafka + a WebSocket gateway; off-the-shelf solutions (Pusher, Ably) work for early stages.

Multi-Tenancy:

  • Enterprise customers (e.g., a hospital using Google Workspace) require data residency (blocks stored only in the EU). Implement per-tenant storage class with a region affinity tag on the metadata row. Block upload routing respects the tag.
  • Noisy-neighbour risk: a single enterprise team with 10 000 members generates massive notification fan-out on every commit. Rate-limit notifications per shared folder, coalesce within a 1 s window.

TCO (Total Cost of Ownership) Ballpark:

  • Block storage: ~$0.023/GB/month (S3 Standard). At the ~7.5 EB corpus estimated above: ~$172M/month at list price — the economics that pushed Dropbox toward its own block store (see Magic Pocket below).
  • Dedup savings: a 60% hit rate brings the effective cost to roughly $0.009 per logical GB per month.
  • Egress: $0.09/GB from S3. CDN offloads ~85% → effective egress ~$0.013/GB-equivalent.
  • Compute (Metadata Service + Notification): ~$50K/month at 50M DAU scale.

Conway’s Law Implication: The clean split between Metadata Service and Block Store almost always maps to two separate engineering teams. The API boundary (block manifests, pre-signed URLs) becomes the contract between those teams — keep it stable and versioned.


13. Interview Tips
#

  • Clarify scale first: “How many users, average file size, what’s the expected change rate per user per day?” These numbers drive every sizing decision. A 10K-user startup and a 500M-user consumer product have totally different bottlenecks.
  • Lead with chunking and dedup: Most candidates jump to “store files in S3” — the 10× more interesting answer is why you chunk first, what CDC buys you, and how server-side dedup cuts bandwidth costs. This is the differentiating insight.
  • Don’t forget the client sync engine: Interviewers often probe “how does the desktop client know what changed?” Cover file system watchers, the local manifest cache, and how the sync queue batches rapid successive saves.
  • Nail conflict resolution: “What happens when two devices edit the same file offline?” This is the canonical follow-up. Know the three options (LWW, conflict fork, OT) and when each is appropriate.
  • Vocabulary that signals fluency: content-addressable storage, CDC (Content-Defined Chunking), pre-signed URLs, idempotent block upload, watermark-based catch-up sync, crypto-shredding for GDPR erasure, WORM audit log.

14. Further Reading
#

  • Dropbox Magic Pocket (2016): Dropbox’s engineering blog post on building their own block store to replace S3 — covers erasure coding, rack-aware placement, and the economics of going on-prem at exabyte scale.
  • Google’s Colossus: The successor to GFS (Google File System) that underpins Google Drive’s blob layer. The original GFS paper (Ghemawat et al., SOSP 2003) remains the canonical reference for distributed file system design.
  • rsync algorithm (Andrew Tridgell, 1996): The rolling-checksum delta-sync algorithm that inspired modern chunking approaches. Short and readable — understanding it deeply answers 80% of “how do you sync efficiently over a slow link?” questions.
  • CAP Theorem (Brewer, 2000): The theoretical foundation for the consistency trade-off between the Metadata Service (CP) and the block existence cache (AP).
