WhatsApp / Chat Messaging System

Author: Lakshay Jawa
Sharing knowledge on system design, Java, Spring, and software engineering best practices.

1. Hook

WhatsApp delivers 100 billion messages every day to 2 billion users across 180+ countries — all end-to-end encrypted (E2EE), with sub-second latency, and with a global engineering team historically smaller than 50 engineers. The system does this while providing strong delivery guarantees (a message is either delivered exactly once or the sender knows it was not), preserving per-conversation message ordering even when users switch networks mid-send, and maintaining ephemeral server storage so that once a message is delivered it lives only on client devices.

Three hard problems must be solved simultaneously: keeping persistent TCP (Transmission Control Protocol) connections alive for every active user at planetary scale; coordinating delivery receipts across a sender, multiple recipient devices, and the server; and doing all of this end-to-end encrypted so the server never sees plaintext — meaning even WhatsApp’s own engineers cannot read your messages. Understanding how these three constraints interact is the central challenge of this design.


2. Problem Statement

Functional Requirements

  1. Users can send and receive 1-on-1 text messages.
  2. Users can send and receive group messages (up to 1,024 members).
  3. Messages carry delivery status: Sent (one grey tick), Delivered (two grey ticks), Read (two blue ticks).
  4. Users can send media: photos, videos, voice notes, documents.
  5. Users can see a contact’s online/typing status and last-seen timestamp.
  6. All messages are end-to-end encrypted; the server never holds decryptable content.
  7. Messages sent while the recipient is offline are queued and delivered on reconnection.
  8. Multi-device: the same account can be active on up to 4 linked devices simultaneously.

Non-Functional Requirements

| Requirement | Target |
|---|---|
| Message delivery latency (online recipient) | < 200 ms P99 |
| Message delivery latency (offline → reconnect) | < 1 s after reconnect |
| Availability | 99.95% (< 4.4 h downtime/year) |
| Durability of undelivered messages | 30-day server queue TTL (Time-To-Live) |
| Encryption | E2EE via Signal Protocol (no server-side plaintext) |
| Scale | 2 B users, 100 B messages/day, ~1.16 M msgs/sec average (~3.5 M peak) |
| Media storage | Exabytes; client-initiated uploads to object store |

Out of Scope

  • Payments (WhatsApp Pay).
  • Business API / WhatsApp Business Platform.
  • Full-text message search across history.

3. Scale Estimation

Assumptions:

  • 1 B Daily Active Users (DAU) out of 2 B registered users.
  • Average of ~100 messages sent per DAU per day.
  • Peak traffic is 3× the daily average (evening hours in major timezones overlap).
  • 20% of messages include media (photos, video, voice).
  • Average media size: 500 KB after client-side compression.
  • 5% of messages queue for offline recipients at any given moment.
Daily active users (DAU):          ~1 B
Messages/day:                      100 B  →  ~1.16 M msgs/sec average
Peak multiplier:                   3×     →  ~3.5 M msgs/sec
Concurrent WebSocket connections:  ~500 M (50% of DAU active at peak)

Message size (text, median):       ~200 bytes
Text message throughput:           1.16 M × 200 B = ~230 MB/s raw, ~20 TB/day

Media messages (~20% of total):    20 B/day
Average media size:                500 KB
Media upload bandwidth:            ~20 B × 500 KB / 86,400 s = ~115 GB/s (average)
Media storage (30-day retention):  20 B × 500 KB × 30 ≈ 300 PB (before dedup)

Undelivered message queue:
  ~5% of messages queue for offline recipients
  5 B messages × 200 B = 1 TB in queue at any time (manageable in Cassandra)

The most constraining resource is concurrent TCP connections (500 M sockets), not raw message throughput. Each socket consumes kernel memory (~4 KB minimum), so the gateway fleet needs ~2 TB of RAM just for connection state. This is why WhatsApp’s original Erlang/OTP stack (which handles millions of lightweight processes per node) was such a good fit — each process maps to one connection with almost no overhead.
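
The arithmetic above is easy to sanity-check. A minimal back-of-the-envelope calculation in plain Java (no external dependencies; every input is one of the assumption values listed above):

// Back-of-the-envelope check of the Section 3 estimates (inputs are the stated assumptions).
public class ScaleEstimate {
    public static void main(String[] args) {
        double messagesPerDay   = 100e9;                                  // 100 B messages/day
        double secondsPerDay    = 86_400;
        double avgMsgsPerSec    = messagesPerDay / secondsPerDay;        // ~1.16 M/s
        double peakMsgsPerSec   = avgMsgsPerSec * 3;                     // ~3.5 M/s
        double textBytesPerDay  = messagesPerDay * 200;                  // ~20 TB/day
        double mediaMsgsPerDay  = messagesPerDay * 0.20;                 // 20 B media messages
        double mediaBytesPerSec = mediaMsgsPerDay * 500e3 / secondsPerDay; // ~115 GB/s
        double gatewayRamBytes  = 500e6 * 4e3;                           // 500 M sockets × 4 KB

        System.out.printf("avg msgs/s: %.2f M, peak: %.2f M%n", avgMsgsPerSec / 1e6, peakMsgsPerSec / 1e6);
        System.out.printf("text volume: %.1f TB/day%n", textBytesPerDay / 1e12);
        System.out.printf("media upload: %.0f GB/s, gateway RAM: %.1f TB%n",
                mediaBytesPerSec / 1e9, gatewayRamBytes / 1e12);
    }
}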


4. High-Level Architecture

The system decomposes into five independent planes: a connection plane (WebSocket gateways holding persistent sockets), a routing plane (Message Service distributing messages to the right gateways), a storage plane (offline queue + key store), a media plane (client-direct uploads to object storage), and a presence plane (online/typing status).

flowchart TD
    subgraph CL["Client Layer"]
        A1["Alice — Phone"]
        A2["Alice — Linked Tablet"]
        B1["Bob — Phone"]
    end

    subgraph EDGE["Edge — Connection Plane"]
        LB["Load Balancer / Anycast"]
        GW1["Chat Gateway 1\nWebSocket Server"]
        GW2["Chat Gateway 2\nWebSocket Server"]
    end

    subgraph CORE["Core Services — Routing Plane"]
        MS["Message Service\nrouting + fan-out"]
        GS["Group Service\nmembership + fan-out"]
        PS["Presence Service\nonline / typing"]
        NS["Notification Service\nAPNs / FCM"]
        KS["Key Distribution Service\npublic keys + prekeys"]
    end

    subgraph STORE["Storage Layer"]
        MQ["Message Queue\nKafka"]
        MDB["Message Store\nCassandra — offline queue"]
        KDB["Key Store\nPostgreSQL"]
        PDB["Presence Store\nRedis"]
        OB["Media Object Store\nS3-compatible"]
    end

    A1 -->|"WebSocket / TLS"| LB
    A2 -->|"WebSocket / TLS"| LB
    B1 -->|"WebSocket / TLS"| LB
    LB --> GW1
    LB --> GW2

    GW1 -->|"publish"| MQ
    GW2 -->|"publish"| MQ

    MQ --> MS
    MS -->|"online delivery"| GW1
    MS -->|"online delivery"| GW2
    MS -->|"offline queue write"| MDB
    MS -->|"push notification"| NS
    MS --> GS

    GS -->|"member list"| MDB
    GS -->|"fan-out per member"| MS

    PS -->|"heartbeat reads/writes"| PDB
    GW1 -->|"presence update"| PS
    GW2 -->|"presence update"| PS

    A1 -->|"key fetch"| KS
    KS -->|"store/fetch"| KDB

    A1 -->|"media upload URL"| OB
    A1 -->|"download URL"| OB

Component Reference

Chat Gateway
Technology: Erlang/Elixir or Java (Netty)
Role: Terminates every WebSocket (a persistent, full-duplex TCP connection over HTTP Upgrade) from client devices. Authenticates clients via a session token on connect. Forwards outbound messages to Kafka. Pushes inbound messages from the Message Service down to the connected client. Maintains a local in-memory map of {user_id → connection_handle}.
Key design decision: Each gateway node registers itself in a routing registry (Redis/etcd): {user_id → gateway_id}. When the Message Service needs to deliver a message to user U, it looks up U's gateway and calls it directly via gRPC (Google Remote Procedure Call). This avoids broadcasting to all gateways.
Failure behaviour: Gateway crash → clients reconnect (exponential backoff: 1 s → 2 s → … → 60 s). The routing registry entry expires via TTL. Messages in-flight are already in Kafka and will be re-delivered by the Message Service.

Message Service
Technology: Java / Go microservice
Role: Consumes messages from Kafka. Looks up the recipient's gateway in the routing registry. If the recipient is online, delivers the message via gRPC to that gateway. If offline, writes the message to the Cassandra offline queue and triggers a push notification (APNs for Apple devices, FCM (Firebase Cloud Messaging) for Android) to wake the client app.
Key design decision: Delivery is attempted exactly once per Kafka offset. The offline queue write and push notification happen only when the online delivery attempt fails (no routing registry entry for the recipient). This avoids double-delivery.
Failure behaviour: Kafka consumer lag → message delivery delays. Auto-rebalancing and partition scaling are the primary mitigations. Consumer lag is the most-watched operational metric.

Group Service
Technology: Java microservice
Role: Resolves group membership from Cassandra (group_id → list of member_ids). For each member, generates a per-member delivery task and publishes it back to Kafka, partitioned by recipient_id. The Message Service then handles each delivery task independently, so a group of 1,024 members results in 1,024 independent delivery paths — parallelising naturally across gateway nodes.
Key design decision: Fan-out on write: one sender message becomes N delivery tasks. For small groups this is cheap. For large groups (close to 1,024) the write amplification is significant but bounded. Sender Keys (see Section 8) prevent O(n) encryption overhead on the sender side.
Failure behaviour: Membership list cache miss → falls back to Cassandra read. Stale cache → member misses a message; corrected on next sync. Group membership changes (adds/removes) invalidate the cache via a Kafka event.

Presence Service
Technology: Go service + Redis
Role: Tracks online/typing state in Redis with short TTLs (Time-To-Live). When a client connects, the gateway calls the Presence Service to set a Redis key: presence:{user_id} = {gateway_id, connected_at} with a 90 s TTL. Every 60 s heartbeat refreshes the TTL. On disconnect, the key is deleted (or naturally expires). Subscribers (mutual contacts) receive a presence push when state changes.
Key design decision: Presence is privacy-gated: only contacts who are in your address book AND whom you haven't blocked receive real-time presence. This limits fan-out for high-follower accounts.
Failure behaviour: Redis node failure → Redis Cluster replica promotion. Short TTL means stale presence resolves within 90 s even without an explicit delete. Presence is best-effort — brief inaccuracy is acceptable.

Key Distribution Service (KDS)
Technology: Java microservice + PostgreSQL
Role: Stores and serves Signal Protocol public key material for every device. When Alice sends a first message to Bob, her client fetches Bob's prekey bundle (Identity Key, Signed PreKey, One-Time PreKey) from the KDS to perform the X3DH (Extended Triple Diffie-Hellman) handshake. After the handshake, messages flow over the Double Ratchet without contacting the KDS again.
Key design decision: One-Time PreKeys are consumed on use. The KDS monitors the remaining prekey count per device and warns clients when they drop below 10. Clients proactively upload batches of 100 new prekeys. If a device's one-time prekeys are exhausted, the KDS falls back to the Signed PreKey (still secure, but loses per-message forward secrecy for that session initiation).
Failure behaviour: KDS unavailable → gateways serve from a local Redis cache of recently fetched prekey bundles. Client retries with exponential backoff. New sessions to new contacts fail gracefully; existing ratchet sessions are unaffected.

Notification Service
Technology: Go service
Role: Sends a silent push notification (a background wake-up signal with no visible UI) to offline clients via Apple APNs (Apple Push Notification service) or Google FCM. The push carries no message content — only a signal to open a WebSocket and sync. This design preserves E2EE: the push providers (Apple and Google) never see message content.
Key design decision: Silent push is a best-effort hint. If push fails (device offline, push token expired), the message stays in the Cassandra offline queue for up to 30 days. The user retrieves it manually on next app open.
Failure behaviour: Push provider outage → messages sit in queue. No data loss; only delayed delivery. Push token expiry → Message Service catches the error and marks the token as stale; client re-registers on next connect.

Offline Queue (Cassandra)
Technology: Apache Cassandra 4.x
Role: Stores undelivered messages keyed by recipient_id. When Bob reconnects, his client sends a SYNC request; the Message Service streams all pending rows for Bob's recipient_id, then deletes them after Bob's client sends a DELIVERED ACK. Native Cassandra TTL (30 days = 2,592,000 s) handles automatic expiry of never-delivered messages without a cleanup job.
Key design decision: Partitioned by recipient_id — all pending messages for one user are co-located on the same Cassandra partition. This makes reconnect sync a single efficient range scan rather than a scatter-gather query across many nodes.
Failure behaviour: Node failure → Replication Factor 3 with QUORUM reads/writes (a majority of replicas must agree) masks single-node failures transparently. Multi-DC (Data Centre) replication provides geo-redundancy.

Media Object Store
Technology: S3-compatible object storage + CDN (Content Delivery Network)
Role: Stores client-encrypted media blobs. The client encrypts the media locally with AES-256-CBC (Advanced Encryption Standard, 256-bit key, Cipher Block Chaining mode) + HMAC-SHA256 (Hash-based Message Authentication Code) before upload. The Media Service issues a pre-signed PUT URL valid for 5 minutes; the client uploads directly without proxying through the chat gateway. Recipients download via a time-limited GET URL.
Key design decision: Gateways are kept message-only — no binary payloads. This prevents bulk media traffic from consuming the gateway connection budget. Content-addressed keys (SHA256 of plaintext) enable server-side deduplication: if Alice and Bob both send the same photo, only one blob is stored.
Failure behaviour: Upload failure → client retries the PUT up to 3× (pre-signed URL has 5-min validity; client re-requests if expired). Media blobs deleted after download ACK or 30-day TTL, whichever comes first.

5. Deep Dive — Connection Layer (WebSocket)

Why WebSocket over HTTP polling?

HTTP long-polling works like this: the client sends a request, the server holds it open until a message arrives, sends a response, and then the client immediately sends another request. This creates a constant cycle of connection teardown and re-establishment, which wastes server file descriptors and adds one full RTT (Round-Trip Time, typically 20–100 ms on mobile) of latency on every message. At 500 M concurrent users, that overhead is catastrophic.

WebSocket upgrades a standard HTTP connection into a persistent, full-duplex TCP channel. After the upgrade handshake, both client and server can send frames at any time without the request-response ceremony. A message from Alice reaches Bob in exactly one RTT once both sides are connected — no polling, no re-connection overhead.

The WebSocket handshake over TLS (Transport Layer Security — the encryption protocol that replaces the older SSL) looks like this:

Client → Server:  GET /ws HTTP/1.1
                  Upgrade: websocket
                  Connection: Upgrade
                  Sec-WebSocket-Key: <base64 nonce>

Server → Client:  HTTP/1.1 101 Switching Protocols
                  Upgrade: websocket
                  Connection: Upgrade
                  Sec-WebSocket-Accept: <HMAC of nonce>

After this exchange, the TCP connection is handed off to the WebSocket framing layer. The original HTTP server is no longer involved.

Gateway design and the routing registry

Each Chat Gateway is a stateful process holding tens of thousands of open WebSocket connections. The key challenge: when the Message Service wants to deliver a message to Bob, which of the potentially hundreds of gateway nodes is Bob connected to?

The answer is a routing registry — a set of Redis keys mapping user_id → gateway_id (one key per user, as shown below). Every gateway registers its connected users here on connect and deregisters on disconnect. The lookup is O(1).

Bob connects to GW-7:
  Redis: SET routing:{bob_id} "gw-7" EX 90

Message Service receives message for Bob:
  1. Redis GET routing:{bob_id}  →  "gw-7"
  2. gRPC call to GW-7: deliver(msg)
  3. GW-7 writes msg to Bob's WebSocket

Bob's heartbeat every 60 s:
  GW-7 refreshes:  EXPIRE routing:{bob_id} 90

If the routing registry returns no entry (Bob is offline), the Message Service writes to the Cassandra queue instead.

For multi-device support (Bob on phone + tablet), Bob has two routing entries — one per device — and the Message Service delivers to all of them in parallel.
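
A minimal sketch of that routing decision on the Message Service side. RedisClient, GatewayRpcClient, OfflineQueue and PushSender are illustrative stand-ins rather than real library APIs, and the per-device routing key layout (routing:{user}:{device}) is an assumption that mirrors the one-entry-per-device description above:

import java.util.List;

// Illustrative delivery routing in the Message Service (helper interfaces are assumptions).
public class DeliveryRouter {

    private final RedisClient redis;
    private final GatewayRpcClient gateways;
    private final OfflineQueue offlineQueue;
    private final PushSender push;

    public DeliveryRouter(RedisClient redis, GatewayRpcClient gateways,
                          OfflineQueue offlineQueue, PushSender push) {
        this.redis = redis;
        this.gateways = gateways;
        this.offlineQueue = offlineQueue;
        this.push = push;
    }

    /** Deliver one message to every device of the recipient; queue + push for offline devices. */
    public void deliver(String recipientId, List<String> deviceIds, byte[] frame) {
        boolean anyOffline = false;
        for (String deviceId : deviceIds) {
            String gatewayId = redis.get("routing:" + recipientId + ":" + deviceId);
            if (gatewayId != null) {
                gateways.deliver(gatewayId, recipientId, deviceId, frame); // online: push via gateway
            } else {
                offlineQueue.append(recipientId, deviceId, frame);         // offline: Cassandra row
                anyOffline = true;
            }
        }
        if (anyOffline) {
            push.sendSilentWakeUp(recipientId);  // content-free push; see Notification Service
        }
    }

    // Stand-in interfaces so the sketch is self-contained.
    public interface RedisClient      { String get(String key); }
    public interface GatewayRpcClient { void deliver(String gatewayId, String recipientId, String deviceId, byte[] frame); }
    public interface OfflineQueue     { void append(String recipientId, String deviceId, byte[] frame); }
    public interface PushSender       { void sendSilentWakeUp(String recipientId); }
}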

Keep-alive and reconnection

Mobile networks aggressively kill idle TCP connections (NAT (Network Address Translation) idle timeouts on cellular carriers can be as short as a few minutes). Clients send a WebSocket ping frame every 60 seconds; the gateway responds with a pong frame. Missing two consecutive pings (120 s of silence) causes the gateway to close the connection and delete the routing entry.

Clients implement exponential backoff on reconnect: 1 s, 2 s, 4 s, 8 s, … capped at 60 s. This prevents a thundering herd (a situation where thousands of clients all reconnect simultaneously after a gateway restart) from overwhelming the recovering gateway.
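
A sketch of that client-side policy: exponential backoff capped at 60 s, with a small amount of random jitter added on top (the jitter is an assumption, not stated above, but it is standard practice for spreading out a reconnect herd):

import java.util.concurrent.ThreadLocalRandom;

// Client-side reconnect backoff: 1 s, 2 s, 4 s, ... capped at 60 s, plus up to 20% jitter.
public class ReconnectBackoff {

    private static final long BASE_DELAY_MS = 1_000;
    private static final long MAX_DELAY_MS  = 60_000;

    private int attempt = 0;

    /** Delay to wait before the next reconnect attempt. */
    public long nextDelayMs() {
        long exponential = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << Math.min(attempt, 6));
        attempt++;
        long jitter = ThreadLocalRandom.current().nextLong(exponential / 5 + 1); // spread the herd
        return exponential + jitter;
    }

    /** Call after a successful connect + authentication to restart the schedule. */
    public void reset() {
        attempt = 0;
    }
}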

Java: connection registry interaction

// Simplified gateway connection registration (Java 21, virtual threads)
// RedisClient and WebSocketSession are thin wrappers around the real client libraries.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GatewayConnectionHandler {

    private final RedisClient redis;
    private final String gatewayId;
    // Local map of this node's live connections; the Redis registry holds the global view.
    private final Map<String, WebSocketSession> localSessions = new ConcurrentHashMap<>();
    private static final int PRESENCE_TTL_SECONDS = 90;

    public GatewayConnectionHandler(RedisClient redis, String gatewayId) {
        this.redis = redis;
        this.gatewayId = gatewayId;
    }

    public void onConnect(String userId, WebSocketSession session) {
        // Register in routing registry — O(1) lookup for Message Service
        redis.set("routing:" + userId, gatewayId, PRESENCE_TTL_SECONDS);
        localSessions.put(userId, session);
    }

    public void onHeartbeat(String userId) {
        // Refresh TTL so the entry doesn't expire while client is active
        redis.expire("routing:" + userId, PRESENCE_TTL_SECONDS);
    }

    public void onDisconnect(String userId) {
        redis.delete("routing:" + userId);
        localSessions.remove(userId);
    }

    public void deliver(String userId, byte[] messageFrame) {
        WebSocketSession session = localSessions.get(userId);
        if (session != null && session.isOpen()) {
            session.sendBinary(messageFrame); // non-blocking with virtual threads
        }
    }
}

6. Deep Dive — Message Flow & Ordering

Send path (Alice → Bob, both online)

The numbered steps below trace a single message from Alice’s keypress to Bob’s screen. Pay close attention to where the ACKs (acknowledgements) travel — this is what drives the tick system.

sequenceDiagram
    participant Alice
    participant GW_A as Gateway A
    participant Kafka
    participant MsgSvc as Message Service
    participant GW_B as Gateway B
    participant Bob

    Alice->>GW_A: SEND {msg_id, to: Bob, ciphertext, seq}
    GW_A->>Kafka: publish(msg)
    GW_A-->>Alice: ACK — one grey tick (Sent)

    Kafka->>MsgSvc: consume(msg)
    MsgSvc->>GW_B: deliver(msg) — Bob online on GW_B
    GW_B->>Bob: PUSH message

    Bob-->>GW_B: DELIVERED ACK {msg_id}
    GW_B->>MsgSvc: forward delivered receipt
    MsgSvc->>GW_A: delivery_receipt(msg_id, bob)
    GW_A->>Alice: two grey ticks (Delivered)

    Bob-->>GW_B: READ ACK {conversation_id, up_to_msg_id}
    GW_B->>MsgSvc: forward read receipt
    MsgSvc->>GW_A: read_receipt(conversation_id)
    GW_A->>Alice: two blue ticks (Read)

Why does the single tick appear before the message reaches Bob? Because the single tick means the server received the message — not that Bob received it. This is an intentional design: Alice gets instant feedback that her message is safely in the server queue, even if Bob’s device is slow to respond. The two-tick system requires a round-trip to Bob’s device, which may take longer on a slow network.

Why does the delivered receipt go server→Gateway A→Alice instead of server→Alice directly? Because Alice may not be on the same gateway as Bob. The Message Service acts as an intermediary that knows which gateway each user is on. It routes the receipt through Alice’s gateway just like a normal message, but in the reverse direction.

Ordering guarantees

WhatsApp guarantees per-conversation ordering (not global ordering across all your conversations):

  • Each client assigns a monotonically increasing client sequence number per conversation (seq = 1, 2, 3, …). This number is included in every SEND frame.
  • Kafka preserves insertion order within a partition. All messages in a conversation are sent to the same Kafka partition, keyed by conversation_id. This means the Message Service always consumes messages in the order the sender sent them.
  • If the recipient’s client detects a gap in sequence numbers (seq jumps from 5 to 7, meaning 6 is missing), it sends a SYNC request to fetch the missing message from the Cassandra queue.

For group messages, the Group Service assigns a server-side sequence number at fan-out time, giving all members a consistent global order within the group conversation — even if two members sent messages at the same millisecond.
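
A sketch of the gap check implied by the last bullet above. Class and method names are illustrative; a real client would persist the per-conversation counter in its local database:

import java.util.HashMap;
import java.util.Map;

// Per-conversation sequence tracking on the recipient's client (illustrative).
public class ConversationSequenceTracker {

    private final Map<String, Long> lastSeenSeq = new HashMap<>(); // conversation_id → last seq

    /**
     * Returns true if the message is in order and can be applied; false if it is a
     * duplicate, or if a gap was detected and a SYNC request has been issued.
     */
    public boolean accept(String conversationId, long seq, SyncRequester sync) {
        long last = lastSeenSeq.getOrDefault(conversationId, 0L);
        if (seq == last + 1) {
            lastSeenSeq.put(conversationId, seq);    // contiguous: apply immediately
            return true;
        }
        if (seq <= last) {
            return false;                            // duplicate or replay: ignore
        }
        // Gap (e.g. last = 5, seq = 7): ask the server to replay last+1 .. seq-1.
        sync.requestRange(conversationId, last + 1, seq - 1);
        return false;                                // apply once the gap has been filled
    }

    /** Illustrative callback that sends a SYNC frame over the WebSocket. */
    public interface SyncRequester {
        void requestRange(String conversationId, long fromSeq, long toSeq);
    }
}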

Offline delivery

When Bob is offline at delivery time:

  1. The Message Service writes to Cassandra: (recipient_id=bob, conversation_id, server_ts, msg_id) → ciphertext_blob.
  2. Triggers a silent push (an invisible background notification that wakes the app without showing a banner) via FCM or APNs.
  3. On reconnect, Bob’s client sends SYNC {last_server_ts: T}. The Message Service does a Cassandra range scan for all rows where server_ts > T and recipient_id = bob, streams them to Bob’s gateway, which pushes them to Bob’s client.
  4. Bob’s client sends a DELIVERED ACK for each message. The Message Service deletes those rows from Cassandra.

TTL: 30 days. After TTL, messages are dropped from the queue and the sender receives a permanent delivery failure notification. WhatsApp does not store messages beyond this point — the server is a transient relay, not a long-term archive.
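
A sketch of the reconnect sync read and the post-ACK delete against the offline_messages table defined in Section 11, using the DataStax Java driver (4.x API); session wiring, error handling and paging are omitted:

import java.nio.ByteBuffer;
import java.time.Instant;
import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class OfflineQueueSync {

    private final CqlSession session;
    private final PreparedStatement selectPending;
    private final PreparedStatement deleteDelivered;

    public OfflineQueueSync(CqlSession session) {
        this.session = session;
        // Single-partition range scan: all pending rows for one recipient, oldest first.
        this.selectPending = session.prepare(
            "SELECT server_ts, msg_id, ciphertext FROM offline_messages " +
            "WHERE recipient_id = ? AND server_ts > ?");
        // Row-level delete, issued only after the client's DELIVERED ACK.
        this.deleteDelivered = session.prepare(
            "DELETE FROM offline_messages WHERE recipient_id = ? AND server_ts = ? AND msg_id = ?");
    }

    /** Stream every pending message since the client's last sync timestamp. */
    public void sync(UUID recipientId, Instant lastSyncTs, GatewayPush gateway) {
        for (Row row : session.execute(selectPending.bind(recipientId, lastSyncTs))) {
            ByteBuffer ciphertext = row.getByteBuffer("ciphertext");
            gateway.push(recipientId, row.getUuid("msg_id"), ciphertext); // down the WebSocket
        }
    }

    /** Called when the recipient's DELIVERED ACK arrives for one message. */
    public void onDeliveredAck(UUID recipientId, Instant serverTs, UUID msgId) {
        session.execute(deleteDelivered.bind(recipientId, serverTs, msgId));
    }

    /** Illustrative gateway-facing interface. */
    public interface GatewayPush {
        void push(UUID recipientId, UUID msgId, ByteBuffer ciphertext);
    }
}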


7. Deep Dive — End-to-End Encryption (Signal Protocol)

WhatsApp adopted the Signal Protocol (developed by Open Whisper Systems) in 2016. The server routes encrypted blobs and never holds the keys to decrypt them. Even a subpoena to WhatsApp yields only metadata (who talked to whom, when), not message content.

Key types and their roles

| Key | Full Name | Purpose | Lifetime |
|---|---|---|---|
| IK | Identity Key (a long-term Curve25519 key pair — an elliptic curve designed for fast, secure key agreement) | Long-term device identity; proves you are who you say you are | Device lifetime |
| SPK | Signed PreKey | Medium-term key; signed by the IK to prove authenticity; used during session establishment | ~30 days, rotated monthly |
| OPK | One-Time PreKey | Single-use ephemeral key; prevents replay attacks (an attacker capturing a handshake cannot replay it later because the OPK is consumed) | Single session |
| EK | Ephemeral Key | Per-session DH (Diffie-Hellman — a mathematical operation that lets two parties compute a shared secret without ever transmitting that secret) ratchet key | Per message batch |

X3DH handshake — establishing a session with a new contact

X3DH stands for Extended Triple Diffie-Hellman. When Alice sends her first message to Bob (someone whose session she has never opened), she cannot just encrypt because she doesn’t share a secret with Bob yet. X3DH solves this without Alice and Bob needing to be online at the same time.

Alice fetches from Key Distribution Service:
  Bob's IK_B (Identity Key),
  SPK_B (Signed PreKey, with IK_B's signature),
  OPK_B (One-Time PreKey — consumed after this)

Alice generates her own ephemeral key pair: EK_A

Alice computes four DH operations:
  DH1 = DH(IK_A,  SPK_B)   — binds Alice's long-term identity (authenticates Alice)
  DH2 = DH(EK_A,  IK_B)    — binds Bob's long-term identity (authenticates Bob)
  DH3 = DH(EK_A,  SPK_B)   — fresh ephemeral randomness → forward secrecy
  DH4 = DH(EK_A,  OPK_B)   — one-time prekey → this exact handshake cannot be replayed

Master secret = KDF(DH1 || DH2 || DH3 || DH4)
  KDF = Key Derivation Function (HKDF-SHA256 in practice)
  || = concatenation

Why four DH operations? Each one contributes a different security property. Combining all four means an attacker would have to break all four independently to compromise the session — defence in depth at the cryptographic layer.

This gives forward secrecy: if Bob’s long-term IK is later compromised, past sessions (which used OPKs that are now deleted) cannot be decrypted.

Double Ratchet — per-message key evolution

After the X3DH handshake, every subsequent message advances two ratchets (think of a ratchet as a one-way click — each click derives a new key from the previous, but you cannot go backwards):

  1. Symmetric-Key Ratchet (KDF chain): each message derives a unique message key from the current chain key. The message key is used once then discarded — even if an attacker records all traffic and later steals the current chain key, past messages (which used already-discarded message keys) remain undecryptable.

  2. Diffie-Hellman Ratchet: every time Alice and Bob exchange new messages, they also exchange new ephemeral DH public keys. This rotates the root key, providing break-in recovery: if an attacker somehow steals the current session state, future messages (encrypted with the next DH ratchet step) become inaccessible to the attacker.

The practical result: every single message in a WhatsApp conversation uses a unique key, derived from a key that existed briefly and was then deleted.
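
A minimal sketch of the symmetric-key ratchet idea: the chain key is advanced with HMAC-SHA256 and each step yields a one-time message key. The 0x01/0x02 labels follow the Signal specification's chain-key derivation; everything else (class shape, error handling) is illustrative, not WhatsApp's actual implementation:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Symmetric-key ratchet: each step derives a one-time message key and the next chain key.
public class SymmetricRatchet {

    private byte[] chainKey;   // 32 bytes; overwritten on every step — there is no way back

    public SymmetricRatchet(byte[] initialChainKey) {
        this.chainKey = initialChainKey.clone();
    }

    /** Advance the ratchet once and return the message key for exactly one message. */
    public byte[] nextMessageKey() throws Exception {
        byte[] messageKey = hmac(chainKey, (byte) 0x01); // used once, then discarded by the caller
        chainKey = hmac(chainKey, (byte) 0x02);          // one-way step: the old chain key is gone
        return messageKey;
    }

    private static byte[] hmac(byte[] key, byte label) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return mac.doFinal(new byte[] { label });
    }
}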

Multi-device E2EE

Each linked device (up to 4 per account) has its own independent Identity Key and prekey bundle registered in the KDS. When Alice sends a message:

  • Alice’s primary device encrypts the plaintext once per recipient device — one ciphertext for Bob’s phone, one for Bob’s tablet.
  • Additionally, one ciphertext is encrypted for each of Alice’s own linked devices — so her tablet also shows the sent message.
  • For a message from Alice (2 devices) to Bob (3 devices): Alice's sending device fetches prekey bundles for the 4 other devices and generates 4 independent ciphertexts from the same plaintext (the sending device itself simply keeps the local copy).

This is the correct E2EE approach — the alternative (encrypting once and having devices share a key) would mean any one compromised device exposes all devices.


8. Deep Dive — Group Messaging

Sender Keys — why group encryption is different

In a 1-on-1 conversation, Alice encrypts once for Bob. In a group with N members, naively Alice would encrypt once per member: O(N) encryption operations per message. At 1,024 members sending several messages per hour each, this becomes prohibitive.

Sender Keys solve this. The sender (Alice) generates a symmetric Group Session Key — a shared AES key for this group session. She then:

  1. Encrypts the Group Session Key once per member using their pairwise Signal session (X3DH + Double Ratchet). This happens once, or whenever the key rotates.
  2. For every subsequent group message, encrypts the plaintext with the Group Session Key — a single fast AES operation, O(1) regardless of group size.
First message to group (N members):
  Encrypt Group Session Key N times (one per member) → O(N)

Every subsequent message:
  Encrypt plaintext with Group Session Key → O(1)

The Group Session Key rotates (a new key is generated and re-distributed) whenever:

  • A member leaves the group — the departing member must not be able to decrypt future messages (future secrecy for departed members: access is revoked by rotating the key).
  • A member is removed by an admin.

Key rotation requires another O(N) distribution round — the cost of maintaining group membership changes.
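
A sketch of step 2, the O(1) part: encrypting one message with the already-distributed group session key. AES-GCM is used here as an illustrative authenticated mode (the article does not specify the exact cipher mode for Sender Key payloads), and the per-member key wrapping of step 1 is only stubbed in a comment:

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// O(1) group message encryption with an already-distributed group session key (illustrative).
public class GroupMessageEncryptor {

    private static final SecureRandom RANDOM = new SecureRandom();

    /** Step 1 (once per rotation): generate the shared group session key.
     *  Its bytes would then be wrapped once per member via their pairwise Signal session. */
    public static SecretKey newGroupSessionKey() throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        return keyGen.generateKey();
    }

    /** Step 2 (every message): one symmetric encryption, independent of group size. */
    public static byte[] encrypt(SecretKey groupSessionKey, byte[] plaintext) throws Exception {
        byte[] iv = new byte[12];
        RANDOM.nextBytes(iv);                              // fresh nonce per message
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, groupSessionKey, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);

        byte[] framed = new byte[iv.length + ciphertext.length];   // prepend IV for the recipients
        System.arraycopy(iv, 0, framed, 0, iv.length);
        System.arraycopy(ciphertext, 0, framed, iv.length, ciphertext.length);
        return framed;
    }
}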

Fan-out for large groups (up to 1,024 members)

The Sender Key protocol handles the encryption side. The delivery side is handled by the Group Service fanning out one delivery task per member to Kafka:

1 group message → Group Service reads membership list (N members)
                → N delivery tasks published to Kafka
                  (partitioned by recipient_id, not by group_id)
                → Message Service handles each task independently
                → Each member's gateway delivers in parallel

Thundering herd mitigation: publishing all 1,024 tasks to the same Kafka partition would create a serialisation bottleneck (only one consumer processes a partition at a time). Instead, tasks are partitioned by recipient_id, distributing them across many partitions and consumer instances — fully parallel delivery.
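
A sketch of the fan-out publish with the standard Kafka Java client. Partitioning falls out of the record key, so keying each delivery task by recipient_id (rather than group_id) spreads the 1,024 tasks across partitions; the topic name is illustrative:

import java.util.List;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Group fan-out: one delivery task per member, keyed by recipient_id (illustrative topic name).
public class GroupFanOut {

    private static final String DELIVERY_TOPIC = "message-delivery-tasks";

    private final KafkaProducer<String, byte[]> producer;

    public GroupFanOut(KafkaProducer<String, byte[]> producer) {
        this.producer = producer;
    }

    public void fanOut(String groupId, List<String> memberIds, byte[] groupCiphertext) {
        for (String recipientId : memberIds) {
            // Key = recipient_id → the default partitioner spreads members across partitions,
            // so the 1,024 delivery tasks are consumed in parallel instead of serially.
            producer.send(new ProducerRecord<>(DELIVERY_TOPIC, recipientId, groupCiphertext));
        }
        producer.flush();   // ensure every task reached the brokers before ACKing the sender
    }
}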

Group metadata schema

| Table | Partition Key | Clustering Key | Value Columns |
|---|---|---|---|
| groups | group_id | — | name, avatar_url, created_at, owner_id |
| group_members | group_id | member_id | role (admin/member), joined_at |
| group_sender_keys | (group_id, sender_id, device_id) | recipient_device_id | sender key ciphertext |

The group_sender_keys table is the most write-amplified: every time Alice sends a group message for the first time (or after a key rotation), one row per (sender device, recipient device) pair is written.


9. Deep Dive — Media Handling

Upload flow — why the gateway never touches binary data

The Chat Gateway’s job is to hold hundreds of millions of persistent connections. If media bytes flowed through the gateway, a single user uploading a 50 MB video would consume 50 MB of gateway RAM for the duration of the upload — and with millions of concurrent uploads, this would exhaust gateway memory in seconds. Instead:

1. Alice selects a photo on her phone.

2. Alice's client encrypts the media blob locally:
   - AES-256-CBC for confidentiality
   - HMAC-SHA256 for integrity (ensures the blob was not tampered with in transit)
   - Unique IV (Initialisation Vector) and media key generated per upload.

3. Client calls Media Service → receives a pre-signed S3 PUT URL (valid 5 min).
   A pre-signed URL is a time-limited, single-object-scoped URL that grants
   write permission to one specific S3 key without exposing AWS credentials.

4. Client uploads the encrypted blob directly to object storage.
   The Chat Gateway is not involved in this step.

5. Client sends a text message containing:
   {
     media_url:          "https://cdn.whatsapp.net/.../blob.enc",
     encrypted_media_key: <AES key, itself encrypted for recipient>,
     media_sha256:        <integrity hash of the encrypted blob>,
     mime_type:          "image/jpeg",
     thumbnail_ciphertext: <tiny blurred preview, also encrypted>
   }

6. Bob's client downloads the blob from media_url, verifies the SHA256,
   decrypts with the media key, and renders the image.

The server stores an encrypted blob it cannot read. The media key inside the message is itself encrypted end-to-end — only Bob’s device can decrypt it.
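
A sketch of the client-side encryption in step 2 using only javax.crypto. The key sizes and encrypt-then-MAC order follow the description above; the exact on-the-wire framing of IV, ciphertext and MAC tag is an assumption:

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Client-side media encryption: AES-256-CBC for confidentiality, HMAC-SHA256 over IV || ciphertext.
public class MediaEncryptor {

    private static final SecureRandom RANDOM = new SecureRandom();

    public static byte[] encrypt(byte[] media, byte[] aesKey32, byte[] macKey32) throws Exception {
        byte[] iv = new byte[16];
        RANDOM.nextBytes(iv);                                     // fresh IV per upload

        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(aesKey32, "AES"), new IvParameterSpec(iv));
        byte[] ciphertext = cipher.doFinal(media);

        Mac hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(macKey32, "HmacSHA256"));
        hmac.update(iv);
        byte[] tag = hmac.doFinal(ciphertext);                    // MAC covers IV + ciphertext

        byte[] blob = new byte[iv.length + ciphertext.length + tag.length];
        System.arraycopy(iv, 0, blob, 0, iv.length);
        System.arraycopy(ciphertext, 0, blob, iv.length, ciphertext.length);
        System.arraycopy(tag, 0, blob, iv.length + ciphertext.length, tag.length);
        return blob;                                              // this is what gets PUT to S3
    }
}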

Content deduplication

Clients compute a SHA256 of the plaintext before encryption and send it alongside the upload; the Media Service maintains a content-addressed index (plaintext hash → S3 key). If the same photo was uploaded recently (e.g., many people sharing a viral meme), the server can issue a download URL for the existing blob without requiring a new upload — dramatically reducing storage and egress costs. The caveat: a recipient can only decrypt a reused blob if it arrives with the media key that encrypted it, so deduplication applies most cleanly to forwards and re-shares, where the original encrypted blob and its media key travel together inside a new end-to-end encrypted message.


10. Deep Dive — Presence & Typing Indicators

Online/Offline presence

Presence is implemented as a short-TTL key in Redis, refreshed by heartbeats:

Client connects → Gateway sets:
  Redis SETEX presence:{user_id} 90 "{gateway_id, connected_at}"

Client heartbeat (every 60 s) → Gateway refreshes:
  Redis EXPIRE presence:{user_id} 90

Client disconnects → Gateway deletes:
  Redis DEL presence:{user_id}
  (or the TTL expires naturally if the disconnect is unclean — e.g., phone battery dies)

When Alice’s presence key is created or expires, the Presence Service fans out a presence update to subscribers (Alice’s mutual contacts who have opted into presence visibility). This fanout is itself bounded by WhatsApp’s privacy model: “last seen” and online status are only shared with contacts, and can be restricted further in Settings.

Typing indicators

Typing indicators are the most ephemeral signals in the system — they have no durability requirement, no storage, and no retry:

  • When the user types the first character, the client sends a COMPOSING event.
  • After 5 seconds of no keystrokes, the client sends a PAUSED event.
  • If the network drops during this, the indicator simply disappears — no retry, no queue, no Cassandra write.
  • The COMPOSING event is forwarded in-band over the WebSocket path (sender's gateway → Message Service → recipient's gateway). The Presence Service is not involved, keeping the path as low-latency as possible — a separate service hop would add ~5 ms of unnecessary overhead.

11. Data Models

Offline message queue — Apache Cassandra

Cassandra is chosen over a relational database (like PostgreSQL) because it is append-optimised: writes are sequential (appended to a commit log, then a sorted string table — SSTable). At billions of messages per day, PostgreSQL’s B-tree index maintenance costs would create write bottlenecks. Cassandra’s wide-row data model also perfectly matches the access pattern: append messages for a user, then bulk-read-and-delete all of them on reconnect.

CREATE TABLE offline_messages (
    recipient_id     UUID,
    conversation_id  UUID,
    server_ts        TIMESTAMP,
    msg_id           UUID,
    sender_id        UUID,
    ciphertext       BLOB,
    msg_type         TINYINT,   -- 0=text, 1=media, 2=control (receipts, key updates)
    ttl              TIMESTAMP,
    PRIMARY KEY ((recipient_id), server_ts, msg_id)
) WITH CLUSTERING ORDER BY (server_ts ASC)
  AND default_time_to_live = 2592000;  -- 30 days in seconds

Why recipient_id as the partition key? All pending messages for one user live on the same Cassandra node (or a small set of replicas). When Bob reconnects and requests a sync, the Message Service issues a single range scan: SELECT * FROM offline_messages WHERE recipient_id = bob AND server_ts > last_sync_ts. This is one sequential disk read, not a scatter-gather across 100 nodes.

Why compound clustering on (server_ts, msg_id)? server_ts provides chronological ordering. msg_id (a UUID) breaks ties when two messages arrive in the same millisecond and ensures uniqueness.

Conversation index — Cassandra

CREATE TABLE conversations (
    user_id          UUID,
    conversation_id  UUID,
    counterpart_id   UUID,    -- other user_id (1-on-1) or group_id (group chat)
    is_group         BOOLEAN,
    last_msg_ts      TIMESTAMP,
    last_msg_preview BLOB,    -- encrypted snippet for local display (never decryptable server-side)
    unread_count     INT,
    PRIMARY KEY ((user_id), last_msg_ts, conversation_id)
) WITH CLUSTERING ORDER BY (last_msg_ts DESC);

Clustered by last_msg_ts DESC so the most recently active conversations are at the top of the Cassandra partition — matching the inbox view that users see on app open.

Public key storage — PostgreSQL

A relational database fits here because key material is looked up by exact (user_id, device_id) primary key — no range scans, no time-series patterns. PostgreSQL’s ACID (Atomicity, Consistency, Isolation, Durability) guarantees are important for key operations: you must not lose a prekey upload or double-consume a one-time prekey.

CREATE TABLE identity_keys (
    user_id      UUID,
    device_id    UUID,
    identity_key BYTEA  NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (user_id, device_id)
);

CREATE TABLE prekeys (
    user_id       UUID,
    device_id     UUID,
    prekey_id     INT,
    prekey_type   VARCHAR(10),  -- 'signed' or 'onetime'
    public_key    BYTEA,
    signature     BYTEA,        -- present only for signed prekeys (proves IK signed it)
    used          BOOLEAN DEFAULT FALSE,
    PRIMARY KEY (user_id, device_id, prekey_id)
);

-- Index to efficiently count remaining one-time prekeys per device
CREATE INDEX ON prekeys (user_id, device_id) WHERE prekey_type = 'onetime' AND used = FALSE;

The KDS alerts a client to upload new prekeys when this count falls below 10. If it reaches 0, the server falls back to the Signed PreKey for new session initiations — still secure, but with reduced forward secrecy for that specific handshake.
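
The "must not double-consume a one-time prekey" requirement maps naturally to a small transaction. A sketch with plain JDBC against the DDL above: SELECT ... FOR UPDATE SKIP LOCKED is the PostgreSQL idiom for handing each row to exactly one concurrent consumer (whether WhatsApp implements it this way is not stated here):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.UUID;

// Atomically hand out one unused one-time prekey so that two concurrent session setups
// can never receive the same OPK (PostgreSQL row locking via FOR UPDATE SKIP LOCKED).
public class OneTimePrekeyAllocator {

    public byte[] consume(Connection conn, UUID userId, UUID deviceId) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT prekey_id, public_key FROM prekeys " +
                "WHERE user_id = ? AND device_id = ? AND prekey_type = 'onetime' AND used = FALSE " +
                "ORDER BY prekey_id LIMIT 1 FOR UPDATE SKIP LOCKED")) {
            select.setObject(1, userId);
            select.setObject(2, deviceId);
            try (ResultSet rs = select.executeQuery()) {
                if (!rs.next()) {
                    conn.rollback();
                    return null;             // exhausted → caller falls back to the Signed PreKey
                }
                int prekeyId = rs.getInt("prekey_id");
                byte[] publicKey = rs.getBytes("public_key");
                try (PreparedStatement mark = conn.prepareStatement(
                        "UPDATE prekeys SET used = TRUE " +
                        "WHERE user_id = ? AND device_id = ? AND prekey_id = ?")) {
                    mark.setObject(1, userId);
                    mark.setObject(2, deviceId);
                    mark.setInt(3, prekeyId);
                    mark.executeUpdate();
                }
                conn.commit();               // consume exactly once
                return publicKey;
            }
        }
    }
}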


12. API Design

WhatsApp uses a custom binary framing protocol over TLS (originally XMPP (Extensible Messaging and Presence Protocol), now replaced by a compact binary format similar to Protocol Buffers). Below are the logical semantics expressed as WebSocket event types and REST-style endpoints.

WebSocket events (client → server)

| Event | Payload | Server Response |
|---|---|---|
| SEND | {msg_id, to, conversation_id, seq, ciphertext, msg_type} | ACK {msg_id, server_ts} — one grey tick |
| DELIVERED | {msg_id, from} | Forwarded to sender as receipt |
| READ | {conversation_id, up_to_msg_id} | Forwarded to sender as read receipt |
| COMPOSING | {conversation_id} | Forwarded to recipient in-band |
| SYNC | {last_server_ts} | Stream of PUSH events from offline queue |
| FETCH_KEYS | {user_id[]} | {user_id → prekey_bundle} per device |
| UPLOAD_PREKEYS | {signed_prekey, one_time_prekeys[]} | ACK |

Media REST endpoints

| Method | Path | Purpose |
|---|---|---|
| POST | /v1/media/upload | Returns {upload_url (pre-signed S3 PUT), media_id} |
| GET | /v1/media/{media_id} | Returns time-limited download URL (pre-signed S3 GET) |

Key Distribution Service REST endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | /v1/keys/{user_id} | Fetch prekey bundle for all of a user's devices |
| POST | /v1/keys/batch | Fetch bundles for a list of user_ids (used on group key distribution) |
| PUT | /v1/keys/self | Upload new signed prekey or batch of one-time prekeys |

13. Trade-offs & Alternatives

Server-side message storage vs. client-only vs. permanent storage

| Approach | Used by | Pro | Con |
|---|---|---|---|
| Transient server queue (current) | WhatsApp | Limited legal exposure; low storage cost; server is a relay not an archive | No native history sync across devices (relies on iCloud/Google Drive backups) |
| No server storage | Signal | Maximum privacy; server has nothing to hand over | Offline delivery requires push wake-up; messages lost if push fails and client never reconnects |
| Permanent server storage | Telegram (optional E2EE) | History sync across devices; messages recoverable | Server holds message history; legal and privacy exposure; storage costs at scale |

WhatsApp’s 30-day queue is a deliberate middle path: enough durability for typical offline periods (vacation, phone replacement) without becoming a permanent archive.

Fan-out on write vs. fan-out on read for groups

| Strategy | Write cost | Read cost | Best for |
|---|---|---|---|
| Fan-out on write (current) | O(N) delivery tasks per message | O(1) — message already in each member's queue | Groups ≤ 1,024 members |
| Fan-out on read | O(1) — one message stored | O(N) — each reader polls | Broadcast channels with millions of subscribers |

For very large groups or broadcast channels (WhatsApp’s Channels product), fan-out on read becomes more attractive: store one copy, let members pull. WhatsApp’s 1,024-member limit exists partly to keep fan-out on write tractable.

XMPP vs. custom binary protocol

Original WhatsApp used XMPP, an XML-based messaging protocol. Each XML tag adds bytes — a simple message could be ~1 KB on the wire. Moving to a custom binary framing (similar to protobuf length-prefixed frames) reduced per-message overhead from ~1 KB to ~100 bytes. At 100 B messages/day, that 10× reduction saves on the order of 100 TB/day of network bandwidth — several million dollars a year at typical cloud egress pricing.

WebSocket vs. QUIC

QUIC (the transport protocol underlying HTTP/3) provides two advantages over WebSocket-over-TLS:

  1. 0-RTT (Zero Round-Trip Time) resumption: reconnecting after a network change (WiFi → LTE) takes near-zero time instead of a full TCP + TLS handshake (~300 ms).
  2. No head-of-line blocking: TCP blocks all data behind a lost packet; QUIC streams are independent, so one lost packet only delays that stream.

WhatsApp has explored QUIC but WebSocket over TLS remains the primary protocol. The switch requires client + server updates and careful fallback handling for older OS versions that don’t support QUIC.


14. Failure Modes & Mitigations

| Failure | Impact | Mitigation |
|---|---|---|
| Gateway crash | All connections on that node drop | Clients reconnect with exponential backoff; routing registry TTLs expire within 90 s; Cassandra offline queue absorbs in-flight messages; Kafka consumer replays from the last committed offset |
| Kafka partition lag | Message delivery delays; visible as increasing grey-tick to double-tick delay | Per-partition consumer lag alerts at >10,000 messages; increase partition count and consumer instances; auto-rebalancing |
| Cassandra node failure | Offline queue reads degrade | RF=3 with QUORUM reads/writes; single-node failure is invisible. Multi-DC (Multi Data Centre) replication for geo-redundancy |
| KDS unavailable | Cannot open new E2EE sessions with contacts | Redis cache of recently fetched prekey bundles on gateways; existing ratchet sessions unaffected (no KDS needed after session init); client retries |
| One-time prekey exhaustion | Session initiation for Bob falls back to Signed PreKey | Still cryptographically secure; slightly reduced forward secrecy for that handshake. Client warned to upload prekeys on next connect |
| Media upload failure | Message stuck at “uploading” | Client retries PUT to pre-signed URL up to 3×; if URL expired (5 min TTL), client re-requests a new URL and retries |
| Push notification failure (FCM/APNs outage) | Offline users not woken up | Message remains in Cassandra for 30 days; user retrieves on next manual app open. No data loss, only delivery delay |
| DDoS / connection flood | Gateway resource exhaustion | Anycast routing absorbs geographically distributed attacks; per-IP connection cap at the load balancer; CAPTCHA on suspicious registration patterns; rate limiting on the SEND event |
| Routing registry (Redis) unavailable | Message Service cannot find recipient gateways | Message Service falls back to broadcasting to all gateways (expensive but functional); alternatively treats all users as offline and queues to Cassandra |

15. Monitoring & Observability

Key metrics (RED framework: Rate, Errors, Duration)

| Metric | What it measures | Alert threshold |
|---|---|---|
| P99 end-to-end message latency | Time from SEND to recipient's PUSH | > 500 ms |
| Kafka consumer lag (per partition) | Backlog of unprocessed messages | > 10,000 messages |
| Offline queue depth (Cassandra row count) | Volume of messages awaiting delivery | > 5 B rows |
| WebSocket connection count per gateway | Proximity to gateway capacity | > 90% of max |
| One-time prekey inventory per device | Risk of session initiation degradation | < 10 remaining |
| Media upload success rate | Upload pipeline health | < 99.5% |
| Delivery ACK rate within 5 s | Online delivery reliability | < 98% |
| Routing registry miss rate | % of deliveries that fall back to offline queue | > 10% (indicates broad connectivity issue) |

Distributed tracing

Every message carries a trace_id that propagates through: Client SEND → Kafka publish → Message Service consume → Gateway deliver → Client PUSH → DELIVERED ACK → Sender receipt

This full trace (implemented with OpenTelemetry, a vendor-neutral observability standard) allows engineers to pinpoint exactly where a specific message was delayed — whether in the Kafka consumer, the routing registry lookup, the gRPC call to the target gateway, or the WebSocket write itself.
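
A sketch of one hop of that trace, the gateway's WebSocket write, instrumented with the OpenTelemetry Java API. Span and attribute names are illustrative, and in a real deployment the trace context would arrive via propagation from the Kafka consumer rather than starting fresh here:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// One hop of the end-to-end message trace: the gateway's WebSocket write (illustrative names).
public class TracedDelivery {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("chat-gateway");

    public void deliver(String userId, String msgId, WebSocketSession session, byte[] frame) {
        Span span = tracer.spanBuilder("gateway.websocket_write")
                .setAttribute("messaging.message_id", msgId)
                .setAttribute("user.id", userId)
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            session.sendBinary(frame);          // the actual delivery being timed
        } catch (RuntimeException e) {
            span.recordException(e);            // failed writes show up on the same trace
            throw e;
        } finally {
            span.end();
        }
    }

    /** Same assumed thin wrapper as in the gateway snippet in Section 5. */
    public interface WebSocketSession {
        void sendBinary(byte[] frame);
    }
}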

Dashboards

  • Connection health: active WebSocket counts per gateway, reconnect rate, heartbeat miss rate, gateway memory utilisation.
  • Message pipeline: Kafka throughput (messages/sec), consumer group lag per partition, delivery success rate, grey-tick to double-tick latency distribution.
  • E2EE health: prekey fetch error rate, prekey exhaustion incidents, KDS cache hit ratio.
  • Media: upload success rate, CDN (Content Delivery Network) cache-hit ratio, median upload latency, pre-signed URL expiry failures.

16. Interview Signals

What separates a strong answer

| Dimension | Mid-level answer | Senior / Staff answer |
|---|---|---|
| Transport | “Use HTTP” or “Use long-polling” | Argues for WebSocket; explains why persistent TCP beats polling; mentions QUIC and 0-RTT trade-off; knows NAT keepalive constraints |
| Ordering | “Use timestamps” | Client sequence numbers per conversation; server sequence numbers for groups; explains why wall-clock timestamps are unreliable (clock skew, NTP drift) |
| Offline delivery | “Store messages in a database” | Ephemeral Cassandra queue keyed by recipient_id; TTL; DELIVERED ACK triggers delete; silent push to wake client; 30-day TTL policy |
| Encryption | “Use HTTPS” | X3DH handshake for session initiation; Double Ratchet for per-message forward secrecy and break-in recovery; Sender Keys for groups; explains why the server, which only routes encrypted blobs, never sees plaintext |
| Group scale | “Same as 1-on-1, just send to all members” | Sender Key protocol to avoid O(N) per-message encryption; fan-out via Kafka partitioned by recipient_id; membership cache invalidation on member changes |
| Multi-device | Often missed entirely | Per-device Identity Key; separate prekey bundle per device; message encrypted once per device; KDS must have prekeys for all linked devices |
| Receipts | “Server sends back an ACK” | Three-level receipt (sent / delivered / read); DELIVERED ACK comes from the recipient device (not the server); READ ACK comes when the user opens the conversation |

Common mistakes to avoid

  1. Using a relational database as the primary message store — B-tree index maintenance creates write bottlenecks at 100 B messages/day. Cassandra’s LSM (Log-Structured Merge) tree and wide-row model are purpose-built for append-heavy workloads.

  2. Storing decryptable messages on the server — breaks E2EE. The server should hold only ciphertext blobs it cannot decrypt. State this explicitly; interviewers at privacy-focused companies will probe this.

  3. Encrypting group messages once per member per message — O(N²) for active group threads (N members × N messages). Sender Keys bring this to O(N) for key distribution and O(1) per subsequent message.

  4. Conflating server ACK with delivery ACK — the single grey tick (server received) and double grey tick (recipient device received) are different signals with different latency profiles. The server ACK is synchronous; the device ACK requires a round-trip to the recipient.

  5. Single Kafka topic without partitioning — all messages on one partition is a serialisation bottleneck. Partition by conversation_id to preserve ordering within each conversation without global serialisation.

  6. Forgetting about multi-device — a single user can have 4 devices. Every delivery decision (online lookup, offline queue, push notification) must iterate over all devices. Every message encrypted for a contact must also be encrypted for the sender’s own linked devices.


17. Scaling Path

| Phase | Scale | What breaks first | Key change |
|---|---|---|---|
| MVP | < 10 K users | Nothing critical | Single WebSocket server, PostgreSQL for messages, Redis for presence |
| Growth | 10 K → 1 M users | Single WebSocket server CPU; PostgreSQL write throughput | Horizontal gateway scaling; routing registry (Redis); migrate messages to Cassandra |
| Scale | 1 M → 100 M users | Kafka partition count; Cassandra cluster size; KDS read throughput | Increase Kafka partitions; Cassandra multi-DC; KDS Redis cache tier |
| Planet | 100 M → 2 B users | Gateway RAM (500 M sockets); routing registry size; media egress cost | Gateway fleet expansion; regional Anycast; CDN for media; Sender Keys for all groups; QUIC exploration |

18. Summary

WhatsApp’s architecture is built around four core principles:

  1. Persistent connections over polling — WebSocket gateways hold hundreds of millions of concurrent sockets, routed via a Redis registry, enabling sub-200 ms delivery for online recipients.

  2. Ephemeral server storage — Cassandra holds undelivered messages transiently; on DELIVERED ACK the row is deleted, limiting legal exposure and storage cost.

  3. End-to-end encryption by default — X3DH bootstraps sessions (no prior shared secret needed); Double Ratchet ensures forward secrecy and break-in recovery per message; Sender Keys amortise group encryption from O(N) per-message to O(1).

  4. Decoupled media path — clients encrypt and upload media directly to object storage via pre-signed URLs, keeping the chat gateway message-only and avoiding bandwidth bottlenecks.

The delivery receipt system (Sent → Delivered → Read) is as architecturally interesting as the forward message path: it requires a reverse ACK flow from the recipient’s device, through the Message Service, back to the sender’s gateway — a fully bidirectional signalling system layered on top of the one-way message channel.

