1. Hook #
Instagram processes 100 million photo and video uploads every day, serves 4.2 billion likes, and delivers personalised feeds to 500 million daily users — all while keeping image loads under 200ms anywhere in the world. The engineering challenge is three-layered: a media processing pipeline that converts every raw upload into five optimised variants before the first follower ever sees it; a hybrid fan-out feed that handles both 400-follower personal accounts and 300-million-follower celebrities without write amplification blowing up; and an Explore page that must surface genuinely relevant content from a corpus of 50 billion posts to users who have never explicitly stated what they want. Each layer has a distinct bottleneck, and solving one often creates pressure on the others.
2. Problem Statement #
Functional Requirements #
- Users can upload photos and short videos (up to 90-second Reels).
- Users can follow/unfollow other users.
- Home feed shows posts from followed accounts in ranked reverse-chronological order.
- Stories: ephemeral 15-second clips/photos that auto-expire after 24 hours; poster can see who viewed.
- Explore page: personalised grid of posts from non-followed accounts based on interest graph.
- Hashtag search: query #tag, returns posts sorted by recency or engagement score.
- Users can like, comment, and save posts.
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Feed read latency (p99) | < 300ms |
| Photo load latency (p99) | < 200ms (CDN-served) |
| Upload availability | 99.95% |
| Feed availability | 99.99% |
| Story expiry precision | < 5s after 24h TTL |
| Scale | 500M DAU, 100M uploads/day |
Out of Scope #
- Direct Messages (Instagram DMs)
- Live streaming
- Ad targeting and auction
- Content moderation pipeline
3. Scale Estimation #
Assumptions:
- 500M DAU; average user views feed ~8×/day, uploads ~0.2 posts/day.
- Average followers: 300; celebrity threshold: 10,000 followers.
- Photo sizes after processing: thumbnail (150×150 ~10KB), medium (640px ~80KB), high-res (1080px ~300KB).
- 80% of uploads are photos; 20% are videos averaging 5MB after transcode.
- Stories: 500M per day, ~2MB average.
| Metric | Calculation | Result |
|---|---|---|
| Post writes/s | 100M / 86,400 | ~1,160/s |
| Feed reads/s | 500M × 8 / 86,400 | ~46,300/s |
| Photo media storage/day | 80M × (10 + 80 + 300) KB | ~31 TB/day |
| Video media storage/day | 20M × 5MB | ~100 TB/day |
| Total media ingress/day | 31 + 100 | ~131 TB/day |
| CDN egress | 46,300 reads × 20 images × 80KB avg | ~74 GB/s |
| Like writes/s | 4.2B / 86,400 | ~48,600/s |
| Fan-out Redis writes/s | 1,160 × 300 avg followers | ~348,000/s |
| Stories storage/day | 500M × 2MB | ~1 PB/day (raw; tiered to cold after 24h) |
Media storage and CDN egress dominate cost. At 131TB/day raw uploads, annual storage grows ~47 PB/year before deduplication and cold-tier archival. This is why Instagram uses aggressive transcoding (WebP for photos, H.265 for video) to cut delivery size by 30-50%.
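The envelope numbers above are worth sanity-checking; the short script below recomputes the table directly from the stated assumptions (units annotated inline):

```python
SECONDS_PER_DAY = 86_400

uploads_per_day = 100e6          # posts/day
dau = 500e6                      # daily active users
feed_views_per_user = 8

post_writes_per_s = uploads_per_day / SECONDS_PER_DAY
feed_reads_per_s = dau * feed_views_per_user / SECONDS_PER_DAY

photo_kb = 10 + 80 + 300                                     # thumb + medium + high-res
photo_tb_per_day = 0.8 * uploads_per_day * photo_kb / 1e9    # KB → TB
video_tb_per_day = 0.2 * uploads_per_day * 5 / 1e6           # 5 MB avg → TB

cdn_egress_gb_s = feed_reads_per_s * 20 * 80 / 1e6           # 20 images × 80 KB avg

print(f"post writes/s   ≈ {post_writes_per_s:,.0f}")   # ~1,160
print(f"feed reads/s    ≈ {feed_reads_per_s:,.0f}")    # ~46,300
print(f"photo TB/day    ≈ {photo_tb_per_day:.0f}")     # ~31
print(f"video TB/day    ≈ {video_tb_per_day:.0f}")     # ~100
print(f"CDN egress GB/s ≈ {cdn_egress_gb_s:.0f}")      # ~74
```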
4. High-Level Design #
The system decomposes into four independent planes: an upload plane (ingest → process → distribute media), a feed plane (fan-out on write + ranked assembly on read), an explore plane (ML candidate generation + re-ranking), and a stories plane (ephemeral ingest with TTL-based expiry).
flowchart TD
subgraph CL["Client Layer"]
APP["Mobile / Web App"]
end
subgraph AL["API Layer"]
GW["API Gateway\nAuth · Rate Limit · Routing"]
end
subgraph UP["Upload Plane"]
PS["Post Service\nValidate · Persist · Publish"]
S3R["S3 Raw Bucket\n(pre-signed upload URL)"]
MP["Media Processor\nResize · Transcode · WebP"]
end
subgraph FP["Feed Plane — Write"]
KF["Kafka\npost.created · story.created"]
FO["Fan-out Service\nWorker Pool (Java 21 vthreads)"]
end
subgraph RP["Feed Plane — Read"]
FS["Feed Service\nFetch · Merge · Rank"]
HY["Hydration Service\nBatch-fetch post objects"]
end
subgraph EX["Explore Plane"]
EXS["Explore Service"]
MLR["ML Ranker\nTwo-Tower + GBM"]
VDB["Vector DB\nEmbeddings (HNSW)"]
end
subgraph SL["Storage Layer"]
CAS["Cassandra\nPosts · Comments · Stories"]
RD["Redis Cluster\nTimeline Cache · Like Counters\nStory view sets"]
S3C["S3 Processed Bucket\n+ CDN (CloudFront)"]
ES["Elasticsearch\nHashtag Inverted Index"]
SGS["Social Graph\nRedis + MySQL"]
end
APP -->|"POST /post"| GW
APP -->|"GET /feed"| GW
APP -->|"GET /explore"| GW
APP -->|"GET /hashtag/:tag"| GW
GW --> PS
GW --> FS
GW --> EXS
PS -->|"1. persist metadata"| CAS
PS -->|"2. pre-signed URL → client uploads"| S3R
PS -->|"3. publish event"| KF
S3R -->|"S3 ObjectCreated event"| MP
MP -->|"thumb · medium · hq · webp"| S3C
KF --> FO
FO -->|"lookup followers"| SGS
FO -->|"ZADD post_id\nnormal users only"| RD
FS -->|"ZRANGE timeline"| RD
FS -->|"celebrity posts pull"| CAS
FS -->|"merged ID list"| HY
HY -->|"batch MGET"| CAS
HY -->|"media URLs served"| S3C
EXS -->|"ANN search"| VDB
EXS -->|"feature fetch"| CAS
EXS -->|"re-rank"| MLR
KF -->|"post.created → index"| ES
Component Reference #
| Component | Technology | Role | Key Design Decision | Failure Behaviour |
|---|---|---|---|---|
| API Gateway | Nginx / Envoy | Single entry point. Validates JWT, enforces per-user rate limits, routes to the correct downstream service, terminates TLS, strips internal headers. | Rate limiting is enforced here — internal services never see raw request bursts. Upload quota (e.g. 100 posts/hour) is checked here before the client even receives a pre-signed URL. | Stateless; horizontally scaled. Node failure is transparent behind the load balancer. |
| Post Service | Java / Go microservice | Validates caption (2,200 chars), hashtag count (≤ 30), and file MIME type. Assigns a TIMEUUID post_id. Writes post metadata to Cassandra. Calls S3 to generate a pre-signed upload URL (returned to client). Publishes a PostCreatedEvent to Kafka. Returns immediately — media processing is async. | The client uploads media directly to S3 — the Post Service never proxies binary data. This eliminates app-tier memory pressure and allows S3 to handle multi-part uploads and resumable transfers natively. | Cassandra write failure → 503 to client. Kafka publish failure → post exists in Cassandra but fan-out delayed; a reconciliation job replays from the Cassandra commit log. |
| Media Processor | Python workers (Pillow / FFmpeg) on EC2 Spot | Triggered by S3 ObjectCreated event via SQS. Downloads the raw upload, generates five variants: thumbnail (150px), medium (640px), high-res (1080px), 4K original (stored cold), and a WebP version of each for modern browsers. For videos: generates HLS segments at 360p/720p/1080p, extracts a poster frame. Writes processed variants to the S3 Processed bucket and primes the CDN edge. | Spot Instances cut processing cost by 70% vs on-demand. Workers are idempotent (output key is deterministic from post_id + variant), so interrupted jobs are safe to retry. WebP conversion reduces median photo payload by 35% vs JPEG, directly cutting CDN egress cost. | Worker failure → SQS message becomes visible again after the visibility timeout (30s). Max 3 retries; moved to a DLQ after the third failure — ops alert fires. Post is accessible but shows a processing placeholder image until the job completes. |
| Fan-out Service | Java 21 (virtual threads) worker pool | Consumes post.created events from Kafka. For each event, fetches the author's follower list from Social Graph Service. If the author is below the celebrity threshold (10,000 followers), writes the post_id into each follower's Redis timeline sorted set via a pipelined ZADD. Authors above the threshold are skipped — their posts are pulled at read time by Feed Service. | Same hybrid push/pull strategy as Twitter. Threshold is tunable without code change (config flag). Redis pipeline batching collapses per-shard writes: 5,000 followers across 30 Redis shards = 30 pipeline calls, not 5,000 round-trips. | Kafka offset committed only after successful Redis pipeline write. Worker crash replays the batch. ZADD NX makes replay idempotent. Consumer lag is the primary freshness health signal. |
| Feed Service | Java microservice | Handles all GET /feed requests. Fetches up to 300 post_ids from the user's Redis timeline sorted set. In parallel, fetches recent posts from celebrity accounts the user follows (querying Cassandra). Merges and re-sorts both lists. Applies a lightweight ranking model (engagement + recency score) to re-order the top-50 before passing to Hydration Service. | Instagram's feed is no longer purely reverse-chronological — a ranking overlay reorders content by predicted engagement. The ranking happens on the merged post ID list (not on full post objects) using precomputed engagement scores stored in Redis. This keeps ranking latency at ~5ms rather than running a full ML inference per feed request. | Redis miss (cold start) → fall back to DB reconstruction: query Social Graph for following list, then Cassandra for each followed author's recent posts. Expensive but rare. Pre-warm triggered by user.login Kafka event. |
| Hydration Service | Java microservice + Caffeine L1 cache | Takes an ordered list of post_ids and returns full post objects (caption, media URLs, like count, comment count, author display info). Batch MGET against a post object cache first; misses fall through to Cassandra. Enriches each post with author avatar and username from User Service (cached). Transforms S3 keys into signed CDN URLs. | Viral posts are fetched by millions of simultaneous feed loads. A local in-process Caffeine cache (50K entries, 60s TTL) catches hot post objects before they reach Redis or Cassandra. Singleflight pattern prevents thundering herd: only one Cassandra read fires per hot post_id regardless of concurrency. | Cassandra unavailable → return partial feed from cache only. Fail open (degraded feed) rather than a 503 — users see fewer posts, not an error. |
| Explore Service | Python ML service + Vector DB | Generates the personalised Explore grid. Step 1 (Candidate Generation): uses the user's interest embedding to run an ANN search against a Vector DB of post embeddings — returns ~500 candidate posts from non-followed accounts. Step 2 (Re-Ranking): a GBM model scores each candidate on predicted engagement (like, save, comment probability) using real-time features (recency, author engagement rate, user–category affinity). Returns top 50. | Post embeddings are computed offline by a separate embedding pipeline (daily batch + real-time updates for new posts via a Kafka consumer). The ANN search (HNSW) returns approximate neighbours in ~10ms — exact KNN over 50B posts would be infeasible. Explore results are cached per user for 15 minutes (Redis key: explore:{userId}) to avoid running the full ML pipeline on every Explore tab open. | Vector DB unavailable → fall back to trending posts by category (pre-computed hourly, served from Redis). Ranking model failure → serve candidates unranked. Explore is a non-critical path — degradation is tolerated. |
| Post Store (Cassandra) | Apache Cassandra 4.x | Canonical store for all post metadata. Partitioned by (author_id, bucket) (bucket = YYYYMM), clustered by post_id DESC (TIMEUUID). Enables efficient "get this author's last N posts" queries — used by Feed Service for celebrity pulls and by profile page loads. Stories stored in a separate table with a TTL of 86,400 seconds (24 hours). | Monthly bucket prevents hot partitions for prolific accounts. Story TTL is enforced by Cassandra natively — no separate cleanup job needed. Counter columns (like_count, comment_count) live in separate counter tables due to Cassandra's COUNTER type restriction. | Node failure → RF=3 LOCAL_QUORUM masks it transparently. Multi-DC replication for regional HA. Read latency p99 target: 10ms for single-partition reads. |
| Timeline Cache (Redis) | Redis Cluster (sorted sets) | Each user's home feed stored as timeline:{userId} — a sorted set of post_ids scored by creation timestamp. Capped at 500 entries. Like counters stored as Redis Strings with INCR. Engagement score cache stored as sorted set engagement:{postId} for the ranking overlay. Story view sets stored as story_viewers:{storyId} sorted sets (TTL: 26h). | Consolidated Redis usage: timeline, like counters, engagement scores, and story view tracking all live in the same cluster, separated by key prefix. This avoids running multiple Redis clusters with different SLOs. Memory budgeted at: 500 IDs × 8B × 500M users ≈ 2TB for timelines alone — requires aggressive TTL eviction for inactive users (7-day inactivity → TTL set, key evicted). | Node failure → Redis Cluster replica promotion (~seconds). Short miss storm absorbed by singleflight + async reconstruction. Eviction policy: volatile-lru, so only TTL-marked keys (inactive users) are evicted under memory pressure. |
| Media Store (S3 + CDN) | S3 + CloudFront / Fastly | All processed media variants stored in S3 with content-addressed keys: media/{post_id}/{variant}.webp. Immutable after write. CDN serves all media — clients never hit S3 directly. CDN TTL is indefinite for immutable media keys. Pre-signed CDN URLs (signed with a secret key, 1h expiry) prevent hotlink abuse and unauthorised access to private account media. | Content-addressed keys enable infinite CDN TTL — the key never changes after processing. Origin-shield layer (CloudFront regional cache) absorbs 90%+ of origin requests. 4K originals are stored in S3 Glacier Instant Retrieval — retrieved only for download requests or re-processing. | CDN node failure → CloudFront routes to alternate PoP. S3 unavailable → CDN serves cached variant for existing content; new uploads fail gracefully with client retry. Story media deleted from S3 after 48h (CDN TTL 24h + 24h buffer for propagation). |
| Hashtag Index (Elasticsearch) | Elasticsearch / OpenSearch | Inverted index: each hashtag maps to a list of post_ids with a composite score (recency + engagement). A Kafka consumer reads post.created events and indexes each hashtag extracted from the caption within seconds of upload. Queries: GET /hashtag/travel → Elasticsearch term query on hashtags field, sorted by score desc, paginated with search_after cursor. | Elasticsearch is used only for hashtag search and full-text caption search — not for feed assembly (which uses Redis). This separation prevents search query spikes from impacting feed latency. Index sharded by hashtag hash to distribute write load evenly. Popular hashtags (#love, #instagood) have 2B+ posts — queries are capped at 10K results with cursor pagination. | Elasticsearch node failure → replica shard takes over. Indexing lag (Kafka consumer backlog) means new posts may not appear in hashtag search for up to 30s — acceptable eventual consistency. ES cluster isolated from feed path so degradation is scoped to search only. |
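The singleflight behaviour referenced in the Hydration Service row (one origin read per hot key, no matter how many concurrent requests) can be sketched in a few lines. This is an illustrative Python sketch — class and method names are hypothetical, not Instagram's implementation, which would typically live in the Java tier (e.g. via Caffeine's loading cache, which collapses concurrent loads per key natively):

```python
import threading
from typing import Any, Callable, Dict, Tuple

class Singleflight:
    """Collapse concurrent loads of the same key into one loader call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # key -> (done event, one-slot result box shared by all waiters)
        self._inflight: Dict[str, Tuple[threading.Event, dict]] = {}

    def do(self, key: str, loader: Callable[[], Any]) -> Any:
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, box = entry
        if leader:
            try:
                box["value"] = loader()  # only one origin (Cassandra) read fires
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                done.set()               # wake every waiting follower
        else:
            done.wait()                  # piggyback on the leader's read
        return box.get("value")
```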
5. Deep Dive #
Media Upload Pipeline — Processing Before the First View #
The media pipeline is Instagram’s most operationally intensive subsystem, and the most Instagram-specific part of the architecture. Unlike Twitter (text-first), every Instagram post triggers a processing job before it can be served. Getting this wrong means either serving oversized originals to mobile clients (destroying CDN cost and load time) or blocking the user-facing write path on slow transcoding work.
The key principle is client-direct-to-S3 upload with async processing. The Post Service never touches binary data:
Client Post Service S3 Raw Media Processor S3 Processed
│ │ │ │ │
│──POST /post ──────────>│ │ │ │
│<──200 {post_id, │ │ │ │
│ upload_url} ──────│ │ │ │
│ │ │ │ │
│──PUT {upload_url} ─────────────────────> │ │ │
│<──200 (ETag) ──────────────────────────── │ │ │
│ │ │──ObjectCreated──> SQS ──────────────── >│
│ │ │ │ download + process │
│ │ │ │──write variants───> │
│                   │                 │                 │──prime CDN edge ───>   │

The pre-signed URL has a 15-minute expiry and is scoped to a single S3 key. S3 enforces the content-length limit on the server side, so the Post Service never needs to buffer or validate the upload byte-stream. Multi-part upload (for videos > 5MB) is handled entirely by the S3 SDK on the client — the Post Service only provides the initial URL.
The Media Processor runs on EC2 Spot Instances behind an SQS queue (not triggered by Lambda, because transcoding 90-second Reels at 60fps exceeds Lambda’s 15-minute limit). Each worker processes one upload atomically:
# Media Processor — Python worker (simplified)
import boto3, PIL.Image, subprocess, pathlib
VARIANTS = {
"thumb": (150, 150),
"medium": (640, None), # None = maintain aspect ratio
"hq": (1080, None),
"original": None, # stored cold in Glacier
}
def process_photo(post_id: str, raw_key: str):
raw_path = download_from_s3(RAW_BUCKET, raw_key)
img = PIL.Image.open(raw_path).convert("RGB")
for variant, dims in VARIANTS.items():
if dims is None:
upload_to_s3(
PROCESSED_BUCKET,
f"media/{post_id}/original.jpg",
raw_path,
storage_class="GLACIER_IR"
)
continue
w, h = dims
resized = img.resize(
(w, int(img.height * w / img.width)) if h is None else (w, h),
PIL.Image.LANCZOS
)
webp_path = f"/tmp/{post_id}_{variant}.webp"
resized.save(webp_path, "WEBP", quality=82, method=6)
s3_key = f"media/{post_id}/{variant}.webp"
upload_to_s3(PROCESSED_BUCKET, s3_key, webp_path)
        prime_cdn_edge(s3_key)  # warm-up fetch so edge caches hold the variant before first view
def process_video(post_id: str, raw_key: str):
raw_path = download_from_s3(RAW_BUCKET, raw_key)
for resolution in ["360p", "720p", "1080p"]:
hls_dir = transcode_to_hls(raw_path, resolution) # FFmpeg, H.265
upload_hls_segments(PROCESSED_BUCKET, f"media/{post_id}/{resolution}/", hls_dir)
    extract_poster_frame(raw_path, f"media/{post_id}/poster.webp")

The prime_cdn_edge call issues a CloudFront cache warm request immediately after upload. This ensures the first follower to load the photo hits the CDN edge, not the S3 origin. Without this, a viral post published to millions of followers simultaneously would saturate the S3 origin with cache-miss requests.
WebP conversion is the single highest-leverage optimisation. At Instagram’s scale, switching from JPEG to WebP at equivalent visual quality (SSIM 0.95) reduces median photo payload from ~200KB to ~130KB — a 35% reduction. At 46,300 feed reads/s each serving 20 photos, that’s roughly a 65 GB/s egress reduction that translates directly to CDN cost savings.
Feed Generation — Hybrid Fan-Out with Ranking Overlay #
Instagram’s feed is structurally identical to Twitter’s at the infrastructure layer (hybrid push/pull with a celebrity threshold), but adds a ranking step that Twitter’s original chronological feed didn’t have. The ranking challenge is doing this at 46,300 requests/second without running a full ML inference per request.
The solution is pre-computed engagement scores. A background ML pipeline runs every 5 minutes and writes a score for each post into Redis:
engagement_score:{post_id} → float (probability-weighted engagement score, 0-1)

The score combines recency decay, author engagement rate, category preference for the requesting user’s interest vector, and recent like/comment velocity. The Feed Service retrieves this score via a batch MGET alongside the post IDs, then applies a weighted merge with the timestamp score:
final_score = 0.6 × engagement_score + 0.4 × recency_score

The recency score is computed inline in the Feed Service from the timestamp embedded in the TIMEUUID — no external call needed. The merged score list is sorted in-memory (top 300 posts, O(N log N) trivial at N=300), and the top 50 are passed to Hydration.
This approach keeps the feed assembly latency at ~40ms total (Redis ZRANGE + batch MGET for scores + sort) versus ~200ms for a real-time ML inference call. The trade-off is that engagement scores are up to 5 minutes stale — a post that goes viral in the last 5 minutes may not immediately get its elevated score. This is an acceptable product compromise.
The celebrity pull works identically to Twitter’s: for each celebrity the user follows, the Feed Service queries Cassandra for their most recent posts and merges them into the timeline using a standard merge-sort step. The celebrity_following:{userId} Redis set (maintained asynchronously on follow/unfollow) makes the celebrity identity check O(1) rather than requiring a follower count lookup for every account in the following list.
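The merge-and-rank step above can be sketched in a few lines of Python. The 0.6/0.4 weights come from the formula above; the six-hour recency half-life is an illustrative assumption, and in production both ranking and merging happen in the Java Feed Service:

```python
import heapq
import time

W_ENGAGEMENT, W_RECENCY = 0.6, 0.4   # weights from the formula above
HALF_LIFE_S = 6 * 3600               # hypothetical recency half-life

def recency_score(created_at: float, now: float) -> float:
    """Exponential decay in (0, 1]; a brand-new post scores ~1.0."""
    return 0.5 ** (max(0.0, now - created_at) / HALF_LIFE_S)

def rank_feed(timeline, celebrity_posts, engagement, now=None, top_n=50):
    """timeline / celebrity_posts: iterables of (post_id, created_at_epoch).
    engagement: {post_id: score in [0, 1]} from the 5-minute batch pipeline."""
    now = now if now is not None else time.time()
    scored = [
        (W_ENGAGEMENT * engagement.get(pid, 0.0)
         + W_RECENCY * recency_score(ts, now), pid)
        for pid, ts in list(timeline) + list(celebrity_posts)
    ]
    # N ≤ ~330 post IDs, so an in-memory top-k selection is trivially cheap.
    return [pid for _, pid in heapq.nlargest(top_n, scored)]
```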
Explore Page — Candidate Generation and ML Re-Ranking #
The Explore page is architecturally distinct from the feed because it serves discovery: posts from accounts the user has never followed. This requires a fundamentally different candidate generation strategy — you cannot fan-out from unknown authors.
The pipeline has two stages:
Stage 1 — Candidate Generation (ANN Search): Every post gets an embedding vector (128 dimensions) computed by a vision model (ResNet + text encoder for caption). These embeddings are ingested into a Vector DB (Meta’s Faiss with HNSW index, or a managed service like Pinecone). When a user opens Explore, their interest vector (computed from their like/save/comment history, updated daily) is used as the query vector. An ANN search returns the 500 most semantically similar posts across the 50B post corpus in ~10ms.
Stage 2 — Re-Ranking (GBM Model): The 500 candidates are scored by a Gradient Boosted Machine model that incorporates:
- Post-level features: recency, author historical engagement rate, category
- User-level features: user–category affinity scores, device type (images vs Reels preference)
- Cross features: user interest vector × post embedding dot product (cosine similarity)
The GBM scores 500 candidates in ~20ms (pre-loaded model in-process). Top 50 are returned to the client as the Explore grid.
Caching: Explore results are cached per user for 15 minutes in Redis (explore:{userId}). This means the full ML pipeline runs at most once per 15 minutes per active user — at 500M DAU with 8 Explore opens/day, that’s ~46,300 ML pipeline runs/second at peak, reduced to ~1,550/second with the 15-minute cache. Without caching, the Vector DB and ML infrastructure would need to be 30× larger.
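The two-stage shape of the pipeline can be sketched as follows. This is a toy sketch: a brute-force cosine scan stands in for the HNSW ANN search, and a fixed weighted sum stands in for the GBM — the function names and feature weights are illustrative assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def candidate_generation(user_vec, post_embeddings, k=500):
    """Stage 1: nearest posts by embedding similarity.
    Exact scan here; production uses an approximate HNSW index."""
    ranked = sorted(post_embeddings.items(),
                    key=lambda kv: cosine(user_vec, kv[1]), reverse=True)
    return [pid for pid, _ in ranked[:k]]

def rerank(candidates, features, user_vec, post_embeddings, top_n=50):
    """Stage 2: stand-in for the GBM — weighted sum of a few features."""
    def score(pid):
        f = features[pid]
        return (0.5 * f["recency"]
                + 0.3 * f["author_engagement"]
                + 0.2 * cosine(user_vec, post_embeddings[pid]))
    return sorted(candidates, key=score, reverse=True)[:top_n]
```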
Hashtag Index — Real-Time Inverted Index at Petabyte Scale #
Instagram’s hashtag search must index ~100 million new posts per day across ~2 billion distinct hashtags, while keeping popular hashtag queries under 100ms.
The index is built on Elasticsearch with a custom document schema:
{
"post_id": "01HXYZ...",
"author_id": 12345,
"hashtags": ["travel", "photography", "italy"],
"caption": "Golden hour in Rome #travel #photography #italy",
"created_at": "2026-04-24T10:00:00Z",
"engagement_score": 0.82,
"is_private": false
}

A Kafka consumer (hashtag-indexer) reads post.created events and bulk-indexes documents into Elasticsearch with a 1-second refresh interval. At 1,160 posts/second average, this is ~70K documents/minute — well within the write capacity of a six-node Elasticsearch cluster.
The critical design decision is dual-sort support: queries must support both sort_by=recency (for the “Recent” tab) and sort_by=top (for the “Top” tab). The engagement_score field (precomputed by the background ML pipeline and updated via partial document updates) enables sort_by=top without re-ranking at query time. The created_at field handles recency sort.
Hot hashtag mitigation: #love has 2B+ posts. A naive match query on this term would attempt to score 2B documents. Mitigations:
- search_after cursor pagination caps result-set traversal.
- Index aliases by hashtag popularity tier: ultra-popular hashtags are indexed in a separate, smaller index with aggressive document eviction (only posts from the last 7 days are indexed for #love — older posts are too stale to surface usefully).
- Results for popular hashtags are cached in Redis for 30 seconds (hashtag_top:{tag}:{page}).
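A hedged sketch of the request body the query side might build for the dual-sort, cursor-paginated hashtag endpoint. Field names follow the document schema above; the post_id tiebreaker in the sort is an assumption (search_after needs a unique final sort key), and the function name is hypothetical:

```python
def hashtag_query(tag: str, sort_by: str = "top", after=None, page_size: int = 30):
    """Build an Elasticsearch request body for GET /hashtag/{tag}.
    sort_by="top" uses the precomputed engagement_score; "recency" uses
    created_at. `after` is the search_after cursor: the sort values of the
    last hit on the previous page."""
    sort = ([{"engagement_score": "desc"}, {"post_id": "desc"}]
            if sort_by == "top"
            else [{"created_at": "desc"}, {"post_id": "desc"}])
    body = {
        "query": {"bool": {
            "filter": [
                {"term": {"hashtags": tag}},     # exact term match, no relevance scoring
                {"term": {"is_private": False}},  # private posts never surface in search
            ]
        }},
        "sort": sort,
        "size": page_size,
    }
    if after is not None:
        body["search_after"] = after  # cursor pagination instead of deep from/size
    return body
```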
6. Data Model #
Post Table (Cassandra) #
| Column | Type | Notes |
|---|---|---|
| author_id | BIGINT | Partition key |
| bucket | TEXT | Partition key (YYYYMM) — prevents hot partitions for prolific creators |
| post_id | TIMEUUID | Clustering key DESC — reverse-chron scan without sort |
| caption | TEXT | Max 2,200 characters |
| hashtags | LIST<TEXT> | Extracted at write time; also indexed in Elasticsearch |
| media_keys | LIST<TEXT> | S3/CDN keys by variant: media/{post_id}/{variant}.webp |
| location_id | BIGINT | NULL if no location tag |
| is_private | BOOLEAN | Controls fan-out eligibility |
| is_deleted | BOOLEAN | Soft delete; content cleared on GDPR erasure |
| created_at | TIMESTAMP | Denormalised from post_id for display |
Like/Comment counters (separate Cassandra table):
| Column | Type | Notes |
|---|---|---|
| post_id | TIMEUUID | Partition key |
| like_count | COUNTER | Cassandra COUNTER type; commutative INCR |
| comment_count | COUNTER | Same |
| save_count | COUNTER | Same |
Story Table (Cassandra — with TTL) #
| Column | Type | Notes |
|---|---|---|
| author_id | BIGINT | Partition key |
| story_id | TIMEUUID | Clustering key DESC |
| media_key | TEXT | S3/CDN key |
| expires_at | TIMESTAMP | Set to created_at + 86400s |
| view_count | INT | Denormalised from Redis at story close |
Row-level TTL of 86,400 seconds is set on INSERT. Cassandra purges the row automatically — no cron job required.
Timeline Cache (Redis) #
Key: timeline:{userId}
Type: Sorted Set
Score: unix timestamp (milliseconds) — from TIMEUUID
Value: post_id
Cap: 500 entries (ZREMRANGEBYRANK after each ZADD)
TTL: 7 days for inactive users (EXPIRE reset on each feed read)
No TTL for active users (kept hot by fan-out writes)

Story View Tracking (Redis) #
Key: story_viewers:{storyId}
Type: Sorted Set
Score: view timestamp (ms)
Value: viewer_id
TTL: 26 hours (story 24h + 2h buffer for "who viewed" queries post-expiry)
Max: 5,000 entries — truncated for viral stories (approximated with HyperLogLog beyond threshold)

Social Graph (Redis + MySQL) #
Redis (fan-out hot path):
followers:{userId} → SET { followerId, ... }
following:{userId} → SET { followeeId, ... }
celebrity_following:{userId} → SET { celebId, ... } (followee follower_count > 10K)
MySQL (source of truth):
follows(
follower_id BIGINT NOT NULL,
followee_id BIGINT NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (follower_id, followee_id),
INDEX idx_followee (followee_id, follower_id)
)

7. Trade-offs #
Fan-Out on Write vs Read for Instagram #
| Approach | Pros | Cons | When to Use |
|---|---|---|---|
| Fan-out on Write (push) | O(1) reads; timeline always pre-built; predictable latency | Write amplification: 1 post → N Redis writes; celebrity accounts become O(100M) | Accounts with < 10K followers |
| Fan-out on Read (pull) | No write amplification; always fresh; handles celebrities naturally | O(F) reads per feed load; cold timelines expensive | Celebrities (> 10K followers); inactive users |
| Hybrid | Best of both for common case | Celebrity merge complexity; requires celebrity_following set maintenance | Always — this is the production answer |
Instagram vs Twitter: Instagram’s celebrity threshold can be lower than Twitter’s (some analyses suggest ~5,000) because Instagram users post less frequently (roughly once a day versus 20 tweets a day). Lower post frequency keeps each celebrity’s recent-post list short, so the read-time pull and merge is cheap — making it safe to lower the threshold and shift more high-follower accounts onto the pull path.
Synchronous vs Async Media Processing #
| Option | Pros | Cons |
|---|---|---|
| Sync (process before returning post_id) | Client gets processed media URLs immediately | POST /post latency includes transcoding (seconds for video) — unacceptable |
| Async (return post_id immediately, process in background) | POST /post is fast (~50ms); processing scales independently on Spot | Client must poll for “processing complete” status; feed may briefly show placeholder |
Conclusion: Async always. The placeholder UX (grey thumbnail → replaced by real image within seconds) is universally preferred over a 5-second blocking upload API.
Redis Sorted Set vs List for Timeline #
| Option | Pros | Cons |
|---|---|---|
| Sorted Set | O(log N) insert; natural timestamp-ordered merge with celebrity posts; supports ranking score overlay | ~64 bytes/entry vs ~24 for List; higher memory |
| List | O(1) prepend; lower memory per entry | Celebrity post merge requires full load + re-sort (O(N log N) per request) |
Conclusion: Sorted Set. The celebrity merge and ranking score overlay both require ordered set semantics that a List cannot efficiently provide.
Elasticsearch vs Cassandra for Hashtag Search #
| Option | Pros | Cons |
|---|---|---|
| Elasticsearch | Native inverted index; full-text search; relevance scoring; aggregations | Higher operational complexity; not suited for primary post storage |
| Cassandra | Already in the system; simple | No inverted index support; hashtag queries would require full table scans or a materialised view with extreme fan-out |
Conclusion: Elasticsearch for search, Cassandra for post storage. They serve different access patterns and should not be conflated.
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Media Processor (Spot eviction) | Worker terminated mid-transcode | Post exists in Cassandra but no processed media; feed shows placeholder | SQS visibility timeout re-queues the job. On-demand fallback worker pool handles DLQ messages within 5 minutes. |
| Fan-out Service crash | Mid-fan-out batch failure | Some followers see post immediately, others see it late | Kafka offset not committed → batch replays. ZADD NX makes replay idempotent. Lag alert fires at > 10K events. |
| Redis Timeline node failure | Cache miss storm on the affected shard | Timeline reads fall back to DB reconstruction for affected users | Redis Cluster promotes replica in seconds. Singleflight pattern collapses concurrent reconstruction calls per user. |
| CDN cache miss on viral post | Millions of requests hit S3 origin simultaneously | S3 throttling, increased latency | Origin-shield (regional CloudFront cache) absorbs traffic before it reaches S3. Media Processor primes CDN edge on upload — first follower load is always a CDN hit, not origin. |
| Elasticsearch indexing lag | Kafka consumer backlog | New posts missing from hashtag search for up to 30s | Acceptable eventual consistency for search. Feed path is completely independent of ES, so feed latency is unaffected. |
| Story TTL race condition | Story viewed 1 second after expiry | User sees expired story briefly | Cassandra TTL purges the row server-side; subsequent fetches return 404. CDN serves cached story media for up to 24h after CDN TTL expires — add Cache-Control: max-age=86400 on story media + CDN soft-purge on expiry. |
| Like counter hot key | Post going viral: 100K likes/second on one key | Redis INCR on single key saturates one shard | Shard the counter: likes:{post_id}:{shard} where shard = rand(0, 16). Aggregate on read with MGET + sum. Write throughput scales linearly with shard count. |
| Explore Vector DB unavailable | ANN search fails | Explore tab returns fallback (trending by category) | Pre-computed hourly trending grids per category stored in Redis serve as graceful degradation. Explore is non-critical; 5-minute stale fallback is acceptable. |
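The sharded like-counter mitigation in the table above can be sketched in a few lines. A plain dict stands in for Redis here (INCR becomes a dict increment, MGET becomes a loop); shard count and key format follow the table row:

```python
import random

SHARDS = 16
store = {}  # stands in for Redis: key -> int

def incr_like(post_id: str) -> None:
    """Write path: spread INCRs for a hot post across SHARDS keys so no
    single Redis shard absorbs the full 100K likes/second."""
    key = f"likes:{post_id}:{random.randrange(SHARDS)}"
    store[key] = store.get(key, 0) + 1

def get_likes(post_id: str) -> int:
    """Read path: MGET all shard keys and sum them."""
    return sum(store.get(f"likes:{post_id}:{s}", 0) for s in range(SHARDS))
```

Write throughput scales linearly with SHARDS at the cost of a slightly wider read.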
9. Security & Compliance #
AuthN/AuthZ: OAuth 2.0 with short-lived JWTs (15-min access token, 7-day refresh with rotation). API Gateway validates the JWT and injects userId into downstream request headers. Private account posts are fanned out only to approved followers: the Fan-out Service checks is_private and the approved_followers:{userId} set before writing to each follower’s timeline. Media for private accounts is served via signed CDN URLs (1-hour expiry, signed with a per-account HMAC key) — unapproved users cannot access media even with a direct URL.
Encryption: TLS 1.3 for all in-transit data (client to gateway, service to service via mutual TLS). S3 buckets encrypted with SSE-S3 (AES-256). Redis encrypted at rest via encrypted EBS. KMS-managed keys with automatic 90-day rotation. Media CDN URLs are signed to prevent hotlink abuse.
Input Validation: Caption sanitised server-side (XSS stripping, Unicode normalisation, null byte removal). Hashtag extraction uses a whitelist character set (letters, numbers, underscores — no control characters). Media MIME type validated against magic bytes (not file extension) before the pre-signed URL is issued. Image dimensions capped at 20,000px per side to prevent decompression bomb attacks (gigapixel PNGs that expand to GB in memory).
Rate Limiting: Upload: 100 posts/24h per user (sliding window, token bucket in Redis). Like: 500/hour (burst allowance). Feed reads: 200/hour per user. These are enforced at the API Gateway. Automated bot detection (beyond rate limits) uses a separate ML classifier on user-agent, request pattern, and content hashes.
GDPR Right to Erasure: User deletion triggers async propagation: post metadata in Cassandra is soft-deleted (is_deleted=true, caption cleared), media is deleted from S3 (CDN TTL expiry propagates within 24h), post IDs are removed from all follower timeline sorted sets via a reverse fan-out using Social Graph. Elasticsearch documents are deleted by author_id query. Vector DB embeddings are deleted by author_id. Full propagation SLA: 30 days (aligning with Kafka retention for event replay coverage).
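The reverse fan-out step can be sketched with in-memory stand-ins for the Social Graph follower sets and the per-follower timeline sorted sets (plain lists here; in production this would be a ZREM per follower timeline). All names are illustrative.

```python
# Stand-ins for the Social Graph and Redis timeline stores.
followers: dict[str, list[str]] = {"alice": ["bob", "carol"]}
timelines: dict[str, list[str]] = {
    "bob":   ["p1", "p9", "p3"],
    "carol": ["p2", "p9"],
}

def reverse_fanout_delete(author: str, author_post_ids: set[str]) -> int:
    """Remove a deleted author's post IDs from every follower's timeline.

    Mirrors the erasure path described above: walk the author's follower
    list from the Social Graph, then purge their post IDs from each
    follower's timeline. Returns the number of entries removed."""
    removed = 0
    for follower in followers.get(author, []):
        timeline = timelines[follower]
        kept = [pid for pid in timeline if pid not in author_post_ids]
        removed += len(timeline) - len(kept)
        timelines[follower] = kept
    return removed
```

In the real system this runs as an async job, so a celebrity deletion (millions of follower timelines) is throttled rather than executed inline with the API call.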
CSAM Detection: PhotoDNA hash matching runs on every uploaded image asynchronously (within 5 seconds of upload). The post is soft-held from fan-out until the CSAM scan returns clean. Videos are scanned on sampled frames rather than every frame. Matches are reported to NCMEC and the post/account is hard-deleted.
10. Observability #
RED Metrics #
| Metric | Alert Threshold |
|---|---|
| Feed read latency p99 | > 300ms → page |
| Feed read latency p50 | > 80ms → warn |
| Media upload success rate | < 99.5% → page |
| Media processing lag (SQS depth) | > 500 jobs → warn; > 5,000 → page |
| Fan-out Kafka consumer lag | > 10K events → warn; > 100K → page |
| Timeline cache hit rate | < 90% → warn; < 80% → page |
| Explore ML pipeline p99 | > 150ms → warn |
| Story expiry accuracy | > 5s lag → warn |
Saturation Metrics #
- Redis memory utilisation per shard: warn at 70%, page at 85%
- S3 processed bucket egress: warn at 80% of region quota
- Cassandra read latency p99 per node: warn at 20ms, page at 50ms
- Elasticsearch indexing rate: warn at 80% of cluster write capacity
- Media Processor SQS queue depth (per-worker): warn at 20 jobs queued per worker
Business Metrics #
- Posts per second (rolling 1-min window) — deviation > 25% from baseline triggers anomaly alert
- Feed impression rate per user per day (measures engagement health)
- Fan-out amplification ratio: `(fan-out Redis writes) / (posts created)` — should track average follower count; a sudden drop indicates fan-out workers falling behind
- Media processing success rate (photo vs video separately) — video failures spike during EC2 Spot eviction storms
- Explore click-through rate (tracks ML model quality — threshold: CTR drop > 5% over 1h → page ML oncall)
Tracing #
OpenTelemetry with tail-based sampling: 10% of normal requests, 100% of requests with any span > 200ms, 100% of errors. Key trace spans: post.create → s3.presign → kafka.publish, and separately: s3.objectcreated → media.download → media.resize → cdn.prime. Feed path: feed.read → redis.zrange → celebrity.pull → hydration.mget → cdn.url_sign. Jaeger for trace storage; Grafana dashboards for RED metrics and fan-out health.
11. Scaling Path #
Phase 1 — MVP (< 1K RPS reads) #
Single PostgreSQL DB. Media resized synchronously on upload in the API server process. No fan-out — timeline is a SELECT * FROM posts WHERE author_id = ANY($following_list) ORDER BY created_at DESC LIMIT 20. Simple, correct, fast enough at low scale.
What breaks first: DB CPU at ~5K timeline reads/s when following-list queries hit 200+ accounts each. Media resizing in-process causes POST /post to take 3-5 seconds for photos. App server OOMs on concurrent uploads.
Phase 2 — 10K RPS reads #
Move media processing to async workers (SQS + Python workers). Migrate posts to Cassandra. Add Redis for timeline caching with synchronous fan-out. Add read replicas for Social Graph MySQL.
What breaks first: Synchronous fan-out causes POST /post latency to spike for users with many followers. The first celebrity account (> 50K followers) degrades post-creation p99 to seconds.
Phase 3 — 100K RPS reads #
Async fan-out via Kafka. Celebrity threshold (10K followers → pull model). Redis Cluster. Multi-DC Cassandra replication. Separate Hydration Service tier. Elasticsearch for hashtag search (replaces Cassandra ALLOW FILTERING queries). WebP conversion in Media Processor.
What breaks first: Redis memory. 500M users × 500 IDs × 8 bytes ≈ 2TB. Aggressive TTL eviction for inactive users. Explore page at this scale is a simple trending grid — personalised ML not yet feasible.
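The hybrid read path introduced in this phase — merge the user's precomputed (push) timeline with freshly pulled celebrity posts — can be sketched as a k-way merge. Entries are `(timestamp, post_id)` pairs, each input list newest-first; the function name is illustrative.

```python
import heapq
from itertools import islice

def read_feed(precomputed: list[tuple[int, str]],
              celebrity_recent: list[list[tuple[int, str]]],
              limit: int = 20) -> list[str]:
    """Hybrid feed read: merge the precomputed timeline (push path) with
    recently pulled celebrity posts (pull path), newest first.

    heapq.merge does a lazy k-way merge, so we only materialise the
    first `limit` entries rather than sorting everything."""
    merged = heapq.merge(precomputed, *celebrity_recent,
                         key=lambda entry: entry[0], reverse=True)
    return [post_id for _, post_id in islice(merged, limit)]
```

This is why the celebrity threshold matters: the merge cost is proportional to the number of followed celebrities, which stays small, while regular-account posts cost nothing at read time.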
Phase 4 — 1M+ RPS reads #
Personalised Explore ML pipeline (Vector DB + GBM ranker). Tiered media storage (hot S3 → warm S3-IA → cold Glacier IR for originals). Geo-distributed fan-out workers. CDN origin-shield layer. Engagement score pre-computation for feed ranking overlay. Pre-signed CDN URLs for private account media. Story view tracking with HyperLogLog for viral stories (beyond 5,000 viewers). Predictive CDN pre-warming: ML predicts which posts will go viral within 1h of upload based on author engagement history.
12. Enterprise Considerations #
Brownfield Integration: Migrating from a monolith, use Strangler Fig — route POST /upload and GET /feed to new services while legacy handles everything else. Media can be backfilled from existing storage by re-running the Media Processor pipeline on raw originals. Timeline Redis can be bootstrapped from a one-time Cassandra scan (read each user’s following list, fetch last 500 posts from each followed author, build sorted sets). Run backfill during off-peak hours at throttled throughput.
Build vs Buy:
- Media Processing: Build the orchestration (SQS → worker → S3), buy the libraries (Pillow, FFmpeg, libvips). Do not build a transcoding engine.
- Fan-out: Build (domain-specific fan-out rules — celebrity threshold, private account gating, story fan-out TTL — are too specific for off-the-shelf).
- Hashtag Search: Buy Elasticsearch/OpenSearch (Confluent Cloud for managed Kafka to feed it). Do not build an inverted index.
- Vector DB: Buy (Pinecone, Weaviate, or Milvus) for Explore. Building HNSW at 50B scale is a team-year project.
- CDN: Cloudflare or CloudFront. Never build a CDN.
- Object Storage: S3/GCS. Never build this.
Multi-Tenancy (B2B variant): For a white-label social platform, isolate at the Redis key prefix and Cassandra keyspace level: timeline:{tenantId}:{userId}, separate Kafka topics per tenant tier. Fan-out workers use separate consumer groups per tenant to prevent noisy-neighbour fan-out from affecting other tenants’ freshness SLOs.
TCO Ballpark (100K RPS reads):
| Component | Config | Est. Monthly Cost |
|---|---|---|
| Redis Cluster (timelines + counters) | 6× r7g.4xlarge (122GB RAM each) | ~$12,000 |
| Cassandra cluster | 8× i4i.4xlarge (NVMe SSD) | ~$16,000 |
| Kafka (MSK) | 3 brokers m5.2xlarge | ~$3,000 |
| Media Processor (Spot) | 50× c7g.2xlarge (70% spot) | ~$5,000 |
| Elasticsearch | 6× r6g.2xlarge | ~$6,000 |
| S3 + CDN egress | 131TB/day storage + ~74GB/s peak egress | ~$40,000 |
| Total | | ~$82,000/mo |
CDN egress is the dominant cost line at scale — not compute. WebP conversion and CDN cache optimisation directly reduce this. At 1M+ RPS, CDN egress routinely exceeds all compute costs combined.
Conway’s Law: Upload, Feed, Explore, and Stories should each be owned by separate teams. The fan-out amplification ratio is a shared KPI owned jointly by the Upload and Feed teams — it’s the single number that best reflects the health of the write-read contract. The CDN origin hit rate is owned jointly by Media and Infrastructure — it’s the cost/performance lever that transcends both team boundaries.
13. Interview Tips #
- Start with the upload pipeline. Most interviewers expect you to jump straight to feed generation, but the media pipeline is what makes Instagram different from Twitter. Mention pre-signed S3 upload URL and async processing in your first 2 minutes — it immediately signals Instagram-specific depth.
- Challenge the synchronous upload assumption. If you sketch a design where the API server resizes photos before returning a 200, the interviewer will probe until you hit the latency problem. Get there first: “Photo resizing takes ~500ms, video transcoding minutes — this has to be async.” Then describe the placeholder UX.
- Fan-out is the same as Twitter — acknowledge it and move fast. Don’t spend 10 minutes re-explaining push/pull hybrid. Say “same hybrid fan-out as a Twitter feed, celebrity threshold at 10K” and move to Instagram’s unique challenges: media pipeline, ranking overlay, Explore, Stories, hashtag index.
- Explore is a two-stage pipeline. Many candidates say “recommendation system” without explaining how you generate candidates from a 50B-post corpus without scanning it all. The ANN search over post embeddings is the key insight. Name HNSW and approximate nearest neighbour — it signals you know the tradeoff between recall and latency.
- Story expiry is a Cassandra TTL question in disguise. The interview question “how do you expire stories after 24 hours?” is really asking whether you know about Cassandra’s native TTL, Redis key TTL, and scheduled job patterns. Cassandra TTL is the cleanest answer — no external cron job, no at-scale delete storms.
- Like counter hot key is the concurrency question. A post with 1M likes in 1 hour generates ~280 INCR operations/second on a single Redis key — that's fine. A post with 10M likes/hour is ~2,800/second — still fine. But a viral post during the Super Bowl could hit 1M/minute (~16,700/second) on a single key. Introduce counter sharding (`likes:{post_id}:{shard}`) before the interviewer asks.
- Vocabulary that signals fluency: fan-out amplification ratio, pre-signed upload URL, content-addressed media keys, HLS adaptive bitrate, interest embedding, ANN search, HNSW, engagement score pre-computation, singleflight pattern, Cassandra TTL for Stories, WebP egress optimisation.
14. Further Reading #
- “Scaling Instagram Infrastructure” (Instagram Engineering Blog) — Instagram’s original 2012 talk on moving from PostgreSQL to Cassandra, and their 2017 talk on deliberately keeping a single Python Django monolith to simplify operations.
- “Unicorn: A System for Searching the Social Graph” (Meta, VLDB 2019) — How Meta handles social graph queries at multi-billion-node scale; directly applicable to the Social Graph Service design.
- “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks” (Google, 2019) — The architecture behind modern image embedding models used for Explore candidate generation.
- “FAISS: A Library for Efficient Similarity Search” (Meta AI) — The ANN library underlying Instagram’s Explore candidate generation pipeline; covers HNSW and IVF index tradeoffs.
- Martin Kleppmann, “Designing Data-Intensive Applications” — Chapter 3 covers LSM trees (Cassandra storage engine); Chapter 11 covers Kafka stream processing for fan-out and hashtag indexing pipelines.
- “Scaling to 100M Users” (High Scalability blog) — Instagram’s early architectural decisions and why a small team was able to scale to 100M users with ~13 engineers.