1. Hook #
Instagram processes 100 million photo and video uploads every day, serves 4.2 billion likes, and delivers personalised feeds to 500 million daily users — all while keeping image loads under 200ms anywhere in the world. The engineering challenge is three-layered: a media processing pipeline that converts every raw upload into five optimised variants before the first follower ever sees it; a hybrid fan-out feed that handles both 400-follower personal accounts and 300-million-follower celebrities without write amplification blowing up; and an Explore page that must surface genuinely relevant content from a corpus of 50 billion posts to users who have never explicitly stated what they want. Each layer has a distinct bottleneck, and solving one often creates pressure on the others.
2. Problem Statement #
Functional Requirements #
- Users can upload photos and short videos (up to 90-second Reels).
- Users can follow/unfollow other users.
- Home feed shows posts from followed accounts in ranked reverse-chronological order.
- Stories: ephemeral 15-second clips/photos that auto-expire after 24 hours; poster can see who viewed.
- Explore page: personalised grid of posts from non-followed accounts based on interest graph.
- Hashtag search: query #tag, returns posts sorted by recency or engagement score.
- Users can like, comment, and save posts.
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Feed read latency (p99) | < 300ms |
| Photo load latency (p99) | < 200ms (CDN-served) |
| Upload availability | 99.95% |
| Feed availability | 99.99% |
| Story expiry precision | < 5s after 24h TTL |
| Scale | 500M DAU, 100M uploads/day |
Out of Scope #
- Direct Messages (Instagram DMs)
- Live streaming
- Ad targeting and auction
- Content moderation pipeline
3. Scale Estimation #
Assumptions:
- 500M DAU; average user views feed ~8×/day, uploads ~0.2 posts/day.
- Average followers: 300; celebrity threshold: 10,000 followers.
- Photo sizes after processing: thumbnail (150×150 ~10KB), medium (640px ~80KB), high-res (1080px ~300KB).
- 80% of uploads are photos; 20% are videos averaging 5MB after transcode.
- Stories: 500M per day, ~2MB average.
| Metric | Calculation | Result |
|---|---|---|
| Post writes/s | 100M / 86,400 | ~1,160/s |
| Feed reads/s | 500M × 8 / 86,400 | ~46,300/s |
| Photo media storage/day | 80M × (10 + 80 + 300) KB | ~31 TB/day |
| Video media storage/day | 20M × 5MB | ~100 TB/day |
| Total media ingress/day | 31 + 100 | ~131 TB/day |
| CDN egress | 46,300 reads × 20 images × 80KB avg | ~74 GB/s |
| Like writes/s | 4.2B / 86,400 | ~48,600/s |
| Fan-out Redis writes/s | 1,160 × 300 avg followers | ~348,000/s |
| Stories storage/day | 500M × 2MB | ~1 PB/day (raw; tiered to cold after 24h) |
Media storage and CDN egress dominate cost. At 131TB/day raw uploads, annual storage grows ~47 PB/year before deduplication and cold-tier archival. This is why Instagram uses aggressive transcoding (WebP for photos, H.265 for video) to cut delivery size by 30-50%.
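The envelope numbers above are worth sanity-checking; the short script below recomputes the table directly from the stated assumptions (units annotated inline):

```python
SECONDS_PER_DAY = 86_400

uploads_per_day = 100e6          # posts/day
dau = 500e6                      # daily active users
feed_views_per_user = 8

post_writes_per_s = uploads_per_day / SECONDS_PER_DAY
feed_reads_per_s = dau * feed_views_per_user / SECONDS_PER_DAY

photo_kb = 10 + 80 + 300                                     # thumb + medium + high-res
photo_tb_per_day = 0.8 * uploads_per_day * photo_kb / 1e9    # KB → TB
video_tb_per_day = 0.2 * uploads_per_day * 5 / 1e6           # 5 MB avg → TB

cdn_egress_gb_s = feed_reads_per_s * 20 * 80 / 1e6           # 20 images × 80 KB avg

print(f"post writes/s   ≈ {post_writes_per_s:,.0f}")   # ~1,160
print(f"feed reads/s    ≈ {feed_reads_per_s:,.0f}")    # ~46,300
print(f"photo TB/day    ≈ {photo_tb_per_day:.0f}")     # ~31
print(f"video TB/day    ≈ {video_tb_per_day:.0f}")     # ~100
print(f"CDN egress GB/s ≈ {cdn_egress_gb_s:.0f}")      # ~74
```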
4. High-Level Design #
The system decomposes into four independent planes: an upload plane (ingest → process → distribute media), a feed plane (fan-out on write + ranked assembly on read), an explore plane (ML candidate generation + re-ranking), and a stories plane (ephemeral ingest with TTL-based expiry).
flowchart TD
subgraph CL["Client Layer"]
APP["Mobile / Web App"]
end
subgraph AL["API Layer"]
GW["API Gateway\nAuth · Rate Limit · Routing"]
end
subgraph UP["Upload Plane"]
PS["Post Service\nValidate · Persist · Publish"]
S3R["S3 Raw Bucket\n(pre-signed upload URL)"]
MP["Media Processor\nResize · Transcode · WebP"]
end
subgraph FP["Feed Plane — Write"]
KF["Kafka\npost.created · story.created"]
FO["Fan-out Service\nWorker Pool (Java 21 vthreads)"]
end
subgraph RP["Feed Plane — Read"]
FS["Feed Service\nFetch · Merge · Rank"]
HY["Hydration Service\nBatch-fetch post objects"]
end
subgraph EX["Explore Plane"]
EXS["Explore Service"]
MLR["ML Ranker\nTwo-Tower + GBM"]
VDB["Vector DB\nEmbeddings (HNSW)"]
end
subgraph SL["Storage Layer"]
CAS["Cassandra\nPosts · Comments · Stories"]
RD["Redis Cluster\nTimeline Cache · Like Counters\nStory view sets"]
S3C["S3 Processed Bucket\n+ CDN (CloudFront)"]
ES["Elasticsearch\nHashtag Inverted Index"]
SGS["Social Graph\nRedis + MySQL"]
end
APP -->|"POST /post"| GW
APP -->|"GET /feed"| GW
APP -->|"GET /explore"| GW
APP -->|"GET /hashtag/:tag"| GW
GW --> PS
GW --> FS
GW --> EXS
PS -->|"1. persist metadata"| CAS
PS -->|"2. pre-signed URL → client uploads"| S3R
PS -->|"3. publish event"| KF
S3R -->|"S3 ObjectCreated event"| MP
MP -->|"thumb · medium · hq · webp"| S3C
KF --> FO
FO -->|"lookup followers"| SGS
FO -->|"ZADD post_id\nnormal users only"| RD
FS -->|"ZRANGE timeline"| RD
FS -->|"celebrity posts pull"| CAS
FS -->|"merged ID list"| HY
HY -->|"batch MGET"| CAS
HY -->|"media URLs served"| S3C
EXS -->|"ANN search"| VDB
EXS -->|"feature fetch"| CAS
EXS -->|"re-rank"| MLR
KF -->|"post.created → index"| ES
Component Reference #
| Component | Technology | Role | Key Design Decision | Failure Behaviour |
|---|---|---|---|---|
| API Gateway | Nginx / Envoy | Single entry point. Validates JWT, enforces per-user rate limits, routes to the correct downstream service, terminates TLS, strips internal headers. | Rate limiting is enforced here — internal services never see raw request bursts. Upload quota (e.g. 100 posts/hour) is checked here before the client even receives a pre-signed URL. | Stateless; horizontally scaled. Node failure is transparent behind the load balancer. |
| Post Service | Java / Go microservice | Validates caption (2,200 chars), hashtag count (≤ 30), and file MIME type. Assigns a TIMEUUID post_id. Writes post metadata to Cassandra. Calls S3 to generate a pre-signed upload URL (returned to client). Publishes a PostCreatedEvent to Kafka. Returns immediately — media processing is async. | The client uploads media directly to S3 — the Post Service never proxies binary data. This eliminates app-tier memory pressure and allows S3 to handle multi-part uploads and resumable transfers natively. | Cassandra write failure → 503 to client. Kafka publish failure → post exists in Cassandra but fan-out delayed; a reconciliation job replays from the Cassandra commit log. |
| Media Processor | Python workers (Pillow / FFmpeg) on EC2 Spot | Triggered by S3 ObjectCreated event via SQS. Downloads the raw upload, generates five variants: thumbnail (150px), medium (640px), high-res (1080px), 4K original (stored cold), and a WebP version of each for modern browsers. For videos: generates HLS segments at 360p/720p/1080p, extracts a poster frame. Writes processed variants to the S3 Processed bucket and primes the CDN edge. | Spot Instances cut processing cost by 70% vs on-demand. Workers are idempotent (output key is deterministic from post_id + variant), so interrupted jobs are safe to retry. WebP conversion reduces median photo payload by 35% vs JPEG, directly cutting CDN egress cost. | Worker failure → SQS message becomes visible again after the visibility timeout (30s). Max 3 retries; moved to a DLQ after the third failure — ops alert fires. Post is accessible but shows a processing placeholder image until the job completes. |
| Fan-out Service | Java 21 (virtual threads) worker pool | Consumes post.created events from Kafka. For each event, fetches the author's follower list from Social Graph Service. If the author is below the celebrity threshold (10,000 followers), writes the post_id into each follower's Redis timeline sorted set via a pipelined ZADD. Authors above the threshold are skipped — their posts are pulled at read time by Feed Service. | Same hybrid push/pull strategy as Twitter. Threshold is tunable without code change (config flag). Redis pipeline batching collapses per-shard writes: 5,000 followers across 30 Redis shards = 30 pipeline calls, not 5,000 round-trips. | Kafka offset committed only after successful Redis pipeline write. Worker crash replays the batch. ZADD NX makes replay idempotent. Consumer lag is the primary freshness health signal. |
| Feed Service | Java microservice | Handles all GET /feed requests. Fetches up to 300 post_ids from the user's Redis timeline sorted set. In parallel, fetches recent posts from celebrity accounts the user follows (querying Cassandra). Merges and re-sorts both lists. Applies a lightweight ranking model (engagement + recency score) to re-order the top-50 before passing to Hydration Service. | Instagram's feed is no longer purely reverse-chronological — a ranking overlay reorders content by predicted engagement. The ranking happens on the merged post ID list (not on full post objects) using precomputed engagement scores stored in Redis. This keeps ranking latency at ~5ms rather than running a full ML inference per feed request. | Redis miss (cold start) → fall back to DB reconstruction: query Social Graph for following list, then Cassandra for each followed author's recent posts. Expensive but rare. Pre-warm triggered by user.login Kafka event. |
| Hydration Service | Java microservice + Caffeine L1 cache | Takes an ordered list of post_ids and returns full post objects (caption, media URLs, like count, comment count, author display info). Batch MGET against a post object cache first; misses fall through to Cassandra. Enriches each post with author avatar and username from User Service (cached). Transforms S3 keys into signed CDN URLs. | Viral posts are fetched by millions of simultaneous feed loads. A local in-process Caffeine cache (50K entries, 60s TTL) catches hot post objects before they reach Redis or Cassandra. Singleflight pattern prevents thundering herd: only one Cassandra read fires per hot post_id regardless of concurrency. | Cassandra unavailable → return partial feed from cache only. Fail open (degraded feed) rather than a 503 — users see fewer posts, not an error. |
| Explore Service | Python ML service + Vector DB | Generates the personalised Explore grid. Step 1 (Candidate Generation): uses the user's interest embedding to run an ANN search against a Vector DB of post embeddings — returns ~500 candidate posts from non-followed accounts. Step 2 (Re-Ranking): a GBM model scores each candidate on predicted engagement (like, save, comment probability) using real-time features (recency, author engagement rate, user–category affinity). Returns top 50. | Post embeddings are computed offline by a separate embedding pipeline (daily batch + real-time updates for new posts via a Kafka consumer). The ANN search (HNSW) returns approximate neighbours in ~10ms — exact KNN over 50B posts would be infeasible. Explore results are cached per user for 15 minutes (Redis key: explore:{userId}) to avoid running the full ML pipeline on every Explore tab open. | Vector DB unavailable → fall back to trending posts by category (pre-computed hourly, served from Redis). Ranking model failure → serve candidates unranked. Explore is a non-critical path — degradation is tolerated. |
| Post Store (Cassandra) | Apache Cassandra 4.x | Canonical store for all post metadata. Partitioned by (author_id, bucket) (bucket = YYYYMM), clustered by post_id DESC (TIMEUUID). Enables efficient "get this author's last N posts" queries — used by Feed Service for celebrity pulls and by profile page loads. Stories stored in a separate table with a TTL of 86,400 seconds (24 hours). | Monthly bucket prevents hot partitions for prolific accounts. Story TTL is enforced by Cassandra natively — no separate cleanup job needed. Counter columns (like_count, comment_count) live in separate counter tables due to Cassandra's COUNTER type restriction. | Node failure → RF=3 LOCAL_QUORUM masks it transparently. Multi-DC replication for regional HA. Read latency p99 target: 10ms for single-partition reads. |
| Timeline Cache (Redis) | Redis Cluster (sorted sets) | Each user's home feed stored as timeline:{userId} — a sorted set of post_ids scored by creation timestamp. Capped at 500 entries. Like counters stored as Redis Strings with INCR. Engagement score cache stored as sorted set engagement:{postId} for the ranking overlay. Story view sets stored as story_viewers:{storyId} sorted sets (TTL: 26h). | Consolidated Redis usage: timeline, like counters, engagement scores, and story view tracking all live in the same cluster, separated by key prefix. This avoids running multiple Redis clusters with different SLOs. Memory budgeted at: 500 IDs × 8B × 500M users ≈ 2TB for timelines alone — requires aggressive TTL eviction for inactive users (7-day inactivity → TTL set, key evicted). | Node failure → Redis Cluster replica promotion (~seconds). Short miss storm absorbed by singleflight + async reconstruction. Eviction policy: volatile-lru, so only TTL-marked keys (inactive users) are evicted under memory pressure. |
| Media Store (S3 + CDN) | S3 + CloudFront / Fastly | All processed media variants stored in S3 with content-addressed keys: media/{post_id}/{variant}.webp. Immutable after write. CDN serves all media — clients never hit S3 directly. CDN TTL is indefinite for immutable media keys. Pre-signed CDN URLs (signed with a secret key, 1h expiry) prevent hotlink abuse and unauthorised access to private account media. | Content-addressed keys enable infinite CDN TTL — the key never changes after processing. Origin-shield layer (CloudFront regional cache) absorbs 90%+ of origin requests. 4K originals are stored in S3 Glacier Instant Retrieval — retrieved only for download requests or re-processing. | CDN node failure → CloudFront routes to alternate PoP. S3 unavailable → CDN serves cached variant for existing content; new uploads fail gracefully with client retry. Story media deleted from S3 after 48h (CDN TTL 24h + 24h buffer for propagation). |
| Hashtag Index (Elasticsearch) | Elasticsearch / OpenSearch | Inverted index: each hashtag maps to a list of post_ids with a composite score (recency + engagement). A Kafka consumer reads post.created events and indexes each hashtag extracted from the caption within seconds of upload. Queries: GET /hashtag/travel → Elasticsearch term query on hashtags field, sorted by score desc, paginated with search_after cursor. | Elasticsearch is used only for hashtag search and full-text caption search — not for feed assembly (which uses Redis). This separation prevents search query spikes from impacting feed latency. Index sharded by hashtag hash to distribute write load evenly. Popular hashtags (#love, #instagood) have 2B+ posts — queries are capped at 10K results with cursor pagination. | Elasticsearch node failure → replica shard takes over. Indexing lag (Kafka consumer backlog) means new posts may not appear in hashtag search for up to 30s — acceptable eventual consistency. ES cluster isolated from feed path so degradation is scoped to search only. |
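The singleflight behaviour referenced in the Hydration Service row (one origin read per hot key, no matter how many concurrent requests) can be sketched in a few lines. This is an illustrative Python sketch — class and method names are hypothetical, not Instagram's implementation, which would typically live in the Java tier (e.g. via Caffeine's loading cache, which collapses concurrent loads per key natively):

```python
import threading
from typing import Any, Callable, Dict, Tuple

class Singleflight:
    """Collapse concurrent loads of the same key into one loader call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # key -> (done event, one-slot result box shared by all waiters)
        self._inflight: Dict[str, Tuple[threading.Event, dict]] = {}

    def do(self, key: str, loader: Callable[[], Any]) -> Any:
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, box = entry
        if leader:
            try:
                box["value"] = loader()  # only one origin (Cassandra) read fires
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                done.set()               # wake every waiting follower
        else:
            done.wait()                  # piggyback on the leader's read
        return box.get("value")
```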
5. Deep Dive #
Media Upload Pipeline — Processing Before the First View #
The media pipeline is Instagram’s most operationally intensive subsystem, and the most Instagram-specific part of the architecture. Unlike Twitter (text-first), every Instagram post triggers a processing job before it can be served. Getting this wrong means either serving oversized originals to mobile clients (destroying CDN cost and load time) or blocking the user-facing write path on slow transcoding work.
The key principle is client-direct-to-S3 upload with async processing. The Post Service never touches binary data:
Client Post Service S3 Raw Media Processor S3 Processed
│ │ │ │ │
│──POST /post ──────────>│ │ │ │
│<──200 {post_id, │ │ │ │
│ upload_url} ──────│ │ │ │
│ │ │ │ │
│──PUT {upload_url} ─────────────────────> │ │ │
│<──200 (ETag) ──────────────────────────── │ │ │
│ │ │──ObjectCreated──> SQS ──────────────── >│
│ │ │ │ download + process │
│ │ │ │──write variants───> │
│                   │                 │                 │──prime CDN edge ───>   │

The pre-signed URL has a 15-minute expiry and is scoped to a single S3 key. S3 enforces the content-length limit on the server side, so the Post Service never needs to buffer or validate the upload byte-stream. Multi-part upload (for videos > 5MB) is handled entirely by the S3 SDK on the client — the Post Service only provides the initial URL.
The Media Processor runs on EC2 Spot Instances behind an SQS queue (not triggered by Lambda, because transcoding 90-second Reels at 60fps exceeds Lambda’s 15-minute limit). Each worker processes one upload atomically:
# Media Processor — Python worker (simplified)
import boto3, PIL.Image, subprocess, pathlib
VARIANTS = {
"thumb": (150, 150),
"medium": (640, None), # None = maintain aspect ratio
"hq": (1080, None),
"original": None, # stored cold in Glacier
}
def process_photo(post_id: str, raw_key: str):
raw_path = download_from_s3(RAW_BUCKET, raw_key)
img = PIL.Image.open(raw_path).convert("RGB")
for variant, dims in VARIANTS.items():
if dims is None:
upload_to_s3(
PROCESSED_BUCKET,
f"media/{post_id}/original.jpg",
raw_path,
storage_class="GLACIER_IR"
)
continue
w, h = dims
resized = img.resize(
(w, int(img.height * w / img.width)) if h is None else (w, h),
PIL.Image.LANCZOS
)
webp_path = f"/tmp/{post_id}_{variant}.webp"
resized.save(webp_path, "WEBP", quality=82, method=6)
s3_key = f"media/{post_id}/{variant}.webp"
upload_to_s3(PROCESSED_BUCKET, s3_key, webp_path)
        prime_cdn_edge(s3_key)  # warm-up fetch so edge caches hold the variant before first view
def process_video(post_id: str, raw_key: str):
raw_path = download_from_s3(RAW_BUCKET, raw_key)
for resolution in ["360p", "720p", "1080p"]:
hls_dir = transcode_to_hls(raw_path, resolution) # FFmpeg, H.265
upload_hls_segments(PROCESSED_BUCKET, f"media/{post_id}/{resolution}/", hls_dir)
    extract_poster_frame(raw_path, f"media/{post_id}/poster.webp")

The prime_cdn_edge call issues a CloudFront cache warm request immediately after upload. This ensures the first follower to load the photo hits the CDN edge, not the S3 origin. Without this, a viral post published to millions of followers simultaneously would saturate the S3 origin with cache-miss requests.
WebP conversion is the single highest-leverage optimisation. At Instagram’s scale, switching from JPEG to WebP at equivalent visual quality (SSIM 0.95) reduces median photo payload from ~200KB to ~130KB — a 35% reduction. At 46,300 feed reads/s each serving 20 photos, that’s roughly a 65 GB/s egress reduction that translates directly to CDN cost savings.
Feed Generation — Hybrid Fan-Out with Ranking Overlay #
Instagram’s feed is structurally identical to Twitter’s at the infrastructure layer (hybrid push/pull with a celebrity threshold), but adds a ranking step that Twitter’s original chronological feed didn’t have. The ranking challenge is doing this at 46,300 requests/second without running a full ML inference per request.
The solution is pre-computed engagement scores. A background ML pipeline runs every 5 minutes and writes a score for each post into Redis:
engagement_score:{post_id} → float (probability-weighted engagement score, 0-1)

The score combines recency decay, author engagement rate, category preference for the requesting user’s interest vector, and recent like/comment velocity. The Feed Service retrieves this score via a batch MGET alongside the post IDs, then applies a weighted merge with the timestamp score:
final_score = 0.6 × engagement_score + 0.4 × recency_score

The recency score is computed inline in the Feed Service from the timestamp embedded in the TIMEUUID — no external call needed. The merged score list is sorted in-memory (top 300 posts, O(N log N) trivial at N=300), and the top 50 are passed to Hydration.
This approach keeps the feed assembly latency at ~40ms total (Redis ZRANGE + batch MGET for scores + sort) versus ~200ms for a real-time ML inference call. The trade-off is that engagement scores are up to 5 minutes stale — a post that goes viral in the last 5 minutes may not immediately get its elevated score. This is an acceptable product compromise.
The celebrity pull works identically to Twitter’s: for each celebrity the user follows, the Feed Service queries Cassandra for their most recent posts and merges them into the timeline using a standard merge-sort step. The celebrity_following:{userId} Redis set (maintained asynchronously on follow/unfollow) makes the celebrity identity check O(1) rather than requiring a follower count lookup for every account in the following list.
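The merge-and-rank step above can be sketched in a few lines of Python. The 0.6/0.4 weights come from the formula above; the six-hour recency half-life is an illustrative assumption, and in production both ranking and merging happen in the Java Feed Service:

```python
import heapq
import time

W_ENGAGEMENT, W_RECENCY = 0.6, 0.4   # weights from the formula above
HALF_LIFE_S = 6 * 3600               # hypothetical recency half-life

def recency_score(created_at: float, now: float) -> float:
    """Exponential decay in (0, 1]; a brand-new post scores ~1.0."""
    return 0.5 ** (max(0.0, now - created_at) / HALF_LIFE_S)

def rank_feed(timeline, celebrity_posts, engagement, now=None, top_n=50):
    """timeline / celebrity_posts: iterables of (post_id, created_at_epoch).
    engagement: {post_id: score in [0, 1]} from the 5-minute batch pipeline."""
    now = now if now is not None else time.time()
    scored = [
        (W_ENGAGEMENT * engagement.get(pid, 0.0)
         + W_RECENCY * recency_score(ts, now), pid)
        for pid, ts in list(timeline) + list(celebrity_posts)
    ]
    # N ≤ ~330 post IDs, so an in-memory top-k selection is trivially cheap.
    return [pid for _, pid in heapq.nlargest(top_n, scored)]
```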
Explore Page — Candidate Generation and ML Re-Ranking #
The Explore page is architecturally distinct from the feed because it serves discovery: posts from accounts the user has never followed. This requires a fundamentally different candidate generation strategy — you cannot fan-out from unknown authors.
The pipeline has two stages:
Stage 1 — Candidate Generation (ANN Search): Every post gets an embedding vector (128 dimensions) computed by a vision model (ResNet + text encoder for caption). These embeddings are ingested into a Vector DB (Meta’s Faiss with HNSW index, or a managed service like Pinecone). When a user opens Explore, their interest vector (computed from their like/save/comment history, updated daily) is used as the query vector. An ANN search returns the 500 most semantically similar posts across the 50B post corpus in ~10ms.
Stage 2 — Re-Ranking (GBM Model): The 500 candidates are scored by a Gradient Boosted Machine model that incorporates:
- Post-level features: recency, author historical engagement rate, category
- User-level features: user–category affinity scores, device type (images vs Reels preference)
- Cross features: user interest vector × post embedding dot product (cosine similarity)
The GBM scores 500 candidates in ~20ms (pre-loaded model in-process). Top 50 are returned to the client as the Explore grid.
Caching: Explore results are cached per user for 15 minutes in Redis (explore:{userId}). This means the full ML pipeline runs at most once per 15 minutes per active user — at 500M DAU with 8 Explore opens/day, that’s ~46,300 ML pipeline runs/second at peak, reduced to ~1,550/second with the 15-minute cache. Without caching, the Vector DB and ML infrastructure would need to be 30× larger.
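The two-stage shape of the pipeline can be sketched as follows. This is a toy sketch: a brute-force cosine scan stands in for the HNSW ANN search, and a fixed weighted sum stands in for the GBM — the function names and feature weights are illustrative assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def candidate_generation(user_vec, post_embeddings, k=500):
    """Stage 1: nearest posts by embedding similarity.
    Exact scan here; production uses an approximate HNSW index."""
    ranked = sorted(post_embeddings.items(),
                    key=lambda kv: cosine(user_vec, kv[1]), reverse=True)
    return [pid for pid, _ in ranked[:k]]

def rerank(candidates, features, user_vec, post_embeddings, top_n=50):
    """Stage 2: stand-in for the GBM — weighted sum of a few features."""
    def score(pid):
        f = features[pid]
        return (0.5 * f["recency"]
                + 0.3 * f["author_engagement"]
                + 0.2 * cosine(user_vec, post_embeddings[pid]))
    return sorted(candidates, key=score, reverse=True)[:top_n]
```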
Hashtag Index — Real-Time Inverted Index at Petabyte Scale #
Instagram’s hashtag search must index ~100 million new posts per day across ~2 billion distinct hashtags, while keeping popular hashtag queries under 100ms.
The index is built on Elasticsearch with a custom document schema:
{
"post_id": "01HXYZ...",
"author_id": 12345,
"hashtags": ["travel", "photography", "italy"],
"caption": "Golden hour in Rome #travel #photography #italy",
"created_at": "2026-04-24T10:00:00Z",
"engagement_score": 0.82,
"is_private": false
}

A Kafka consumer (hashtag-indexer) reads post.created events and bulk-indexes documents into Elasticsearch with a 1-second refresh interval. At 1,160 posts/second average, this is ~70K documents/minute — well within the write capacity of a six-node Elasticsearch cluster.
The critical design decision is dual-sort support: queries must support both sort_by=recency (for the “Recent” tab) and sort_by=top (for the “Top” tab). The engagement_score field (precomputed by the background ML pipeline and updated via partial document updates) enables sort_by=top without re-ranking at query time. The created_at field handles recency sort.
Hot hashtag mitigation: #love has 2B+ posts. A naive match query on this term would attempt to score 2B documents. Mitigations:
- search_after cursor pagination caps result-set traversal.
- Index aliases by hashtag popularity tier: ultra-popular hashtags are indexed in a separate, smaller index with aggressive document eviction (only posts from the last 7 days are indexed for #love — older posts are too stale to surface usefully).
- Results for popular hashtags are cached in Redis for 30 seconds (hashtag_top:{tag}:{page}).
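A hedged sketch of the request body the query side might build for the dual-sort, cursor-paginated hashtag endpoint. Field names follow the document schema above; the post_id tiebreaker in the sort is an assumption (search_after needs a unique final sort key), and the function name is hypothetical:

```python
def hashtag_query(tag: str, sort_by: str = "top", after=None, page_size: int = 30):
    """Build an Elasticsearch request body for GET /hashtag/{tag}.
    sort_by="top" uses the precomputed engagement_score; "recency" uses
    created_at. `after` is the search_after cursor: the sort values of the
    last hit on the previous page."""
    sort = ([{"engagement_score": "desc"}, {"post_id": "desc"}]
            if sort_by == "top"
            else [{"created_at": "desc"}, {"post_id": "desc"}])
    body = {
        "query": {"bool": {
            "filter": [
                {"term": {"hashtags": tag}},     # exact term match, no relevance scoring
                {"term": {"is_private": False}},  # private posts never surface in search
            ]
        }},
        "sort": sort,
        "size": page_size,
    }
    if after is not None:
        body["search_after"] = after  # cursor pagination instead of deep from/size
    return body
```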
6. Data Model #
Post Table (Cassandra) #
| Column | Type | Notes |
|---|---|---|
| author_id | BIGINT | Partition key |
| bucket | TEXT | Partition key (YYYYMM) — prevents hot partitions for prolific creators |
| post_id | TIMEUUID | Clustering key DESC — reverse-chron scan without sort |
| caption | TEXT | Max 2,200 characters |
| hashtags | LIST<TEXT> | Extracted at write time; also indexed in Elasticsearch |
| media_keys | LIST<TEXT> | S3/CDN keys by variant: media/{post_id}/{variant}.webp |
| location_id | BIGINT | NULL if no location tag |
| is_private | BOOLEAN | Controls fan-out eligibility |
| is_deleted | BOOLEAN | Soft delete; content cleared on GDPR erasure |
| created_at | TIMESTAMP | Denormalised from post_id for display |
Like/Comment counters (separate Cassandra table):
| Column | Type | Notes |
|---|---|---|
| post_id | TIMEUUID | Partition key |
| like_count | COUNTER | Cassandra COUNTER type; commutative INCR |
| comment_count | COUNTER | Same |
| save_count | COUNTER | Same |
Story Table (Cassandra — with TTL) #
| Column | Type | Notes |
|---|---|---|
| author_id | BIGINT | Partition key |
| story_id | TIMEUUID | Clustering key DESC |
| media_key | TEXT | S3/CDN key |
| expires_at | TIMESTAMP | Set to created_at + 86400s |
| view_count | INT | Denormalised from Redis at story close |
Row-level TTL of 86,400 seconds is set on INSERT. Cassandra purges the row automatically — no cron job required.
Timeline Cache (Redis) #
Key: timeline:{userId}
Type: Sorted Set
Score: unix timestamp (milliseconds) — from TIMEUUID
Value: post_id
Cap: 500 entries (ZREMRANGEBYRANK after each ZADD)
TTL: 7 days for inactive users (EXPIRE reset on each feed read)
No TTL for active users (kept hot by fan-out writes)

Story View Tracking (Redis) #
Key: story_viewers:{storyId}
Type: Sorted Set
Score: view timestamp (ms)
Value: viewer_id
TTL: 26 hours (story 24h + 2h buffer for "who viewed" queries post-expiry)
Max: 5,000 entries — truncated for viral stories (approximated with HyperLogLog beyond threshold)

Social Graph (Redis + MySQL) #
Redis (fan-out hot path):
followers:{userId} → SET { followerId, ... }
following:{userId} → SET { followeeId, ... }
celebrity_following:{userId} → SET { celebId, ... } (followee follower_count > 10K)
MySQL (source of truth):
follows(
follower_id BIGINT NOT NULL,
followee_id BIGINT NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (follower_id, followee_id),
INDEX idx_followee (followee_id, follower_id)
)

7. Trade-offs #
Fan-Out on Write vs Read for Instagram #
| Approach | Pros | Cons | When to Use |
|---|---|---|---|
| Fan-out on Write (push) | O(1) reads; timeline always pre-built; predictable latency | Write amplification: 1 post → N Redis writes; celebrity accounts become O(100M) | Accounts with < 10K followers |
| Fan-out on Read (pull) | No write amplification; always fresh; handles celebrities naturally | O(F) reads per feed load; cold timelines expensive | Celebrities (> 10K followers); inactive users |
| Hybrid | Best of both for common case | Celebrity merge complexity; requires celebrity_following set maintenance | Always — this is the production answer |
Instagram vs Twitter: Instagram’s celebrity threshold can be lower than Twitter’s (some analyses suggest ~5,000) because Instagram users post less frequently (roughly once a day versus 20 tweets a day). Lower post frequency keeps each celebrity’s recent-post list short, so the read-time pull and merge is cheap — making it safe to lower the threshold and shift more high-follower accounts onto the pull path.
Synchronous vs Async Media Processing #
| Option | Pros | Cons |
|---|---|---|
| Sync (process before returning post_id) | Client gets processed media URLs immediately | POST /post latency includes transcoding (seconds for video) — unacceptable |
| Async (return post_id immediately, process in background) | POST /post is fast (~50ms); processing scales independently on Spot | Client must poll for “processing complete” status; feed may briefly show placeholder |
Conclusion: Async always. The placeholder UX (grey thumbnail → replaced by real image within seconds) is universally preferred over a 5-second blocking upload API.
Redis Sorted Set vs List for Timeline #
| Option | Pros | Cons |
|---|---|---|
| Sorted Set | O(log N) insert; natural timestamp-ordered merge with celebrity posts; supports ranking score overlay | ~64 bytes/entry vs ~24 for List; higher memory |
| List | O(1) prepend; lower memory per entry | Celebrity post merge requires full load + re-sort (O(N log N) per request) |
Conclusion: Sorted Set. The celebrity merge and ranking score overlay both require ordered set semantics that a List cannot efficiently provide.
Elasticsearch vs Cassandra for Hashtag Search #
| Option | Pros | Cons |
|---|---|---|
| Elasticsearch | Native inverted index; full-text search; relevance scoring; aggregations | Higher operational complexity; not suited for primary post storage |
| Cassandra | Already in the system; simple | No inverted index support; hashtag queries would require full table scans or a materialised view with extreme fan-out |
Conclusion: Elasticsearch for search, Cassandra for post storage. They serve different access patterns and should not be conflated.
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Media Processor (Spot eviction) | Worker terminated mid-transcode | Post exists in Cassandra but no processed media; feed shows placeholder | SQS visibility timeout re-queues the job. On-demand fallback worker pool handles DLQ messages within 5 minutes. |
| Fan-out Service crash | Mid-fan-out batch failure | Some followers see post immediately, others see it late | Kafka offset not committed → batch replays. ZADD NX makes replay idempotent. Lag alert fires at > 10K events. |
| Redis Timeline node failure | Cache miss storm on the affected shard | Timeline reads fall back to DB reconstruction for affected users | Redis Cluster promotes replica in seconds. Singleflight pattern collapses concurrent reconstruction calls per user. |
| CDN cache miss on viral post | Millions of requests hit S3 origin simultaneously | S3 throttling, increased latency | Origin-shield (regional CloudFront cache) absorbs traffic before it reaches S3. Media Processor primes CDN edge on upload — first follower load is always a CDN hit, not origin. |
| Elasticsearch indexing lag | Kafka consumer backlog | New posts missing from hashtag search for up to 30s | Acceptable eventual consistency for search. Feed path is completely independent of ES, so feed latency is unaffected. |
| Story TTL race condition | Story viewed 1 second after expiry | User sees expired story briefly | Cassandra TTL purges the row server-side; subsequent fetches return 404. CDN serves cached story media for up to 24h after CDN TTL expires — add Cache-Control: max-age=86400 on story media + CDN soft-purge on expiry. |
| Like counter hot key | Post going viral: 100K likes/second on one key | Redis INCR on single key saturates one shard | Shard the counter: likes:{post_id}:{shard} where shard = rand(0, 16). Aggregate on read with MGET + sum. Write throughput scales linearly with shard count. |
| Explore Vector DB unavailable | ANN search fails | Explore tab returns fallback (trending by category) | Pre-computed hourly trending grids per category stored in Redis serve as graceful degradation. Explore is non-critical; 5-minute stale fallback is acceptable. |
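The sharded like-counter mitigation in the table above can be sketched in a few lines. A plain dict stands in for Redis here (INCR becomes a dict increment, MGET becomes a loop); shard count and key format follow the table row:

```python
import random

SHARDS = 16
store = {}  # stands in for Redis: key -> int

def incr_like(post_id: str) -> None:
    """Write path: spread INCRs for a hot post across SHARDS keys so no
    single Redis shard absorbs the full 100K likes/second."""
    key = f"likes:{post_id}:{random.randrange(SHARDS)}"
    store[key] = store.get(key, 0) + 1

def get_likes(post_id: str) -> int:
    """Read path: MGET all shard keys and sum them."""
    return sum(store.get(f"likes:{post_id}:{s}", 0) for s in range(SHARDS))
```

Write throughput scales linearly with SHARDS at the cost of a slightly wider read.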
9. Security & Compliance #
AuthN/AuthZ: OAuth 2.0 with short-lived JWTs (15-min access token, 7-day refresh with rotation). API Gateway validates the JWT and injects userId into downstream request headers. Private account posts are fanned out only to approved followers: the Fan-out Service checks is_private and the approved_followers:{userId} set before writing to each follower’s timeline. Media for private accounts is served via signed CDN URLs (1-hour expiry, signed with a per-account HMAC key) — unapproved users cannot access media even with a direct URL.
Encryption: TLS 1.3 for all in-transit data (client to gateway, service to service via mutual TLS). S3 buckets encrypted with SSE-S3 (AES-256). Redis encrypted at rest via encrypted EBS. KMS-managed keys with automatic 90-day rotation. Media CDN URLs are signed to prevent hotlink abuse.
Input Validation: Caption sanitised server-side (XSS stripping, Unicode normalisation, null byte removal). Hashtag extraction uses a whitelist character set (letters, numbers, underscores — no control characters). Media MIME type validated against magic bytes (not file extension) before the pre-signed URL is issued. Image dimensions capped at 20,000px per side to prevent decompression bomb attacks (gigapixel PNGs that expand to GB in memory).
Rate Limiting: Upload: 100 posts/24h per user (sliding window, token bucket in Redis). Like: 500/hour (burst allowance). Feed reads: 200/hour per user. These are enforced at the API Gateway. Automated bot detection (beyond rate limits) uses a separate ML classifier on user-agent, request pattern, and content hashes.
GDPR Right to Erasure: User deletion triggers async propagation: post metadata in Cassandra is soft-deleted (is_deleted=true, caption cleared), media is deleted from S3 (CDN TTL expiry propagates within 24h), post IDs are removed from all follower timeline sorted sets via a reverse fan-out using Social Graph. Elasticsearch documents are deleted by author_id query. Vector DB embeddings are deleted by author_id. Full propagation SLA: 30 days (aligning with Kafka retention for event replay coverage).
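The reverse fan-out step can be sketched with in-memory stand-ins for the Social Graph follower sets and the per-follower timeline sorted sets (plain lists here; in production this would be a ZREM per follower timeline). All names are illustrative.

```python
# Stand-ins for the Social Graph and Redis timeline stores.
followers: dict[str, list[str]] = {"alice": ["bob", "carol"]}
timelines: dict[str, list[str]] = {
    "bob":   ["p1", "p9", "p3"],
    "carol": ["p2", "p9"],
}

def reverse_fanout_delete(author: str, author_post_ids: set[str]) -> int:
    """Remove a deleted author's post IDs from every follower's timeline.

    Mirrors the erasure path described above: walk the author's follower
    list from the Social Graph, then purge their post IDs from each
    follower's timeline. Returns the number of entries removed."""
    removed = 0
    for follower in followers.get(author, []):
        timeline = timelines[follower]
        kept = [pid for pid in timeline if pid not in author_post_ids]
        removed += len(timeline) - len(kept)
        timelines[follower] = kept
    return removed
```

In the real system this runs as an async job, so a celebrity deletion (millions of follower timelines) is throttled rather than executed inline with the API call.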
CSAM Detection: PhotoDNA hash matching runs on every uploaded image asynchronously (within 5 seconds of upload). The post is soft-held from fan-out until the CSAM scan returns clean. Videos are scanned on sampled frames rather than every frame. Matches are reported to NCMEC and the post/account is hard-deleted.
10. Observability #
RED Metrics #
| Metric | Alert Threshold |
|---|---|
| Feed read latency p99 | > 300ms → page |
| Feed read latency p50 | > 80ms → warn |
| Media upload success rate | < 99.5% → page |
| Media processing lag (SQS depth) | > 500 jobs → warn; > 5,000 → page |
| Fan-out Kafka consumer lag | > 10K events → warn; > 100K → page |
| Timeline cache hit rate | < 90% → warn; < 80% → page |
| Explore ML pipeline p99 | > 150ms → warn |
| Story expiry accuracy | > 5s lag → warn |
Saturation Metrics #
- Redis memory utilisation per shard: warn at 70%, page at 85%
- S3 processed bucket egress: warn at 80% of region quota
- Cassandra read latency p99 per node: warn at 20ms, page at 50ms
- Elasticsearch indexing rate: warn at 80% of cluster write capacity
- Media Processor SQS queue depth (per-worker): warn at 20 jobs queued per worker
Business Metrics #
- Posts per second (rolling 1-min window) — deviation > 25% from baseline triggers anomaly alert
- Feed impression rate per user per day (measures engagement health)
- Fan-out amplification ratio: `(fan-out Redis writes) / (posts created)` — should track average follower count; a sudden drop indicates fan-out workers falling behind
- Media processing success rate (photo vs video separately) — video failures spike during EC2 Spot eviction storms
- Explore click-through rate (tracks ML model quality — threshold: CTR drop > 5% over 1h → page ML oncall)
Tracing #
OpenTelemetry with tail-based sampling: 10% of normal requests, 100% of requests with any span > 200ms, 100% of errors. Key trace spans: post.create → s3.presign → kafka.publish, and separately: s3.objectcreated → media.download → media.resize → cdn.prime. Feed path: feed.read → redis.zrange → celebrity.pull → hydration.mget → cdn.url_sign. Jaeger for trace storage; Grafana dashboards for RED metrics and fan-out health.
11. Scaling Path #
Phase 1 — MVP (< 1K RPS reads) #
Single PostgreSQL DB. Media resized synchronously on upload in the API server process. No fan-out — timeline is a SELECT * FROM posts WHERE author_id = ANY($following_list) ORDER BY created_at DESC LIMIT 20. Simple, correct, fast enough at low scale.
What breaks first: DB CPU at ~5K timeline reads/s when following-list queries hit 200+ accounts each. Media resizing in-process causes POST /post to take 3-5 seconds for photos. App server OOMs on concurrent uploads.
Phase 2 — 10K RPS reads #
Move media processing to async workers (SQS + Python workers). Migrate posts to Cassandra. Add Redis for timeline caching with synchronous fan-out. Add read replicas for Social Graph MySQL.
What breaks first: Synchronous fan-out causes POST /post latency to spike for users with many followers. The first celebrity account (> 50K followers) degrades post-creation p99 to seconds.
Phase 3 — 100K RPS reads #
Async fan-out via Kafka. Celebrity threshold (10K followers → pull model). Redis Cluster. Multi-DC Cassandra replication. Separate Hydration Service tier. Elasticsearch for hashtag search (replaces Cassandra ALLOW FILTERING queries). WebP conversion in Media Processor.
What breaks first: Redis memory. 500M users × 500 IDs × 8 bytes ≈ 2TB. Aggressive TTL eviction for inactive users. Explore page at this scale is a simple trending grid — personalised ML not yet feasible.
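The hybrid read path introduced in this phase — merge the user's precomputed (push) timeline with freshly pulled celebrity posts — can be sketched as a k-way merge. Entries are `(timestamp, post_id)` pairs, each input list newest-first; the function name is illustrative.

```python
import heapq
from itertools import islice

def read_feed(precomputed: list[tuple[int, str]],
              celebrity_recent: list[list[tuple[int, str]]],
              limit: int = 20) -> list[str]:
    """Hybrid feed read: merge the precomputed timeline (push path) with
    recently pulled celebrity posts (pull path), newest first.

    heapq.merge does a lazy k-way merge, so we only materialise the
    first `limit` entries rather than sorting everything."""
    merged = heapq.merge(precomputed, *celebrity_recent,
                         key=lambda entry: entry[0], reverse=True)
    return [post_id for _, post_id in islice(merged, limit)]
```

This is why the celebrity threshold matters: the merge cost is proportional to the number of followed celebrities, which stays small, while regular-account posts cost nothing at read time.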
Phase 4 — 1M+ RPS reads #
Personalised Explore ML pipeline (Vector DB + GBM ranker). Tiered media storage (hot S3 → warm S3-IA → cold Glacier IR for originals). Geo-distributed fan-out workers. CDN origin-shield layer. Engagement score pre-computation for feed ranking overlay. Pre-signed CDN URLs for private account media. Story view tracking with HyperLogLog for viral stories (beyond 5,000 viewers). Predictive CDN pre-warming: ML predicts which posts will go viral within 1h of upload based on author engagement history.
12. Enterprise Considerations #
Brownfield Integration: Migrating from a monolith, use Strangler Fig — route POST /upload and GET /feed to new services while legacy handles everything else. Media can be backfilled from existing storage by re-running the Media Processor pipeline on raw originals. Timeline Redis can be bootstrapped from a one-time Cassandra scan (read each user’s following list, fetch last 500 posts from each followed author, build sorted sets). Run backfill during off-peak hours at throttled throughput.
Build vs Buy:
- Media Processing: Build the orchestration (SQS → worker → S3), buy the libraries (Pillow, FFmpeg, libvips). Do not build a transcoding engine.
- Fan-out: Build (domain-specific fan-out rules — celebrity threshold, private account gating, story fan-out TTL — are too specific for off-the-shelf).
- Hashtag Search: Buy Elasticsearch/OpenSearch (Confluent Cloud for managed Kafka to feed it). Do not build an inverted index.
- Vector DB: Buy (Pinecone, Weaviate, or Milvus) for Explore. Building HNSW at 50B scale is a team-year project.
- CDN: Cloudflare or CloudFront. Never build a CDN.
- Object Storage: S3/GCS. Never build this.
Multi-Tenancy (B2B variant): For a white-label social platform, isolate at the Redis key prefix and Cassandra keyspace level: timeline:{tenantId}:{userId}, separate Kafka topics per tenant tier. Fan-out workers use separate consumer groups per tenant to prevent noisy-neighbour fan-out from affecting other tenants’ freshness SLOs.
TCO Ballpark (100K RPS reads):
| Component | Config | Est. Monthly Cost |
|---|---|---|
| Redis Cluster (timelines + counters) | 6× r7g.4xlarge (122GB RAM each) | ~$12,000 |
| Cassandra cluster | 8× i4i.4xlarge (NVMe SSD) | ~$16,000 |
| Kafka (MSK) | 3 brokers m5.2xlarge | ~$3,000 |
| Media Processor (Spot) | 50× c7g.2xlarge (70% spot) | ~$5,000 |
| Elasticsearch | 6× r6g.2xlarge | ~$6,000 |
| S3 + CDN egress | 131TB/day storage + ~74GB/s peak egress | ~$40,000 |
| Total | | ~$82,000/mo |
CDN egress is the dominant cost line at scale — not compute. WebP conversion and CDN cache optimisation directly reduce this. At 1M+ RPS, CDN egress routinely exceeds all compute costs combined.
Conway’s Law: Upload, Feed, Explore, and Stories should each be owned by separate teams. The fan-out amplification ratio is a shared KPI owned jointly by the Upload and Feed teams — it’s the single number that best reflects the health of the write-read contract. The CDN origin hit rate is owned jointly by Media and Infrastructure — it’s the cost/performance lever that transcends both team boundaries.
13. Interview Tips #
- Start with the upload pipeline. Most interviewers expect you to jump straight to feed generation, but the media pipeline is what makes Instagram different from Twitter. Mention pre-signed S3 upload URL and async processing in your first 2 minutes — it immediately signals Instagram-specific depth.
- Challenge the synchronous upload assumption. If you sketch a design where the API server resizes photos before returning a 200, the interviewer will probe until you hit the latency problem. Get there first: “Photo resizing takes ~500ms, video transcoding minutes — this has to be async.” Then describe the placeholder UX.
- Fan-out is the same as Twitter — acknowledge it and move fast. Don’t spend 10 minutes re-explaining push/pull hybrid. Say “same hybrid fan-out as a Twitter feed, celebrity threshold at 10K” and move to Instagram’s unique challenges: media pipeline, ranking overlay, Explore, Stories, hashtag index.
- Explore is a two-stage pipeline. Many candidates say “recommendation system” without explaining how you generate candidates from a 50B-post corpus without scanning it all. The ANN search over post embeddings is the key insight. Name HNSW and approximate nearest neighbour — it signals you know the tradeoff between recall and latency.
- Story expiry is a Cassandra TTL question in disguise. The interview question “how do you expire stories after 24 hours?” is really asking whether you know about Cassandra’s native TTL, Redis key TTL, and scheduled job patterns. Cassandra TTL is the cleanest answer — no external cron job, no at-scale delete storms.
- Like counter hot key is the concurrency question. A post with 1M likes in 1 hour generates ~280 INCR operations/second on a single Redis key — that's fine. A post with 10M likes/hour is ~2,800/second — still fine. But a viral post during the Super Bowl could hit 1M/minute (~16,700/second) on a single key. Introduce counter sharding (`likes:{post_id}:{shard}`) before the interviewer asks.
- Vocabulary that signals fluency: fan-out amplification ratio, pre-signed upload URL, content-addressed media keys, HLS adaptive bitrate, interest embedding, ANN search, HNSW, engagement score pre-computation, singleflight pattern, Cassandra TTL for Stories, WebP egress optimisation.
14. Further Reading #
- “Scaling Instagram Infrastructure” (Instagram Engineering Blog) — Instagram’s original 2012 talk on moving from PostgreSQL to Cassandra, and their 2017 talk on deliberately keeping a single Python Django monolith to simplify operations.
- “Unicorn: A System for Searching the Social Graph” (Meta, VLDB 2019) — How Meta handles social graph queries at multi-billion-node scale; directly applicable to the Social Graph Service design.
- “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks” (Google, 2019) — The architecture behind modern image embedding models used for Explore candidate generation.
- “FAISS: A Library for Efficient Similarity Search” (Meta AI) — The ANN library underlying Instagram’s Explore candidate generation pipeline; covers HNSW and IVF index tradeoffs.
- Martin Kleppmann, “Designing Data-Intensive Applications” — Chapter 3 covers LSM trees (Cassandra storage engine); Chapter 11 covers Kafka stream processing for fan-out and hashtag indexing pipelines.
- “Scaling to 100M Users” (High Scalability blog) — Instagram’s early architectural decisions and why a small team was able to scale to 100M users with ~13 engineers.