1. Hook #
At peak, Netflix accounts for roughly 15% of global internet downstream traffic — hundreds of terabits per second flowing to subscribers in 190+ countries. What makes this feasible is not raw bandwidth: it is a carefully engineered pipeline that converts every raw title into over 1,200 encoded video files before a single subscriber presses play, then serves those files from ISP-embedded appliances called Open Connect Appliances (OCAs) rather than from a traditional cloud CDN. The streaming experience you see — where the picture quality silently improves while you watch — is ABR (Adaptive Bitrate) streaming dynamically switching between those pre-encoded variants based on your network conditions. Behind the personalised rows on the homepage sits a recommendation engine that runs 45+ algorithms to surface the title you are most likely to start watching in the next 30 seconds. Each of these subsystems operates at a scale where a 0.1% drop in streaming reliability leaves 250,000 subscribers unable to watch at that moment.
2. Problem Statement #
Functional Requirements #
- Users can browse a personalised catalogue and play any licensed title.
- Video playback must start within 2 seconds and maintain seamless quality at varying bandwidths (ABR).
- Users can resume playback from the last-watched position across devices.
- Personalised homepage rows (Continue Watching, Top Picks, Because You Watched…).
- Search by title, genre, actor, or keyword.
- Content ingestion: studios submit raw files; Netflix transcodes and distributes to the CDN before launch date.
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Playback start latency (p95) | < 2s |
| Rebuffering ratio | < 0.1% of playback time |
| Global availability | 99.99% (< 53 min downtime/year) |
| CDN cache hit rate | > 95% of stream bytes |
| Recommendation freshness | < 24h after viewing |
| Scale | 250M subscribers, ~100M concurrent streams at peak |
Out of Scope #
- Live streaming (Netflix Live is a separate delivery path with different latency constraints)
- Content licensing and rights management
- Studio-side DRM (Digital Rights Management) key management
- Billing and subscription management
3. Scale Estimation #
Assumptions:
- 250M subscribers; ~40% active on any given evening peak (~100M concurrent).
- Average stream bitrate: 4 Mbps (mix of HD and UHD content).
- Average viewing session: 90 minutes.
- Catalogue: 15,000 titles; each title encoded into ~1,200 files averaging 50 GB/file.
- 95% of bytes served from OCA edge; 5% origin fallback.
| Metric | Calculation | Result |
|---|---|---|
| Peak concurrent streams | 250M × 40% | ~100M streams |
| Peak egress bandwidth | 100M × 4 Mbps | ~400 Tbps |
| OCA-served bandwidth | 400 Tbps × 95% | ~380 Tbps |
| Origin egress | 400 Tbps × 5% | ~20 Tbps |
| Catalogue storage (encoded) | 15,000 × 1,200 × 50 GB | ~900 PB |
| New title ingest/day | 10 titles × 1,200 files × 50 GB | ~600 TB/day transcode output |
| Playback event writes/s | 100M streams × 1 event/30s | ~3.3M writes/s |
| Recommendation API calls/s | 250M × 5 opens/day / 86,400 | ~14,500 rec calls/s |
The 400 Tbps peak egress is the dominant engineering constraint. No single cloud provider can serve this economically; hence Netflix operates its own CDN embedded inside ISP networks.
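As a sanity check, the table's arithmetic can be reproduced in a few lines (class and method names here are illustrative, not part of any real system):

```java
public class ScaleEstimate {
    // All figures mirror the assumptions stated above.
    static final double SUBSCRIBERS = 250e6;
    static final double PEAK_ACTIVE_SHARE = 0.40;
    static final double AVG_BITRATE_MBPS = 4.0;

    /** Peak concurrent streams: subscribers × peak-active share. */
    public static double peakStreams() {
        return SUBSCRIBERS * PEAK_ACTIVE_SHARE; // ~100M
    }

    /** Peak egress in Tbps: streams × Mbps ÷ 1e6. */
    public static double peakEgressTbps() {
        return peakStreams() * AVG_BITRATE_MBPS / 1e6; // ~400 Tbps
    }

    /** Playback event writes/s at one heartbeat every 30 seconds. */
    public static double eventWritesPerSec() {
        return peakStreams() / 30.0; // ~3.3M/s
    }
}
```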
4. High-Level Design #
The system decomposes into three independent planes: the content ingestion plane (raw upload → transcode → CDN pre-position), the streaming plane (manifest generation → OCA selection → ABR playback), and the discovery plane (personalisation, search, and browse).
```mermaid
flowchart TD
  subgraph Studio["Studio / Content Partner"]
    RAW["Raw Mezzanine File<br/>(ProRes / IMF)"]
  end
  subgraph Ingest["Content Ingestion Plane"]
    CIP["Content Ingestion<br/>Service (Hollow)"]
    CHUNKER["Media Chunker<br/>2-second segments"]
    ENCODER["Distributed Encoder<br/>AWS + Spot Fleet"]
    S3O["S3 Origin Store<br/>900 PB encoded files"]
    PREPOS["Pre-Position Service<br/>OCA Seeder"]
  end
  subgraph OCA["Open Connect CDN"]
    OCA1["OCA Appliance<br/>ISP-embedded cache"]
    OCA2["OCA Cluster<br/>Regional PoP"]
  end
  subgraph API["API Layer"]
    GW["API Gateway<br/>Zuul — Auth · Rate · Route"]
    PLAY["Playback Service<br/>Manifest · License · OCA URL"]
    REC["Recommendation<br/>Service (Meson)"]
    SRCH["Search Service<br/>Elasticsearch"]
    RES["Resume Point Service"]
  end
  subgraph Client["Client"]
    APP["Netflix App<br/>ABR Player (ExoPlayer / Custom)"]
  end
  subgraph Data["Data Layer"]
    CASS["Cassandra<br/>Viewing history · Resume points"]
    EVT["Kafka<br/>Playback events stream"]
    FLINK["Flink<br/>Real-time viewing aggregation"]
    EVE["EV Cache (memcached)<br/>Rec rows · Homepage"]
  end
  RAW --> CIP
  CIP --> CHUNKER
  CHUNKER --> ENCODER
  ENCODER --> S3O
  S3O --> PREPOS
  PREPOS -->|"push popular titles"| OCA1
  PREPOS -->|"push popular titles"| OCA2
  APP -->|"1 — GET /api/manifest"| GW
  GW --> PLAY
  PLAY -->|"pick nearest OCA"| OCA2
  PLAY -->|"DRM license"| PLAY
  PLAY -->|"signed manifest URL"| APP
  APP -->|"2 — fetch segments direct"| OCA1
  APP -->|"3 — playback heartbeat"| GW
  GW --> RES
  RES --> CASS
  GW --> EVT
  EVT --> FLINK
  FLINK --> CASS
  APP -->|"GET /home"| GW
  GW --> REC
  REC -->|"cached rows"| EVE
  REC -->|"history"| CASS
```
Component Reference #
| Component | Technology | Responsibility |
|---|---|---|
| Content Ingestion Service | Java + Hollow (in-memory dataset) | Receive studio files, validate, chunk into 2-second GoP-aligned segments |
| Distributed Encoder | VMAF-optimised FFmpeg on AWS Spot | Produce 1,200+ renditions across codecs (H.264, H.265, AV1) and resolutions |
| Open Connect Appliance | Custom FreeBSD servers running NGINX, 100–280 TB SSD per unit | ISP-embedded edge cache serving 95%+ of stream bytes, avoiding public internet |
| Playback Service | Java microservice, Zuul gateway | Generate ABR manifest (DASH/HLS), select nearest OCA, issue DRM license token |
| ABR Player | ExoPlayer (Android), custom (iOS/TV) | Real-time bandwidth estimation; switches bitrate without rebuffering |
| Recommendation Service (Meson) | Python ML + Java serving, EV Cache | Run 45+ algorithms, merge, rank, and cache personalised homepage rows |
| EV Cache | Memcached clusters, multi-AZ replicated | Cache recommendation rows, metadata, and frequently-read user state |
| Viewing History (Cassandra) | Apache Cassandra, wide rows by user_id | Resume points, playback history, watch progress for 250M users |
| Flink Pipeline | Apache Flink on Kafka streams | Aggregate real-time viewing events for A/B metrics and recommendation freshness |
5. Deep Dive #
5.1 Content Encoding Pipeline #
The raw studio file arrives as a ProRes or IMF (Interoperable Master Format) package. Netflix’s pipeline, called Archer, performs per-title encoding optimisation: instead of encoding every title at fixed bitrate ladders, it analyses scene complexity (using VMAF — Video Multimethod Assessment Fusion) and assigns each shot its optimal QP (Quantisation Parameter). A simple talking-head scene needs far fewer bits than a fast-action chase sequence to look identical to the human eye.
The chunker splits the mezzanine into 2-second GoP (Group of Pictures) aligned segments. These boundaries are intentional: the ABR player can switch renditions only at GoP boundaries without visual glitches. After chunking, thousands of AWS Spot instances encode each 2-second chunk independently, then segments are stitched back in order. This parallel encoding reduces a feature film from 24 hours of single-machine encoding to under 30 minutes.
Output: ~1,200 files per title (6+ codecs × 20+ bitrate rungs × audio variants × subtitle tracks), stored in S3. Popular titles are immediately pre-positioned to OCAs globally.
```java
// Simplified segment assignment for parallel encoding.
// Title and Codec are domain types defined elsewhere in the pipeline.
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

record EncodeJob(String titleId, int segmentIndex, Duration offset,
                 Codec codec, int targetBitrate) {}

class EncodeJobPlanner {
    /** One job per (segment × codec × bitrate rung); each is independently encodable. */
    public List<EncodeJob> explodeJobs(Title title, List<Codec> codecs,
                                       List<Integer> bitrateRungs) {
        var jobs = new ArrayList<EncodeJob>();
        for (var segment : title.segments()) {       // 2-second chunks
            for (var codec : codecs) {
                for (var bitrate : bitrateRungs) {
                    jobs.add(new EncodeJob(
                            title.id(), segment.index(),
                            segment.offset(), codec, bitrate));
                }
            }
        }
        return jobs; // dispatched to the Spot fleet via SQS
    }
}
```

5.2 Open Connect CDN #
Most CDNs sit in hyperscaler datacentres and traffic traverses the public internet to reach subscribers. Netflix instead installs OCA appliances inside ISP and IXP (Internet Exchange Point) racks under a free-of-charge agreement: ISPs get to offload upstream transit costs; Netflix pays only hardware.
Each OCA holds 100–280 TB of SSD content. A pre-positioning algorithm (the OCA seeder) runs nightly: it predicts what each OCA’s subscriber base will watch tomorrow based on historical demand and pushes that content proactively over an internal backbone, so that when a subscriber presses play, the segment is already local.
OCA selection by the Playback Service:
- Identify subscriber’s ISP and geography.
- Query the OCA steering service for the top-3 candidate appliances by latency (via BGP-level proximity).
- Issue a manifest whose segment URLs point to that OCA.
- If the OCA misses (< 5% of the time), fall back to S3 origin.
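The steps above can be sketched as a small steering function. `OcaCandidate`, its fields, and the origin URL are hypothetical stand-ins, not Netflix's actual steering API:

```java
import java.util.Comparator;
import java.util.List;

public class OcaSteering {
    /** A candidate edge appliance with measured proximity and health (illustrative). */
    public record OcaCandidate(String host, double latencyMs,
                               boolean hasContent, boolean healthy) {}

    static final String ORIGIN_URL = "https://origin.example.net"; // S3 origin fallback

    /**
     * Pick the lowest-latency healthy OCA that already holds the content;
     * fall back to origin if none qualifies (the <5% miss path).
     */
    public static String selectSegmentBase(List<OcaCandidate> top3) {
        return top3.stream()
                .filter(c -> c.healthy() && c.hasContent())
                .min(Comparator.comparingDouble(OcaCandidate::latencyMs))
                .map(c -> "https://" + c.host())
                .orElse(ORIGIN_URL);
    }
}
```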
5.3 ABR (Adaptive Bitrate) Streaming #
The manifest served to the player is a DASH (Dynamic Adaptive Streaming over HTTP) MPD or an HLS (HTTP Live Streaming) playlist listing all renditions and their segment URLs. The player’s ABR algorithm (BOLA — Buffer Occupancy based Lyapunov Algorithm) makes a switching decision every 2-second segment:
- If buffer > 20 seconds: step up to next bitrate rung.
- If buffer < 10 seconds: step down immediately.
- Bandwidth estimate is an exponential moving average of the last 5 segment download speeds.
The result: a user on a fluctuating LTE connection never sees a rebuffering spinner — they silently watch at 720p instead of 4K until the network recovers.
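A minimal sketch of the threshold rules above, assuming a simple EMA bandwidth estimator. The production BOLA algorithm optimises a Lyapunov utility function rather than fixed thresholds, and all names here are illustrative:

```java
public class AbrController {
    public enum Switch { UP, DOWN, HOLD }

    private double bandwidthEmaKbps = 0;
    // Smoothing factor roughly equivalent to averaging the last 5 samples.
    private static final double ALPHA = 2.0 / (5 + 1);

    /** Fold a new segment's measured throughput into the running estimate. */
    public void observeThroughput(double kbps) {
        bandwidthEmaKbps = bandwidthEmaKbps == 0
                ? kbps
                : ALPHA * kbps + (1 - ALPHA) * bandwidthEmaKbps;
    }

    /** Threshold rules from the text: <10 s buffer steps down, >20 s may step up. */
    public Switch decide(double bufferSeconds, double nextRungKbps) {
        if (bufferSeconds < 10) return Switch.DOWN;
        if (bufferSeconds > 20 && bandwidthEmaKbps > nextRungKbps) return Switch.UP;
        return Switch.HOLD;
    }
}
```

Note the asymmetry: stepping down is immediate (rebuffering is the worst outcome), while stepping up additionally requires the bandwidth estimate to support the higher rung.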
5.4 Recommendation Engine #
Netflix’s Meson orchestration layer runs 45+ individual recommendation algorithms in parallel: collaborative filtering, content-based similarity, trending-in-country, continue-watching, because-you-watched, and more. Each produces a ranked list of titles. Meson merges these lists into homepage rows using a stacking policy (diversity constraints prevent the same genre appearing three rows in a row).
Candidate generation uses a two-tower neural network: a user tower encodes a subscriber’s viewing history and demographic features into a 256-dimensional embedding; a title tower encodes content features. ANN (Approximate Nearest Neighbour) search over the title tower retrieves the top-500 candidate titles in milliseconds.
Re-ranking applies a GBM (Gradient Boosted Machine) that considers context (time of day, device, session length) and short-term signals (what you watched last night) to produce the final ordered list cached in EV Cache for 30 minutes.
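The candidate-generation step can be illustrated with a brute-force scorer over title embeddings; a production system substitutes an ANN index for the linear scan, and `topK` here is a hypothetical helper:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

public class CandidateRetrieval {
    /** Dot product between user and title embeddings of equal dimension. */
    static double score(double[] user, double[] title) {
        double s = 0;
        for (int i = 0; i < user.length; i++) s += user[i] * title[i];
        return s;
    }

    /** Brute-force top-K by score; a real system replaces this with ANN search. */
    public static List<Integer> topK(double[] user, double[][] titles, int k) {
        return IntStream.range(0, titles.length).boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> -score(user, titles[i])))
                .limit(k)
                .toList();
    }
}
```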
6. Data Model #
6.1 Viewing History & Resume Points (Cassandra) #
Table: viewing_history

```
Partition key:  user_id (UUID)
Clustering key: viewed_at DESC
Columns:
  title_id       UUID
  profile_id     UUID       -- Netflix supports multiple profiles per account
  resume_offset  DURATION   -- seconds from start
  watched_pct    FLOAT      -- 0.0–1.0
  device_type    TEXT
  viewed_at      TIMESTAMP
```

Wide rows by user_id allow fast LIMIT 500 scans for a user’s recent history without cross-partition joins. TTL (Time-To-Live) of 2 years keeps storage bounded.
6.2 Content Metadata (MySQL + Hollow) #
Canonical metadata (title, genre, cast, ratings) lives in MySQL (ACID for rights and metadata updates). Netflix Hollow replicates a snapshot to an in-memory read-only dataset on every JVM in the fleet, so metadata reads are sub-microsecond with zero network hops.
6.3 Playback Event Stream (Kafka) #
| Field | Type | Notes |
|---|---|---|
| user_id | UUID | Partitioned by user_id for ordering |
| session_id | UUID | Groups events for one viewing session |
| title_id | UUID | |
| event_type | ENUM | start, pause, seek, quality_change, stop |
| playback_position | DURATION | Current position in stream |
| rendition_bitrate | INT | kbps, for QoE monitoring |
| rebuffer_count | INT | Rebuffering events since last heartbeat |
| ts | TIMESTAMP | Client-side wall clock |
Kafka topic playback-events has 1,024 partitions (scales to ~3M events/s). Flink jobs consume this stream to update Cassandra resume points and feed real-time A/B experiment metrics.
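Per-user ordering follows from routing every event for a user to the same partition. A minimal sketch of that routing (Kafka's actual default partitioner hashes keys with murmur2, not Java's `hashCode`):

```java
public class EventRouting {
    static final int PARTITIONS = 1024; // matches the playback-events topic

    /** All events for one user land on the same partition, preserving their order. */
    public static int partitionFor(String userId) {
        // floorMod guards against negative hashCode values.
        return Math.floorMod(userId.hashCode(), PARTITIONS);
    }
}
```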
7. Trade-offs #
7.1 Push (Pre-position) vs Pull (On-demand fetch) for CDN #
| Option | Pros | Cons | When |
|---|---|---|---|
| Pre-position (Push) | Zero cache-miss latency, predictable ISP traffic | Storage cost at edge, wasted storage for unpopular titles | Netflix: popular catalogue |
| On-demand (Pull) | No wasted storage, always fresh | First-viewer pays origin latency; unpredictable burst | Long-tail content, live events |
| Hybrid | Optimises for popularity distribution | Complexity in prediction model | Netflix’s actual approach |
Decision: Netflix pre-positions the top ~20% of titles that account for ~80% of streams; the long tail is served on-demand from S3 origin through OCA pull.
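That popularity cut-off can be sketched as a greedy selection over predicted demand; `TitleDemand` and `selectForEdge` are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PrePosition {
    public record TitleDemand(String titleId, double predictedStreams) {}

    /**
     * Greedily take the most popular titles until they cover the target share
     * of predicted demand (~80% of streams from ~20% of titles in practice).
     */
    public static List<String> selectForEdge(List<TitleDemand> demand, double targetShare) {
        double total = demand.stream().mapToDouble(TitleDemand::predictedStreams).sum();
        var sorted = new ArrayList<>(demand);
        sorted.sort(Comparator.comparingDouble(TitleDemand::predictedStreams).reversed());
        var picked = new ArrayList<String>();
        double covered = 0;
        for (var t : sorted) {
            if (covered >= targetShare * total) break;
            picked.add(t.titleId());
            covered += t.predictedStreams();
        }
        return picked;
    }
}
```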
7.2 DASH vs HLS #
| Attribute | DASH | HLS |
|---|---|---|
| Segment format | fMP4 (fragmented MP4) | MPEG-TS or fMP4 |
| DRM support | Widevine, PlayReady natively | FairPlay (Apple only) |
| Latency | ~6-10s live; near-zero for VOD | Similar |
| Adoption | Android, Smart TVs, Web | iOS, macOS, Safari required |
Decision: Netflix serves DASH on most devices (better codec flexibility and DRM) and HLS on Apple devices (platform requirement).
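The routing decision reduces to a platform check; a sketch with illustrative platform strings:

```java
public class StreamingProtocol {
    public enum Protocol { DASH, HLS }

    /** Apple platforms require HLS + FairPlay; everything else gets DASH. */
    public static Protocol forDevice(String platform) {
        return switch (platform.toLowerCase()) {
            case "ios", "tvos", "macos", "safari" -> Protocol.HLS;
            default -> Protocol.DASH;
        };
    }
}
```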
7.3 Monolithic vs Microservice Encoding Pipeline #
Early Netflix encoded on monolithic on-premise servers. Moving to a distributed Spot fleet reduced encoding cost by 60% (Spot is 70-90% cheaper than On-Demand) at the cost of fault-tolerance complexity: any Spot instance can be reclaimed with 2 minutes notice, so each segment is encoded idempotently and checkpointed to S3. If an instance is reclaimed, the job is retried on a different instance with no data loss.
7.4 CAP: Availability over Consistency for Viewing History #
Cassandra’s tunable consistency allows Netflix to choose QUORUM for critical writes (resume point) and ONE for reads. If a Cassandra node is unavailable, the player starts from the beginning rather than returning an error — an availability-over-consistency choice that errs on the side of a working product.
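The degrade-to-zero read path can be sketched as follows, with the Cassandra call abstracted behind a `Supplier` (a hypothetical wrapper, not the real driver API):

```java
import java.util.Optional;
import java.util.function.Supplier;

public class ResumePoint {
    /**
     * Read the resume offset; on any failure or miss, degrade to offset 0
     * (start from the beginning) instead of surfacing an error to the player.
     */
    public static long readOffsetSeconds(Supplier<Optional<Long>> cassandraRead) {
        try {
            return cassandraRead.get().orElse(0L);
        } catch (RuntimeException unavailable) {
            return 0L; // availability over consistency
        }
    }
}
```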
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| OCA appliance | Hardware failure | Subscribers in that ISP see playback start failures | Playback Service fails over to secondary OCA; origin fallback via S3 |
| Playback Service | All instances unhealthy | No new streams can start | Multi-region active-active; Hystrix (now Resilience4j) circuit breaker |
| Cassandra cluster | Network partition | Resume point read fails | Return offset=0 (start from beginning) — availability > consistency |
| Kafka consumer lag | Flink falls behind | Resume points stale by minutes | Lag alerting at 30s; Flink auto-scales consumer group; DLQ (Dead Letter Queue) for malformed events |
| Recommendation service | Cold start / model crash | Homepage shows stale or generic rows | EV Cache serves last-good cached rows for up to 1 hour; fallback to globally-popular titles |
| Thundering herd | New popular title released | Millions simultaneously request OCA for uncached segments | Pre-position runs 24h before release; jitter added to manifest TTL to spread OCA fetch |
| Hot partition in Cassandra | Celebrity account (shared profile_id) | Single partition overwhelmed | Profile-level sharding; write-behind with Kafka buffer |
| S3 origin slow | CDN miss path degraded | Long start times for long-tail content | S3 Transfer Acceleration; multi-region S3 replication |
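The TTL-jitter mitigation from the thundering-herd row above is small enough to show whole; base and jitter values here are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

public class ManifestTtl {
    /**
     * Base TTL ± up to jitterSeconds, so caches that were filled at the same
     * moment do not all refetch from the OCA at the same moment.
     */
    public static long withJitter(long baseSeconds, long jitterSeconds) {
        long delta = ThreadLocalRandom.current().nextLong(-jitterSeconds, jitterSeconds + 1);
        return baseSeconds + delta;
    }
}
```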
9. Security & Compliance #
Authentication & Authorisation: Users authenticate via OAuth2 tokens; the API Gateway validates tokens using a shared JWK (JSON Web Key) set. Device-level tokens are rotated on each session. Profile-level access within an account uses a separate scoped token.
DRM: Netflix uses a multi-DRM approach — Widevine (Google) for Android/Chrome/Smart TVs, PlayReady (Microsoft) for Windows/Xbox, and FairPlay (Apple) for iOS/macOS. The Playback Service issues a short-lived (4-hour TTL) license token per session; the player cannot cache the decryption key beyond that window.
Encryption in Transit: All client traffic uses TLS 1.3 (Transport Layer Security). OCA-to-origin traffic uses mTLS (mutual TLS) with certificate pinning.
Encryption at Rest: S3 objects encrypted with AES-256 (Advanced Encryption Standard) using per-title KMS (Key Management Service) keys. Cassandra at-rest encryption using TDE (Transparent Data Encryption).
Input Validation: The Playback Service validates all manifest requests: title must be licensed for subscriber’s country; profile must belong to the account; device fingerprint must match the issued token.
PII (Personally Identifiable Information) / GDPR: Viewing history is subject to GDPR right-to-erasure. Netflix implements crypto-shredding: each user’s Cassandra data is encrypted with a per-user DEK (Data Encryption Key) stored in Vault; erasure deletes the DEK, making historical rows unreadable without needing to delete Cassandra rows.
Content Security: Watermarking embeds a per-subscriber invisible forensic watermark in video frames at encode time, enabling piracy tracing.
Audit Logging: All Playback Service decisions (OCA selected, DRM license issued) are written to an immutable audit log in S3 for SOC 2 (System and Organisation Controls 2) compliance.
Rate Limiting: API Gateway enforces per-account rate limits on manifest requests (max 50/minute) to prevent credential-sharing automation.
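A per-account limit of 50 manifest requests/minute maps naturally onto a token bucket. A minimal sketch with an injected clock for testability (the gateway's actual limiter implementation is not public):

```java
public class AccountRateLimiter {
    private final double capacity;      // max burst, e.g. 50 manifest requests
    private final double refillPerSec;  // e.g. 50.0 / 60 for 50 per minute
    private double tokens;
    private long lastNanos;

    public AccountRateLimiter(double capacity, double refillPerSec, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastNanos = nowNanos;
    }

    /** Returns true if the request is admitted; caller supplies the current time. */
    public synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSec = (nowNanos - lastNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```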
10. Observability #
RED Metrics (Rate, Errors, Duration) #
| Service | Rate | Error | Duration |
|---|---|---|---|
| Playback Service | Manifest requests/s | 5xx rate | p99 manifest latency |
| OCA | Bytes served/s | Cache miss rate | Segment fetch latency (p95) |
| Recommendation | Homepage loads/s | Rec service error rate | Row generation latency |
| Cassandra | Reads + writes/s | Timeout rate | p99 read/write latency |
Streaming Quality (QoE — Quality of Experience) #
| Metric | Alert Threshold | Why |
|---|---|---|
| Rebuffering ratio | > 0.1% of playback time | Leading indicator of churn |
| Video start failure rate | > 0.5% of play attempts | Upstream playback issues |
| Startup latency (p95) | > 2s | Direct UX impact |
| Bitrate switches down | > 2 per session avg | Network or OCA instability |
Business Metrics #
- Stream completions (> 90% viewed) — content quality proxy
- Engagement per session — recommendation effectiveness
- OCA cache hit ratio — cost efficiency signal
Tracing #
Netflix uses Edgar (an internal OpenTelemetry-based system) to trace the full playback path: client → Zuul → Playback Service → OCA selection → segment fetch. Tail-based sampling at 1% of sessions, 100% on sessions with rebuffering events.
Alerting #
- PagerDuty on-call for rebuffering ratio spike (3-minute rolling average > threshold).
- Automated OCA health probes every 30 seconds; unhealthy OCA removed from steering table within 90 seconds.
11. Scaling Path #
Phase 1 — MVP (< 1,000 concurrent streams) #
Single-region. Nginx/FFmpeg on EC2, videos in S3, PostgreSQL for user state. A monolithic Java API serves everything. Manual encoding jobs. No CDN — serve directly from S3 with CloudFront as a simple cache. What breaks first: S3 egress cost and latency as concurrent streams grow.
Phase 2 — Regional Scale (1K → 100K concurrent streams) #
Decompose into Playback, Recommendation, and User microservices. Add CloudFront CDN. Replace PostgreSQL with Cassandra for user history (write amplification kills RDBMS at this scale). Begin automated distributed encoding on Spot. Add Kafka for playback event streaming. What breaks first: CloudFront CDN cache hit rate drops below 80% for long-tail content; costs spike.
Phase 3 — National Scale (100K → 10M concurrent streams) #
Deploy Open Connect Appliances in the top-20 ISP partners. Add EV Cache (memcached) layer for recommendation rows. Implement Hollow for metadata. Build the pre-positioning system. Shard Cassandra to 3 regional clusters. What breaks first: OCA coverage gaps; subscribers without an OCA-partnered ISP see high origin egress.
Phase 4 — Global Scale (10M → 100M+ concurrent streams) #
Full global OCA rollout (1,000+ appliances). Multi-region active-active API layer. Per-title VMAF-optimised encoding. AV1 codec for 30% bandwidth reduction. A/B framework drives every algorithm decision. Chaos Engineering (Chaos Monkey, Chaos Kong for region failures) is standard practice. What breaks now: encoding pipeline throughput for growing catalogue; recommendation model freshness at this data volume.
12. Enterprise Considerations #
Brownfield Integration: Enterprises migrating legacy VOD (Video-on-Demand) platforms to this architecture should start with the playback service and CDN layer first — these deliver the biggest latency and cost wins — before investing in the recommendation engine (highest ML complexity).
Build vs Buy:
| Component | Build | Buy |
|---|---|---|
| ABR encoding | Netflix built Archer + VMAF (title-level optimisation is unique) | FFmpeg is open-source; AWS Elemental for managed encoding |
| CDN | Netflix built OCA (ISP partnerships justify it at 400 Tbps) | CloudFront / Fastly / Akamai — correct choice for < 10 Tbps |
| Recommendation | Build two-tower + GBM on your own data | AWS Personalize, Google Recommendations AI |
| DRM | Integrate Widevine + FairPlay SDKs | AWS Elemental MediaConvert for DRM packaging |
Multi-Tenancy: A multi-tenant streaming platform (e.g., an OTT (Over-The-Top) SaaS provider serving multiple media companies) must isolate encoding pipelines per tenant, enforce per-tenant content security policies, and prevent cross-tenant DRM key exposure. Separate Cassandra keyspaces per tenant; shared Kafka with per-tenant topic prefixes.
Vendor Lock-in: The OCA appliance strategy creates hardware lock-in (custom appliance hardware and ISP agreements). For a smaller operator, Fastly or Cloudflare Stream avoids lock-in at higher per-GB cost. Netflix’s 400 Tbps justifies the capex; at 10 Tbps the economics flip.
TCO (Total Cost of Ownership) Ballpark: At 10 Tbps peak egress, CloudFront costs ~$0.0085/GB ≈ $8.5M/month; OCA amortised hardware + hosting ≈ $1.5M/month at this scale. OCA breaks even at roughly 5 Tbps sustained.
Conway’s Law: Netflix’s org mirrors the architecture — separate teams own encoding, CDN, playback, and recommendations. The API contract between Playback Service and OCA is the boundary where team autonomy ends. Any shared-nothing boundary reduces coordination tax.
13. Interview Tips #
- Clarify “streaming” scope early. VOD (Video on Demand) and live streaming are architecturally distinct. Confirm which one before drawing any diagram. VOD = pre-encoded assets + CDN; live = ingest → transcode → ultra-low-latency delivery path.
- Drive toward the CDN. Interviewers expect you to articulate why serving from origin is untenable at Netflix scale and what a CDN buys you. Name OCA or at least the concept of ISP-embedded edge nodes.
- Show the encoding pipeline. Candidates who skip directly to “store in S3 and stream” miss the dominant complexity: you must pre-encode 1,200 renditions before a single play is possible. Explain why GoP-aligned 2-second segments are required for ABR switching.
- Name ABR explicitly. “The player dynamically switches quality” is not enough. Name the acronym ABR, explain that it requires the manifest to list all renditions, and that the player makes a switching decision every segment.
- Common mistake: Designing a single-bitrate streaming system. Any senior interviewer will immediately ask “what happens when the user’s network degrades?” — the answer is ABR, not “it buffers.”
- Fluency vocabulary: GoP (Group of Pictures), VMAF (Video Multimethod Assessment Fusion), ABR (Adaptive Bitrate), DASH (Dynamic Adaptive Streaming over HTTP), HLS (HTTP Live Streaming), OCA (Open Connect Appliance), DRM (Digital Rights Management), QoE (Quality of Experience), EV Cache, Hollow, fMP4 (fragmented MP4), OTT (Over-The-Top).
14. Further Reading #
- Netflix Tech Blog — “Optimizing the Netflix Streaming Experience with Data Science” — covers the QoE metrics and rebuffering model.
- DASH Industry Forum — DASH-IF Interoperability Guidelines — canonical specification for the manifest format and segment addressing.
- “Bola: Near-Optimal Bitrate Adaptation for Online Videos” (SIGCOMM 2016) — the algorithm behind Netflix’s ABR player switching logic.
- Netflix Tech Blog — “Open Connect Everywhere” — details the OCA deployment model and ISP partnership economics.
- “A Survey of Adaptive Video Streaming with Reinforcement Learning” (IEEE 2020) — covers RL-based ABR approaches that Netflix and others have experimented with.