1. Hook #
Every minute, creators upload 500 hours of video to YouTube — roughly 720,000 hours of raw footage per day that must be validated, transcoded into 10+ adaptive formats, and made globally available before viewers ever click play. Unlike Netflix (a closed catalogue of licensed titles transcoded offline), YouTube is a live upload platform: a creator in Lagos hits “publish” and expects global playback within minutes. The upload pipeline, transcoding infrastructure, and two-tier CDN (Content Delivery Network) that make this possible are among the most complex media-engineering systems on the planet. On the consumption side, 2 billion+ logged-in users watch over 1 billion hours of video daily — a recommendation challenge that dwarfs most advertising systems in latency sensitivity and business impact. If the recommendation model serves the wrong video, engagement drops; if the transcoder stalls, creators lose monetisation time.
2. Problem Statement #
Functional Requirements #
- Creators can upload videos (any container/codec, up to 12 hours / 256 GB).
- Uploaded videos are transcoded into multiple resolutions and codecs (AVC / VP9 / AV1) within minutes of upload.
- Viewers can stream any video at adaptive bitrates across devices worldwide.
- Viewers can search by keyword, title, channel, or topic.
- Viewers can post, vote on, and read threaded comments.
- The homepage and sidebar serve personalised video recommendations in < 200 ms.
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Transcode completion (p95) | < 5 min after upload (standard HD), < 30 min (4K HDR) |
| Playback start latency (p95) | < 2 s |
| Video availability after publish | < 10 min globally |
| Platform availability | 99.99% (< 53 min downtime/year) |
| Search freshness | < 15 min after publish |
| Comment write latency | < 500 ms |
| Recommendation API latency (p99) | < 200 ms |
Out of Scope #
- Live streaming (YouTube Live is a separate RTMP/WebRTC ingest path)
- Creator monetisation and ad serving
- Copyright detection (Content ID is a separate fingerprinting system)
- Billing and channel memberships
3. Scale Estimation #
Assumptions:
- 2B logged-in users; 500M Daily Active Users (DAU).
- 500 hours of video uploaded per minute → ~30,000 minutes/min raw upload.
- Average upload size: 1 GB per 5-minute video → ~200 MB/min of raw video.
- Average transcoded output: ~10 format/resolution variants totalling ~500 MB per hour of source video.
- 1B hours of watch time per day; average session 6 minutes → ~10B video views/day.
- Comment volume: 100M comments/day.
| Metric | Calculation | Result |
|---|---|---|
| Upload writes (raw) | 500 hrs/min × 60 min/hr × 200 MB/min | ~6 TB/min raw ingest |
| Transcoder output/day | 720,000 video-hours × 500 MB/video-hr | ~360 TB/day |
| Stored video (5-year corpus) | 500 hrs/min × 526,000 min/yr × 5 yr × 500 MB | ~660 PB |
| Read QPS (video manifest) | 10B views / 86,400 s | ~115,000 QPS |
| Peak CDN egress | 500M concurrent viewers × 4 Mbps avg | ~2 Pbps at peak |
| Comment writes/s | 100M / 86,400 | ~1,160 writes/s |
| Comment reads/s | ×50 read/write ratio | ~58,000 reads/s |
| Search QPS | 500M DAU × 4 searches/day / 86,400 | ~23,000 QPS |
| Recommendation calls/s | 500M DAU × 10 impressions/session / 86,400 | ~58,000 QPS |
The CDN egress (~2 Pbps) is the dominant constraint, pushing YouTube to operate a two-tier CDN with ISP-embedded edge caches, similar in philosophy to Netflix’s OCA (Open Connect Appliances).
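These figures are easy to sanity-check in code. A minimal back-of-envelope sketch in Java; the constants simply restate the assumptions above and are illustrative, not measured:

```java
// Back-of-envelope sanity check for the headline numbers above (illustrative constants).
public final class ScaleEstimate {
    public static void main(String[] args) {
        double uploadHoursPerMin = 500;                        // creator ingest rate
        double rawMbPerVideoMin = 200;                         // 1 GB per 5-minute video
        double rawIngestTbPerMin = uploadHoursPerMin * 60 * rawMbPerVideoMin / 1_000_000;
        double viewsPerDay = 10e9;                             // 1B watch-hours / 6-minute sessions
        double manifestQps = viewsPerDay / 86_400;
        double peakEgressPbps = 500e6 * 4 / 1e9;               // 500M concurrent viewers x 4 Mbps
        System.out.printf("raw ingest  ~%.0f TB/min%n", rawIngestTbPerMin);   // ~6 TB/min
        System.out.printf("manifest    ~%.0f QPS%n", manifestQps);            // ~115,000 QPS
        System.out.printf("peak egress ~%.0f Pbps%n", peakEgressPbps);        // ~2 Pbps
    }
}
```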
4. High-Level Design #
The system decomposes into four independent planes: the upload & transcode plane (raw ingest → encoding DAG → object storage), the streaming plane (manifest API → CDN tier → ABR (Adaptive Bitrate) player), the engagement plane (comments, likes, view counts), and the discovery plane (search, recommendations).
flowchart TD
subgraph Creator["Creator"]
UP["Upload Client\n(browser / app)"]
end
subgraph Ingest["Upload & Transcode Plane"]
ULS["Upload Service\n(resumable, chunked)"]
RAW["Raw Object Store\n(GCS)"]
TQ["Transcode Job Queue\n(Pub/Sub)"]
TC["Transcode Workers\n(FFmpeg fleet)"]
ENC["Encoded Segments\n(GCS — DASH / HLS)"]
META["Video Metadata DB\n(Spanner)"]
end
subgraph Stream["Streaming Plane"]
MAPI["Manifest API"]
L1["Tier-1 Edge Cache\n(ISP PoP)"]
L2["Tier-2 Regional Cache\n(Google PoP)"]
ORI["Origin Servers\n(GCS signed URL)"]
end
subgraph Discovery["Discovery Plane"]
SRCH["Search Service\n(inverted index)"]
REC["Recommendation Service\n(two-tower model)"]
FEED["Homepage API"]
end
subgraph Engage["Engagement Plane"]
CMT["Comment Service"]
VCNT["View Count Service\n(Bigtable)"]
end
UP -->|"Resumable PUT (chunks)"| ULS
ULS --> RAW
RAW --> TQ
TQ --> TC
TC --> ENC
TC --> META
ENC --> L2
L2 --> L1
MAPI --> L1
L1 -->|cache miss| L2
L2 -->|cache miss| ORI
ORI --> ENC
FEED --> REC
FEED --> SRCH
CMT --> META
UP -->|"publish event"| SRCH
UP -->|"publish event"| REC
Write path (upload): Creator client uploads in 5 MB chunks via a resumable upload protocol to the Upload Service. Once all chunks land in raw object storage, a job message is pushed to the Transcode Queue. A Transcode Worker picks up the job, runs parallel FFmpeg processes for each output format, and writes encoded MPEG-DASH (Dynamic Adaptive Streaming over HTTP) and HLS (HTTP Live Streaming) segments back to object storage. The Metadata DB is updated; a publish event triggers Search indexing and Recommendation model updates.
Read path (streaming): The player fetches a manifest (MPD or M3U8) from the Manifest API, which returns pre-signed segment URLs pointing at the nearest Tier-1 edge cache (embedded at the ISP). Cache misses fall through to Tier-2 Google PoP caches, then to origin object storage. The ABR player selects the appropriate bitrate variant every few seconds based on measured throughput.
| Component | Technology | Role |
|---|---|---|
| Upload Service | Custom gRPC (Google Remote Procedure Call) + GCS resumable API | Chunked ingest, deduplication, virus scan |
| Raw & Encoded Store | Google Cloud Storage (GCS) | Durable object storage, region-replicated |
| Transcode Queue | Google Pub/Sub | Durable job fan-out to transcode workers |
| Transcode Workers | FFmpeg on preemptible VMs / custom ASICs | Parallel encode: H.264, VP9, AV1 at 360p–4K |
| Video Metadata DB | Cloud Spanner | Globally consistent video metadata, channel info |
| View Count Store | Cloud Bigtable | High-write counter sharding, eventual consistency |
| Comment Store | Spanner + Bigtable | Threaded comments, votes, moderation state |
| CDN Tier-1 (Edge) | Google Edge Cache at ISP PoP | Last-mile segment delivery, 90%+ hit rate |
| CDN Tier-2 (Regional) | Google PoP (~180 global) | Region-level cache, reduces origin load |
| Search | Internal inverted index (Google Search infra) | Full-text search, autocomplete, query intent |
| Recommendation | Two-tower deep neural network (DNN) | Candidate retrieval + ranking at < 200 ms |
5. Deep Dive #
5.1 Upload Pipeline — Resumable Chunked Upload #
Large uploads (multi-GB raw files) fail frequently on mobile. YouTube uses a resumable upload protocol: the client first POSTs metadata to obtain an upload session URI, then streams 5 MB chunks with byte-range headers. If the connection drops, the client queries the upload service for the last confirmed byte offset and resumes from there. Chunks are written to the raw GCS object at their byte offsets as they arrive; only when the final chunk is acknowledged does the upload service mark the object complete and emit a VideoUploaded event.
// Simplified resumable upload session handler
public record UploadSession(
    String sessionId,
    String videoId,
    String rawGcsPath,
    long totalBytes,
    long confirmedBytes,
    Instant expiresAt
) {}

@PutMapping("/upload/{sessionId}")
public ResponseEntity<Void> uploadChunk(
    @PathVariable String sessionId,
    @RequestHeader("Content-Range") String contentRange,
    InputStream body) {

  UploadSession session = sessionStore.get(sessionId)
      .orElseThrow(() -> new SessionNotFoundException(sessionId));

  var range = ContentRange.parse(contentRange); // e.g. bytes 0-5242879/209715200

  if (range.start() != session.confirmedBytes()) {
    // Out-of-order or duplicate chunk: report the last confirmed offset so the client resumes there.
    // (If confirmedBytes == 0, a production handler would omit the Range header entirely.)
    return ResponseEntity.status(308) // 308 Resume Incomplete
        .header("Range", "bytes=0-" + (session.confirmedBytes() - 1))
        .build();
  }

  // Append this chunk at its byte offset in the raw object, then advance the confirmed watermark.
  gcsClient.writeChunk(session.rawGcsPath(), range.start(), body, range.length());
  long updated = session.confirmedBytes() + range.length();
  sessionStore.updateConfirmed(sessionId, updated);

  if (updated >= session.totalBytes()) {
    // Final chunk acknowledged: kick off the transcode pipeline.
    pubSub.publish("video-uploaded", new VideoUploadedEvent(session.videoId()));
    return ResponseEntity.ok().build();
  }
  return ResponseEntity.status(308)
      .header("Range", "bytes=0-" + (updated - 1))
      .build();
}

5.2 Transcoding DAG #
Each upload triggers a transcoding DAG (Directed Acyclic Graph) with the following stages; validation and packaging are sequential, while the ~10 per-format encodes fan out in parallel:
- Demux & validate — container inspection (MP4/MOV/MKV), codec detection, duration, resolution.
- Per-format encode — for each of ~10 output profiles (360p/AVC, 480p/AVC, 720p/VP9, 1080p/VP9, 1080p/AV1, 1440p/AV1, 2160p/AV1 + audio variants), FFmpeg runs on a dedicated preemptible VM.
- Segment & package — output is split into 2-second DASH segments and 6-second HLS segments; manifests (MPD and M3U8) are generated.
- Pre-warm CDN — encoded segments for the lowest bitrates (360p, 480p) are immediately pushed to Tier-2 regional caches so the video is playable before 4K encoding finishes.
YouTube’s internal transcoder was later extended with custom ASICs (Application-Specific Integrated Circuits) to encode AV1 — a compute-heavy codec — at 10× lower cost per video-minute than GPU-based FFmpeg.
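A minimal sketch of the per-format fan-out on a transcode worker, assuming an ffmpeg binary on the PATH; the profile list, output naming, and thread-pool sizing are illustrative, and the real DAG adds demux/validation, packaging, and retry stages around this step:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Fan-out: one ffmpeg process per output profile, run in parallel (profiles are illustrative).
public final class TranscodeFanOut {

    record Profile(String name, String codec, String scale, String bitrate, String ext) {}

    static final List<Profile> PROFILES = List.of(
        new Profile("360p-avc",  "libx264",    "640x360",   "700k",  "mp4"),
        new Profile("720p-vp9",  "libvpx-vp9", "1280x720",  "1800k", "webm"),
        new Profile("1080p-av1", "libaom-av1", "1920x1080", "2500k", "webm"));

    public static void main(String[] args) throws Exception {
        String input = args[0]; // local copy of the raw upload
        ExecutorService pool = Executors.newFixedThreadPool(PROFILES.size());
        try {
            List<Callable<Integer>> jobs = PROFILES.stream()
                .map(p -> (Callable<Integer>) () -> encode(input, p))
                .toList();
            for (Future<Integer> result : pool.invokeAll(jobs)) {
                if (result.get() != 0) {
                    throw new IllegalStateException("encode failed; nack the Pub/Sub job for retry");
                }
            }
            // All profiles done: segment + package (DASH/HLS), then publish the transcode-complete event.
        } finally {
            pool.shutdown();
        }
    }

    static int encode(String input, Profile p) throws Exception {
        Process proc = new ProcessBuilder(
                "ffmpeg", "-y", "-i", input,
                "-c:v", p.codec(), "-s", p.scale(), "-b:v", p.bitrate(),
                input + "." + p.name() + "." + p.ext())
            .inheritIO()
            .start();
        return proc.waitFor();
    }
}
```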
5.3 Adaptive Bitrate Streaming #
The DASH player measures available throughput every 2 seconds. It maintains a buffer goal (e.g., 30 seconds ahead). If the buffer is healthy, it requests a higher bitrate segment next; if bandwidth drops, it switches to a lower bitrate without stalling. This bitrate decision lives entirely on the client — the server just needs to serve segments fast.
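The decision itself is small. A simplified buffer-and-throughput heuristic; the bitrate ladder, safety factor, and buffer threshold below are illustrative (production players such as dash.js or Shaka use richer algorithms):

```java
// Pick the next segment's bitrate from measured throughput and current buffer level.
public final class AbrSelector {
    private static final int[] LADDER_KBPS = {400, 1_000, 2_500, 5_000, 12_000}; // 360p..4K (illustrative)
    private static final double SAFETY = 0.7;          // trust only ~70% of the measured throughput
    private static final double MIN_BUFFER_S = 10.0;   // below this, protect against a stall

    public static int nextBitrateKbps(double measuredKbps, double bufferSeconds) {
        if (bufferSeconds < MIN_BUFFER_S) {
            return LADDER_KBPS[0];                       // stall avoidance beats quality
        }
        int chosen = LADDER_KBPS[0];
        for (int rung : LADDER_KBPS) {
            if (rung <= measuredKbps * SAFETY) {
                chosen = rung;                           // highest rung that fits the budget
            }
        }
        return chosen;
    }
}
```

With a 30-second buffer and ~6 Mbps measured throughput this picks 2,500 kbps rather than 5,000, because only 70% of the estimate is trusted.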
The manifest API returns a signed URL per segment, valid for 1 hour, pointing at the nearest Tier-1 ISP cache. Segment files are immutable (content-addressed by video ID + format + timestamp), making CDN caching trivial.
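A sketch of how the Manifest API might mint short-TTL segment URLs. This uses a generic HMAC token rather than GCS's actual signed-URL format; the host, path layout, and TTL are assumptions, and the CDN edge would validate the same HMAC before serving:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.HexFormat;

// Mint a short-TTL, CDN-prefixed segment URL (generic HMAC scheme, not GCS's real signed-URL format).
public final class SegmentUrlSigner {
    private final byte[] secret;
    private final String cdnHost;   // hypothetical edge hostname handed out per region

    public SegmentUrlSigner(byte[] secret, String cdnHost) {
        this.secret = secret;
        this.cdnHost = cdnHost;
    }

    public String sign(String videoId, String format, int segmentIndex) throws Exception {
        String path = "/videoplayback/" + videoId + "/" + format + "/seg-" + segmentIndex + ".m4s";
        long expires = Instant.now().plusSeconds(3600).getEpochSecond();   // 1-hour TTL
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        String sig = HexFormat.of().formatHex(
            mac.doFinal((path + expires).getBytes(StandardCharsets.UTF_8)));
        return cdnHost + path + "?expires=" + expires + "&sig=" + sig;
    }
}
```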
5.4 View Count at Scale #
Naive view count increments against a single counter row would saturate any database. YouTube shards the counter:
- Write path: each view event is published to a Bigtable row keyed by `videoId#shardId` (100 shards per video). Writers round-robin across shards (sketched after this list).
- Read path (approximate): a background job periodically sums the 100 shard rows and writes the aggregate to a separate `videoId#total` row. Display queries read the aggregate; real-time shard reads are only done for fraud detection.
- Write amplification cap: heavily viral videos may spike to millions of writes/s. Rate-limited batching at the app layer collapses nearby events into one Bigtable increment per 100 ms window per shard.
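A minimal in-memory sketch of the shard-and-aggregate pattern; the map stands in for Bigtable rows, and the shard count matches the 100-shard key design above:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Sharded view counter: writers round-robin across 100 shard rows; a periodic job folds them into #total.
// The map stands in for Bigtable rows (rowKey -> counter); production code issues Bigtable increments instead.
public final class ShardedViewCounter {
    private static final int SHARDS = 100;
    private final ConcurrentHashMap<String, LongAdder> rows = new ConcurrentHashMap<>();
    private final AtomicLong spread = new AtomicLong();

    public void recordView(String videoId) {
        int shard = (int) (spread.getAndIncrement() % SHARDS);           // round-robin shard pick
        rows.computeIfAbsent(videoId + "#shard" + shard, k -> new LongAdder()).increment();
    }

    // Periodic aggregation job: sum the shard rows and publish the approximate total for display reads.
    public long aggregate(String videoId) {
        long total = 0;
        for (int shard = 0; shard < SHARDS; shard++) {
            LongAdder row = rows.get(videoId + "#shard" + shard);
            if (row != null) {
                total += row.sum();
            }
        }
        LongAdder aggregateRow = new LongAdder();
        aggregateRow.add(total);
        rows.put(videoId + "#total", aggregateRow);                      // display queries read this row
        return total;
    }
}
```

Swapping the map for Bigtable increments keeps the same key scheme; the nightly reconciliation mentioned in section 7.4 can replay raw events rather than trusting the approximate row.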
6. Data Model #
Video Metadata (Cloud Spanner) #
| Column | Type | Notes |
|---|---|---|
| `video_id` | STRING(22) | Base64url random, PK |
| `channel_id` | STRING(22) | FK → channel table |
| `title` | STRING(200) | Full-text indexed externally |
| `description` | STRING(5000) | |
| `status` | ENUM | UPLOADING / PROCESSING / LIVE / DELETED |
| `duration_s` | INT64 | Seconds |
| `formats_ready` | `ARRAY<STRING>` | e.g. ["360p","720p"], updated per transcode job |
| `thumbnail_url` | STRING(500) | GCS URL |
| `published_at` | TIMESTAMP | Index for feed freshness |
| `view_count_approx` | INT64 | Updated by batch aggregation |
| `tags` | `ARRAY<STRING>` | Propagated to search index |
Indexes: (channel_id, published_at DESC) for channel feed; published_at DESC for trending; status for ops dashboards.
Comment Thread (Spanner) #
| Column | Type | Notes |
|---|---|---|
| `comment_id` | STRING(22) | PK |
| `video_id` | STRING(22) | FK, interleaved in video parent table |
| `parent_comment_id` | STRING(22) | NULL for top-level |
| `author_id` | STRING(22) | |
| `body` | STRING(10000) | |
| `like_count` | INT64 | |
| `created_at` | TIMESTAMP | |
| `moderation_state` | ENUM | PENDING / APPROVED / REMOVED |
Interleaving on video_id means Spanner co-locates all comments for a video on the same split, making “load top comments for video” a single-range scan.
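A sketch of the "top comments" read with the Spanner Java client; the table and column names follow the model above, and the exact query shape is an assumption:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;

// Top-comments read: a single-range scan because comments are interleaved under the video row.
public final class CommentReader {
    public static void printTopComments(DatabaseClient db, String videoId) {
        Statement stmt = Statement.newBuilder(
                "SELECT comment_id, body, like_count FROM comments "
                    + "WHERE video_id = @videoId AND moderation_state = 'APPROVED' "
                    + "ORDER BY like_count DESC LIMIT 20")
            .bind("videoId").to(videoId)
            .build();
        try (ResultSet rs = db.singleUse().executeQuery(stmt)) {   // a bounded-staleness read also works here
            while (rs.next()) {
                System.out.printf("%s (%d likes)%n",
                    rs.getString("comment_id"), rs.getLong("like_count"));
            }
        }
    }
}
```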
View Count (Bigtable) #
| Row key | Column family | Qualifier | Value |
|---|---|---|---|
| `videoId#shardN` | `cnt` | `v` | INT64 delta |
| `videoId#total` | `cnt` | `v` | INT64 aggregate |
7. Trade-offs #
7.1 DASH vs HLS Delivery #
| | DASH | HLS |
|---|---|---|
| Standard | ISO MPEG-DASH (open) | Apple-controlled (specified in RFC 8216) |
| DRM (Digital Rights Management) | Widevine / PlayReady natively | FairPlay (Apple), SAMPLE-AES |
| Browser support | Chrome, Firefox, Edge (via MSE) | Safari native, others via JS |
| Segment size | 2 s (lower latency) | 6 s (safer for CDN caching) |
Conclusion: YouTube serves both formats. DASH for Android/Chrome/desktop; HLS for iOS/Safari. The transcoder produces both from the same encoded segments.
7.2 AV1 vs VP9 vs H.264 #
| Codec | Compression gain over H.264 | Encode cost | Decode support |
|---|---|---|---|
| H.264 / AVC | — (baseline) | 1× | Universal |
| VP9 | ~30% better | 5× | Chrome, Android, smart TVs |
| AV1 | ~50% better | 20× (SW) / 2× (ASIC) | Chrome, Android 10+, some TVs |
Conclusion: YouTube encodes all three. H.264 is the safe fallback; VP9 is the default for desktop/Android; AV1 is rolled out where decode support exists and saves ~50% bandwidth (which at 2 Pbps is enormous). The custom AV1 ASIC investment was justified by the CDN bandwidth savings alone.
7.3 Push vs Pull Pre-warm on CDN #
| Strategy | Latency after publish | Origin load | Storage waste |
|---|---|---|---|
| Push (pre-warm all formats) | Near-zero | Low | High — most videos get < 100 views |
| Pull (lazy on first request) | First viewer pays cache-fill latency | Spiky | None |
| Hybrid (push lowest bitrates only) | Near-zero for 360p/480p | Moderate | Low |
Conclusion: YouTube pre-warms only the two lowest bitrate variants immediately after encode. Higher bitrates are pulled on demand and cached at Tier-2. The vast majority of views happen on popular videos that will fill Tier-1 quickly via pull.
7.4 Eventual vs Strong Consistency for View Counts #
Strong consistency on a global counter requires distributed transactions — latency in the hundreds of milliseconds. YouTube tolerates approximate counts (±0.5%) in exchange for < 5 ms write latency. The exact count is reconciled nightly for monetisation payouts where accuracy matters.
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Transcode Worker | VM preempted mid-job | Video stuck in PROCESSING | Pub/Sub nack → retry with exponential backoff; dead-letter queue (DLQ) after 5 attempts; ops alert on DLQ depth |
| CDN Tier-1 node | ISP cache node goes down | Viewers in that ISP fall back to Tier-2; latency spike | Anycast routing automatically fails over to nearest healthy Tier-2 PoP; no creator action needed |
| View Count Bigtable | Hot partition on viral video | Write latency spike, potential throttling | 100-shard key design caps per-shard write rate; Bigtable auto-splits hot tablets |
| Recommendation Service | Model serving latency spike | Homepage shows stale/generic recs | Circuit breaker returns pre-computed top-N popular videos as fallback within 50 ms SLA (sketched after this table) |
| Metadata Spanner | Regional outage | Publish/update writes fail; reads degrade | Spanner multi-region config; reads from any replica; writes wait for quorum (< 100 ms added latency in normal operation) |
| Upload Service | Thundering herd — viral event causes upload spike | Upload latency, transcode queue depth grows | Pub/Sub backpressure; transcode fleet autoscales on queue depth metric; priority queue for monetised channels |
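The recommendation fallback row above can be sketched as a timeout-with-fallback; a full circuit breaker would additionally stop calling the unhealthy dependency for a cooldown period. The 50 ms budget comes from the table; the client interface and the source of the popular list are assumptions:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Serve personalised recs, but never block the homepage: fall back to popular videos after 50 ms.
public final class HomepageRecs {
    private final RecClient recClient;              // remote ranking service (may be slow under incident)
    private final List<String> popularFallback;     // refreshed periodically by a batch job

    public HomepageRecs(RecClient recClient, List<String> popularFallback) {
        this.recClient = recClient;
        this.popularFallback = popularFallback;
    }

    public List<String> recsFor(String userId) {
        return CompletableFuture
            .supplyAsync(() -> recClient.rank(userId))
            .completeOnTimeout(popularFallback, 50, TimeUnit.MILLISECONDS)  // latency spike -> generic recs
            .exceptionally(t -> popularFallback)                            // errors also degrade gracefully
            .join();
    }

    public interface RecClient {
        List<String> rank(String userId);
    }
}
```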
9. Security & Compliance #
AuthN/AuthZ (Authentication/Authorisation):
- Upload requires OAuth 2.0 bearer token (Google Identity); creator’s channel_id is embedded in the session and validated on every chunk.
- Private and unlisted videos: manifest API checks ACL (Access Control List) before returning segment URLs; signed URLs expire in 1 hour.
Encryption:
- In transit: TLS (Transport Layer Security) 1.3 for all client-facing connections; internal GCS traffic uses Google-managed encryption.
- At rest: GCS server-side AES (Advanced Encryption Standard)-256 encryption; CMEK (Customer-Managed Encryption Keys) available for enterprise.
Content Safety:
- Every upload is scanned by the Content Safety API (CSAM hash matching, known-bad signature DB) before the `VideoUploaded` event fires.
- A machine-learning classifier runs asynchronously and may suppress public visibility pending human review.
Input Validation:
- File type validation (magic bytes, not just extension) before raw write to GCS.
- Metadata fields (title, description, tags) are HTML-escaped and length-capped server-side.
- Rate limiting per channel_id: 50 uploads/hour, enforced at the Upload Service via Redis sliding window.
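A sketch of the per-channel sliding-window check described above, assuming a Redis sorted set per channel and the Jedis 4.x client; key naming and the non-atomic check-then-add are simplifications (a Lua script or MULTI would close that race):

```java
import java.util.UUID;
import redis.clients.jedis.JedisPooled;

// 50-uploads-per-hour sliding window per channel, backed by a Redis sorted set of timestamps.
public final class UploadRateLimiter {
    private static final long WINDOW_MS = 3_600_000;   // 1 hour
    private static final int LIMIT = 50;
    private final JedisPooled redis;

    public UploadRateLimiter(JedisPooled redis) {
        this.redis = redis;
    }

    public boolean tryAcquire(String channelId) {
        String key = "upload_rl:" + channelId;
        long now = System.currentTimeMillis();
        redis.zremrangeByScore(key, 0, now - WINDOW_MS);         // evict entries outside the window
        if (redis.zcard(key) >= LIMIT) {
            return false;                                        // over quota: reject with 429
        }
        redis.zadd(key, now, now + "-" + UUID.randomUUID());     // record this upload attempt
        redis.expire(key, WINDOW_MS / 1000);                     // let idle keys age out
        return true;
    }
}
```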
GDPR (General Data Protection Regulation) / Privacy:
- Right-to-erasure: deleting a channel triggers an async job that removes the video data from GCS, purges CDN caches, and tombstones the Spanner row. Comment bodies are replaced with `[deleted]` within 30 days.
- PII (Personally Identifiable Information) in comments is scanned with NLP; comments may be redacted under GDPR Article 17 requests.
Audit Trail:
- All upload, publish, and deletion events are appended to an immutable audit log (BigQuery streaming insert with `insertId` deduplication), retained 7 years for compliance.
10. Observability #
RED Metrics (Rate, Errors, Duration):
| Signal | Metric | Alert Threshold |
|---|---|---|
| Upload Rate | `upload_sessions_started` / min | Drop > 20% from baseline → PagerDuty |
| Transcode Error Rate | `transcode_jobs_failed` / total | > 0.5% → P2 alert |
| Transcode Duration (p95) | `transcode_duration_p95_s` | > 600 s for 1080p → P2 |
| Playback Start Rate | `playback_success` / total_attempts | < 99% → P1 alert |
| CDN Hit Rate | `cache_hits` / `total_segment_requests` | < 90% → investigate origin load |
| Recommendation Latency (p99) | `rec_api_latency_p99_ms` | > 250 ms → scale recommendation pods |
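As an illustration, the transcode signals above could be emitted with Micrometer; the metric names mirror the table, while the tag set and wiring are assumptions:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;

// Emit the transcode RED signals: one timer with percentiles, one failure counter.
public final class TranscodeMetrics {
    private final Timer duration;
    private final Counter failures;

    public TranscodeMetrics(MeterRegistry registry) {
        this.duration = Timer.builder("transcode_duration")
            .publishPercentiles(0.5, 0.95, 0.99)       // backs the transcode_duration_p95_s alert
            .tag("profile", "1080p")
            .register(registry);
        this.failures = Counter.builder("transcode_jobs_failed")
            .tag("profile", "1080p")
            .register(registry);
    }

    public void record(Duration jobDuration, boolean succeeded) {
        duration.record(jobDuration);
        if (!succeeded) {
            failures.increment();
        }
    }
}
```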
Saturation Metrics:
- Transcode queue depth per priority tier (target: < 1,000 pending jobs)
- Bigtable tablet CPU utilisation (target: < 70%)
- CDN Tier-1 egress utilisation per ISP node (target: < 80%)
Business Metrics:
- Creator upload-to-live latency (p50 / p95 / p99) — creator satisfaction KPI (Key Performance Indicator)
- Viewer rebuffering ratio per region per device type
- Recommendation CTR (Click-Through Rate) and watch time per served impression
Tracing:
- Distributed trace spans (OpenTelemetry) propagated through Upload → Pub/Sub → Transcode Worker → CDN Pre-warm chain. Sampled at 1% in production; 100% for failed jobs.
- Jaeger / Cloud Trace retains traces for 15 days; critical paths are instrumented with structured logging correlated by `trace_id`.
11. Scaling Path #
Phase 1 — MVP (< 1K uploads/day, < 100K views/day)
- Single upload service, FFmpeg on a few VMs, PostgreSQL for metadata, single-region object storage.
- No CDN — serve video directly from object storage signed URLs.
- What breaks first: FFmpeg queue depth grows linearly with upload rate; manual scaling needed above ~50 concurrent uploads.
Phase 2 — 10K uploads/day, 10M views/day
- Introduce Pub/Sub for transcode job fan-out; autoscale transcode fleet on queue depth.
- Add a CDN in front of object storage to absorb read traffic; cache hit rate reduces origin egress 80%+.
- Migrate metadata to Spanner for global consistency and scale.
- What breaks first: view counts on popular videos hammer a single row → introduce Bigtable shard counter.
Phase 3 — 100K uploads/day, 100M views/day
- Two-tier CDN (ISP PoP caches + regional PoP); pre-warm popular/trending videos.
- Introduce VP9 encoding; AV1 R&D begins.
- Recommendation model replaces heuristic popularity-based ranking; candidate retrieval separated from ranking for latency.
- What breaks first: search index freshness degrades under ingestion load → dedicate search indexing pipeline with 5-min SLO.
Phase 4 — 500+ hours/min uploads, 1B+ daily views (current scale)
- Custom AV1 ASICs cut per-video-minute encode cost roughly 10× versus software encoding, bringing AV1 down to ~2× the cost of H.264.
- Tens of thousands of ISP PoP nodes globally; Anycast routing for lowest RTT (Round-Trip Time).
- Recommendation model: multi-task DNN (Deep Neural Network) trained on watch time, click-through, satisfaction survey signals.
- Structured concurrency for upload session management (Java 21 virtual threads handle millions of simultaneous upload sessions cheaply; sketched after this list).
- What breaks first: global metadata consistency latency at Spanner becomes visible for creator dashboards → introduce read caching layer with 30-second TTL (Time-To-Live) for analytics queries.
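The virtual-thread point above is cheap to sketch (Java 21); the per-session processing body is a placeholder:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One virtual thread per in-flight upload session: blocking on slow mobile networks is cheap (Java 21).
public final class UploadSessionRunner {
    private final ExecutorService sessions = Executors.newVirtualThreadPerTaskExecutor();

    public void handle(String sessionId) {
        sessions.submit(() -> {
            // Placeholder for the real per-session loop: read chunk, write to GCS, advance the watermark.
            // Millions of these can be parked concurrently without exhausting platform threads.
            processChunksBlocking(sessionId);
        });
    }

    private void processChunksBlocking(String sessionId) {
        // blocking I/O here is fine on a virtual thread
    }
}
```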
12. Enterprise Considerations #
Brownfield Integration: An enterprise deploying a private YouTube-like platform (internal training videos, compliance recordings) would integrate with existing IAM (Identity and Access Management) providers (Okta, Azure AD) via SAML (Security Assertion Markup Language) 2.0 or OIDC (OpenID Connect). The video metadata store needs to feed existing search (Elasticsearch) and DMS (Document Management Systems) for compliance discovery.
Build vs Buy:
| Component | Build | Buy |
|---|---|---|
| Transcoding | FFmpeg (open-source) + custom orchestration | AWS Elemental MediaConvert, Mux |
| CDN | Internal ISP PoP nodes (YouTube scale only) | Cloudflare, Fastly, Akamai |
| Object Storage | GCS / S3 (effectively buy) | — |
| Recommendation | Build (core IP) | AWS Personalize, Vertex AI |
| Search | Build for YouTube scale | Elasticsearch for < 100M videos |
Multi-Tenancy:
- Namespace all storage paths by `tenantId/videoId` for data isolation.
- Separate transcode queues per tenant tier (premium channels get a priority queue).
- Rate limits and storage quotas enforced per `channelId` at the API gateway.
TCO (Total Cost of Ownership) Ballpark: At YouTube’s scale, CDN egress dominates cost. Moving from VP9 to AV1 saves ~$500M/year in CDN bandwidth cost at 2 Pbps egress and $0.01/GB rates (order-of-magnitude estimate). Custom ASIC investment of ~$50M amortised over 5 years pays back in < 3 months of bandwidth savings.
Conway’s Law: YouTube’s separate teams for Upload, Transcode, CDN, Recommendation, and Search directly mirror the service decomposition. The upload → transcode → CDN pre-warm boundary is an organisational boundary as much as an architectural one.
13. Interview Tips #
- Start with the upload pipeline: Most interviewers care more about “how does video get processed” than the CDN. Walk through chunked upload → Pub/Sub → transcoding DAG → segment storage → CDN pre-warm in the first 10 minutes.
- Clarify encoding requirements early: Ask “do we need multi-resolution support?” — this drives the transcoding DAG complexity. If yes, explain parallel per-format encoding as a fan-out pattern.
- Call out the view count hot partition problem: This is a canonical high-write counter problem. Mention counter sharding (N shards per video, periodic aggregation) before the interviewer brings it up.
- Recommendation system depth: If asked to go deep, describe the two-tower architecture — a user embedding tower and a video embedding tower trained jointly, ANN (Approximate Nearest Neighbours) retrieval for candidate generation, then a ranking model. Mention that retrieval and ranking are split for latency budget reasons.
- Common mistake: Designing the streaming path to return raw GCS URLs from the metadata DB. The Manifest API should return CDN-prefixed, signed, short-TTL URLs so clients never bypass the cache layer.
- Fluency vocabulary: Use “DASH manifest”, “ABR switching”, “transcode DAG”, “CDN pre-warm”, “counter shard”, “DLQ (Dead-Letter Queue)”, “ANN retrieval”, “two-tower DNN”.
14. Further Reading #
- Paper: Covington et al., Deep Neural Networks for YouTube Recommendations (RecSys 2016) — the canonical candidate-generation + ranking paper, still the foundation of modern video recommendations.
- Engineering Blog: YouTube Engineering — AV1 at Scale — covers the ASIC investment and encoding cost savings.
- RFC: RFC 8216 — HTTP Live Streaming (HLS) specification, covering segment format, manifest structure, and encryption.