1. Hook #
In 2011, Dropbox engineers discovered that roughly 70% of all uploaded data was already on their servers — users syncing the same PDFs, stock photos, and installer packages. Switching from file-level to block-level deduplication immediately cut bandwidth costs by more than two-thirds. That insight defines the whole discipline of cloud file sync: the hard problems are not storage capacity or even bandwidth, but delta detection, deduplication, conflict resolution, and consistency across an arbitrarily large fleet of devices. Google Drive went further, embedding a collaborative editing layer (Docs, Sheets, Slides) on top of the same blob store. Today both systems handle hundreds of millions of users, billions of files, and near-real-time sync across mobile, desktop, and web clients — often over flaky connections.
2. Problem Statement #
Functional Requirements #
- Users can upload, download, update, delete, rename, and move files/folders.
- Changes on one device are synced to all other devices of the same user within seconds.
- Files can be shared with other users with configurable permissions (viewer / editor / owner).
- A full version history is maintained; users can roll back to any prior version.
- Large files upload efficiently even when the connection drops mid-transfer.
Non-Functional Requirements #
| Attribute | Target |
|---|---|
| Sync latency (small file, good connection) | < 5 s end-to-end |
| Upload resumability | Resume from last committed chunk after reconnect |
| Storage efficiency | Block-level deduplication across all users |
| Availability | 99.99% (< 53 min downtime/year) |
| Durability | 99.999999999% (11 nines) via multi-region replication |
| Concurrent editor conflict handling | Conflicted-copy fork for files edited offline; OT (Operational Transformation) for Google Docs |
Out of Scope #
- Real-time collaborative editing internals (Google Docs OT engine)
- Mobile-specific delta-sync protocols (rsync-over-cellular optimisations)
- Virus scanning and DLP (Data Loss Prevention) pipelines
- Billing and quota enforcement
3. Scale Estimation #
Assumptions:
- 500M registered users; 50M Daily Active Users (DAU).
- Average storage per user: 15 GB → total corpus ~7.5 EB.
- Average file size: 1 MB; average daily churn: 5 files updated per DAU.
- Block size: 4 MB; deduplication hit rate: 60% (blocks already stored).
- Metadata reads heavily outweigh writes (read:write ≈ 10:1 on metadata layer).
| Metric | Calculation | Result |
|---|---|---|
| Upload QPS (Queries Per Second) | 50M DAU × 5 files / 86 400 s | ~2 900 QPS |
| Unique blocks written/day | 2 900 QPS × 40% unique × 1 block/file avg × 86 400 s | ~100 M blocks/day |
| New block data/day | 100 M × 4 MB | ~400 TB/day raw ingest |
| Metadata reads | 2 900 × 10 | ~29 000 QPS |
| Storage corpus (with 3× replication) | 7.5 EB × 3 | ~22.5 EB |
| Bandwidth (uploads, post-dedup) | 400 TB / 86 400 s | ~37 Gbps sustained |
| Cache size (hot metadata, 20% of records) | 500 M files × 20% × 2 KB/record | ~200 GB |
At a 60% dedup hit rate, block-level dedup saves roughly 600 TB/day of storage and bandwidth compared with naive whole-file uploads.
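For readers who want to sanity-check the table, here is a small back-of-envelope calculation using the same assumptions; the class name and rounding are purely illustrative.

```java
/** Back-of-envelope check of the sizing table above (same assumptions; values rounded). */
class ScaleEstimate {
    public static void main(String[] args) {
        double dau = 50e6, changesPerDay = 5, secondsPerDay = 86_400;
        double uploadQps = dau * changesPerDay / secondsPerDay;               // ~2 900 QPS
        double uniqueBlocksPerDay = uploadQps * 0.40 * secondsPerDay;         // ~100 M blocks/day (40% unique)
        double newDataPerDayTb = uniqueBlocksPerDay * 4 /* MB */ / 1e6;       // ~400 TB/day raw ingest
        double ingestGbps = newDataPerDayTb * 1e12 * 8 / secondsPerDay / 1e9; // ~37 Gbps sustained
        double corpusEb = 500e6 * 15 /* GB */ / 1e9;                          // ~7.5 EB before replication
        System.out.printf("QPS=%.0f, blocks/day=%.0fM, ingest=%.0f TB/day (%.0f Gbps), corpus=%.1f EB%n",
                uploadQps, uniqueBlocksPerDay / 1e6, newDataPerDayTb, ingestGbps, corpusEb);
    }
}
```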
4. High-Level Design #
The system separates metadata (file names, folder hierarchy, permissions, versions) from blob storage (the actual file bytes, stored as content-addressable blocks).
```mermaid
flowchart TD
    subgraph Client["Desktop / Mobile Client"]
        W[Watcher / File System Events]
        CH[Chunker & Hasher]
        SQ[Sync Queue]
    end
    subgraph Control["Control Plane"]
        MS[Metadata Service]
        NQ[Notification Service\nlong-poll / WebSocket]
    end
    subgraph Data["Data Plane"]
        BL[Block Upload API\nPre-signed URLs]
        BS[(Block Store\nS3 / GCS — content-addressed)]
        CD[CDN Edge\nfor downloads]
    end
    subgraph Meta["Metadata Store"]
        PG[(PostgreSQL / Spanner\nfile tree + versions)]
        RC[(Redis\nhot metadata cache)]
    end
    W -->|file changed| CH
    CH -->|block hashes| SQ
    SQ -->|check-then-commit| MS
    MS -->|blocks needed| BL
    BL -->|PUT blocks| BS
    MS -->|commit version| PG
    PG -->|cache warm| RC
    MS -->|push notification| NQ
    NQ -->|wake up peers| Client
    BS -->|download| CD
    CD -->|deliver blocks| Client
```
Write path: Client detects a file change → chunks and hashes → asks Metadata Service which blocks are missing → uploads only missing blocks to Block Store via pre-signed URL → commits the new file version atomically → Notification Service pushes a delta to all other devices of that user.
Read / sync path: Peer device receives notification → fetches updated metadata → downloads missing blocks from CDN edge (or Block Store on cache miss) → reassembles file locally.
Component Roles #
| Component | Responsibility | Key Choice |
|---|---|---|
| Client Chunker | Split file into fixed-size (4 MB) or variable-size (CDC) blocks; compute SHA-256 per block | Content-Defined Chunking (CDC) gives better dedup on insertions |
| Metadata Service | File tree, version chain, sharing ACLs, block manifest per version | Strong consistency (Spanner / CockroachDB) for conflict-free commits |
| Block Store | Immutable, content-addressed blob store; never mutates a block | S3 / GCS with key = SHA-256 of block content |
| Notification Service | Fan-out change events to all connected devices of a user | Long-poll or Server-Sent Events (SSE); WebSocket for mobile |
| CDN Edge | Cache popular / recently-accessed blocks close to users | CloudFront / Fastly; cache key = block SHA-256 (immutable, infinite TTL) |
5. Deep Dive — Critical Components #
5a. Chunking & Deduplication #
The client splits each file into blocks. Fixed-size chunking (4 MB) is simple but fragile: inserting a byte near the start shifts all subsequent block boundaries, destroying the dedup hit. Content-Defined Chunking (CDC) using a rolling hash (Rabin fingerprint / Gear hash) finds natural split points, so an insertion only affects nearby blocks. Typical CDC parameters are an average chunk size of ~4 MB with a minimum of 512 KB and a maximum of 8 MB; Dropbox's block protocol likewise operates on roughly 4 MB blocks.
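A minimal sketch of a Gear-hash chunker under those parameters (512 KB minimum, ~4 MB average via a 22-bit boundary mask, 8 MB maximum). The `GearChunker` name and fixed random gear table are illustrative, and a real client would stream the file rather than hold it in memory.

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HexFormat;
import java.util.List;
import java.util.Random;

/** Sketch of Gear-hash Content-Defined Chunking with per-block SHA-256 keys. */
class GearChunker {
    private static final long[] GEAR = new Random(42).longs(256).toArray(); // fixed random table
    private static final int MIN = 512 * 1024, MAX = 8 * 1024 * 1024;
    private static final int AVG_BITS = 22;                 // ~4 MB expected chunk size
    private static final long MASK = (1L << AVG_BITS) - 1;  // boundary when low 22 bits are zero

    static List<String> chunkHashes(byte[] data) throws Exception {
        List<String> hashes = new ArrayList<>();
        int start = 0;
        long h = 0;
        for (int i = 0; i < data.length; i++) {
            h = (h << 1) + GEAR[data[i] & 0xFF];            // rolling Gear hash
            int len = i - start + 1;
            boolean cut = (len >= MIN && (h & MASK) == 0) || len >= MAX || i == data.length - 1;
            if (cut) {
                byte[] block = Arrays.copyOfRange(data, start, i + 1);
                byte[] digest = MessageDigest.getInstance("SHA-256").digest(block);
                hashes.add(HexFormat.of().formatHex(digest)); // content-addressed storage key
                start = i + 1;
                h = 0;
            }
        }
        return hashes;
    }
}
```

Because a boundary depends only on the bytes near it, inserting data in the middle of a file changes at most the surrounding chunks, which is exactly the dedup property the prose describes.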
Each block’s storage key is SHA-256(content). Before uploading, the client sends the full list of block hashes to the Metadata Service. The service responds with only the hashes it has not seen before — the client uploads only those blocks. This is server-side deduplication at the block level, and it works across all users (if two users upload the same ISO image, only one copy is stored).
5b. Block Upload with Pre-signed URLs #
Routing large binary blobs through the Metadata Service would waste application-tier resources. Instead:
- Client POSTs a list of missing block hashes to `POST /v1/blocks/check`.
- Metadata Service returns a list of pre-signed PUT URLs (one per missing block) valid for 15 minutes.
- Client PUTs each block directly to S3/GCS — no application server in the hot path.
- Client POSTs a commit request to `POST /v1/files/{id}/versions` with the full block manifest.
- Metadata Service validates all blocks exist in the store, then atomically writes the new version row.
```java
import java.time.Instant;
import java.util.List;

record BlockManifest(String fileId, long parentVersionId, List<String> blockSha256s) {}
record CommitRequest(BlockManifest manifest, String clientDeviceId, Instant clientMtime) {}

// Pseudocode — Metadata Service commit logic (repositories and services assumed injected)
@Transactional
public FileVersion commit(CommitRequest req) {
    // Refuse to commit unless every referenced block is already in the block store.
    var missing = blockStore.findMissing(req.manifest().blockSha256s());
    if (!missing.isEmpty()) throw new BlocksNotUploadedException(missing);

    // Allocate the next per-file version number and persist the new version row.
    long newVersion = versionRepo.nextVersion(req.manifest().fileId());
    var version = new FileVersion(
        req.manifest().fileId(), newVersion,
        req.manifest().blockSha256s(),
        req.clientMtime(), Instant.now()
    );
    versionRepo.save(version);

    // Notify the user's other devices that a new version exists.
    notificationService.fanOut(req.manifest().fileId(), newVersion);
    return version;
}
```

5c. Sync Engine — Change Detection #
The desktop client runs a file-system watcher (FSEvents on macOS, inotify on Linux, ReadDirectoryChangesW on Windows). On change:
- Compute SHA-256 of the changed file (stream-hash, never load whole file in RAM).
- Compare against locally cached hash. If equal, skip (spurious event).
- Chunk the file; compare block hashes against the cached manifest.
- Upload only changed blocks; commit new version.
For large files, block uploads are idempotent: if the upload is interrupted, the commit never fires, and on reconnect the client re-checks which blocks are already in the store and uploads only the remainder. This gives resumable upload for free.
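A sketch of the client-side loop this implies; `MetadataApi` and `BlockApi` are hypothetical interfaces standing in for the endpoints from 5b, not a real SDK.

```java
import java.util.List;
import java.util.Map;

/** Sketch of the client's resumable upload loop (hypothetical service interfaces). */
class UploadSession {
    interface MetadataApi {
        List<String> missingBlocks(List<String> blockSha256s);                 // POST /v1/blocks/check
        void commit(String fileId, long parentVersion, List<String> manifest); // POST /v1/files/{id}/versions
    }
    interface BlockApi {
        void putBlock(String sha256, byte[] bytes);                            // PUT to a pre-signed URL
    }

    void sync(MetadataApi meta, BlockApi blocks, String fileId, long parentVersion,
              List<String> manifest, Map<String, byte[]> blockBytes) {
        // On the first attempt and on every reconnect: ask which blocks are still missing.
        for (String sha : meta.missingBlocks(manifest)) {
            blocks.putBlock(sha, blockBytes.get(sha)); // idempotent: re-PUT of identical content is harmless
        }
        // The commit only fires once every referenced block is confirmed in the store.
        meta.commit(fileId, parentVersion, manifest);
    }
}
```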
5d. Notification & Peer Sync #
When a commit completes, the Metadata Service publishes an event to a Kafka topic keyed by userId. The Notification Service consumes this stream and pushes a lightweight delta to all devices of that user that are currently connected:
{ "fileId": "f123", "version": 42, "timestamp": "2026-04-28T10:00:00Z" }The receiving device fetches the new block manifest, diffs it against its local manifest, downloads only missing blocks from the CDN, and reassembles the file.
6. Data Model #
`files` table #

| Column | Type | Notes |
|---|---|---|
| `file_id` | UUID PK | Immutable identifier |
| `owner_user_id` | UUID FK | Owner for quota accounting |
| `parent_folder_id` | UUID FK | Nullable (root files) |
| `name` | VARCHAR(1024) | Display name; not part of storage key |
| `is_deleted` | BOOL | Soft delete; purged after 30-day trash TTL |
| `created_at` | TIMESTAMPTZ | |
`file_versions` table #

| Column | Type | Notes |
|---|---|---|
| `version_id` | BIGINT PK | Monotonically increasing per file |
| `file_id` | UUID FK | |
| `block_sha256s` | TEXT[] | Ordered list of block hashes |
| `size_bytes` | BIGINT | Sum of all block sizes |
| `client_mtime` | TIMESTAMPTZ | Device-local mtime at commit time |
| `committed_at` | TIMESTAMPTZ | Server commit timestamp |
| `device_id` | UUID | Which device created this version |
Indexes:
- `(file_id, version_id DESC)` — fetch latest version, range-scan history.
- `(owner_user_id, committed_at DESC)` — “recent activity” feed.
Partitioning: Partition `file_versions` by `committed_at` month; old partitions are archived to cold storage after 12 months.
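As a sketch of how the first index is used, a latest-version lookup might look like this (plain JDBC; table and column names as in the model above, assuming a PostgreSQL-style driver that maps `UUID` via `setObject`):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Optional;
import java.util.UUID;

class VersionQueries {
    /** Latest version of a file; ORDER BY ... DESC LIMIT 1 is served by the (file_id, version_id DESC) index. */
    static Optional<Long> latestVersionId(Connection db, UUID fileId) throws Exception {
        String sql = "SELECT version_id FROM file_versions WHERE file_id = ? ORDER BY version_id DESC LIMIT 1";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setObject(1, fileId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? Optional.of(rs.getLong("version_id")) : Optional.empty();
            }
        }
    }
}
```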
`shares` table #

| Column | Type | Notes |
|---|---|---|
| `share_id` | UUID PK | |
| `file_id` | UUID FK | |
| `grantee_user_id` | UUID | Nullable; null = link share |
| `permission` | ENUM('viewer', 'editor', 'owner') | |
| `expires_at` | TIMESTAMPTZ | Nullable |
7. Trade-offs #
Chunking Strategy: Fixed-size vs CDC #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Fixed 4 MB | Simple, deterministic split points | Poor dedup after insertions | Append-only files (logs) |
| CDC (Rabin/Gear) | High dedup even after mid-file edits | More CPU on client; variable chunk sizes complicate caching | General-purpose file sync |
Conclusion: CDC wins for user files where edits happen in the middle (documents, code). The extra CPU cost (< 50 ms for a 100 MB file on modern hardware) is negligible compared to bandwidth saved.
Consistency Model: Strong vs Eventual #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Strong (Spanner / CockroachDB) | No lost updates; clean conflict detection | Higher write latency (~10 ms cross-region) | Metadata commits |
| Eventual (Cassandra) | Lower latency, easier horizontal scale | Requires client-side conflict merge logic | Block existence checks |
Conclusion: Use strong consistency for the metadata commit (file version row) — losing an update is catastrophic for user trust. Use eventual consistency for read-path caches and the block existence index.
Conflict Resolution: Last-Write-Wins vs Fork #
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Last-Write-Wins (LWW) | Simple; no user friction | Silently discards offline edits | Binary files (images, executables) |
| Conflict Fork (Dropbox model) | No data loss; user sees “conflicted copy” | User must manually merge | Text and binary files when offline editing detected |
| Operational Transformation (OT) | Seamless collaborative editing | Complex; requires operational log | Google Docs / real-time collab |
Conclusion: For binary files, create a “conflicted copy” on the loser’s device and commit both versions — no data loss. For collaborative documents, delegate to the OT engine (out of scope here).
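A sketch of how the conflict fork could hook into the 5b commit path, using the manifest's `parentVersionId`. The `currentVersion` and `createConflictedCopy` helpers are hypothetical, not part of the earlier pseudocode.

```java
// Sketch: conflict detection at commit time via the manifest's parentVersionId.
@Transactional
public FileVersion commitWithConflictCheck(CommitRequest req) {
    long head = versionRepo.currentVersion(req.manifest().fileId());
    if (req.manifest().parentVersionId() != head) {
        // Another device committed first: fork instead of silently overwriting.
        String forkedFileId = fileRepo.createConflictedCopy(
            req.manifest().fileId(), req.clientDeviceId()); // e.g. "report (conflicted copy).docx"
        return commit(new CommitRequest(
            new BlockManifest(forkedFileId, 0L, req.manifest().blockSha256s()),
            req.clientDeviceId(), req.clientMtime()));
    }
    return commit(req); // fast path: the client built on the current head
}
```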
8. Failure Modes #
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Block Store (S3) | Regional outage | Downloads fail; uploads stall | Cross-region replication (S3 CRR); serve reads from replica; queue uploads locally |
| Metadata Service | DB primary failure | Commits blocked; reads may stall | Automatic failover (Spanner's multi-region leader election); read from replica during commit downtime |
| Notification Service | Missed push (client offline) | Peer device not synced until it reconnects | On reconnect, client polls for versions newer than its local watermark (cursor-based catch-up) |
| Upload (client side) | Network drop mid-upload | Partial blocks in store; commit never fires | Blocks are immutable; on reconnect, re-check missing blocks and upload only those; no dangling state |
| Dedup index (block SHA-256 registry) | Cache corruption / false positive | Client skips upload, block missing at read time | Block store is source of truth; dedup cache is advisory only; validate block existence at commit time |
| Hot user (large team folder) | Thousands of notification fan-outs per second | Notification Service overload | Coalesce events per user/folder within a 500 ms window before fanning out; rate-limit per shared folder |
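The coalescing mitigation in the last row might look like the following sketch: a hypothetical in-memory notifier that batches events per folder for 500 ms. A production version would also have to survive restarts and handle races that a `ConcurrentHashMap` alone does not.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/** Sketch: coalesce per-folder change events for 500 ms before fanning out. */
class NotificationCoalescer {
    private final Map<String, Long> pendingMaxVersion = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final BiConsumer<String, Long> fanOut; // (folderId, latestVersion) -> push to devices

    NotificationCoalescer(BiConsumer<String, Long> fanOut) { this.fanOut = fanOut; }

    void onCommit(String folderId, long version) {
        boolean firstInWindow = !pendingMaxVersion.containsKey(folderId);
        pendingMaxVersion.merge(folderId, version, Math::max); // keep only the highest version seen
        if (firstInWindow) {
            scheduler.schedule(() -> {
                Long latest = pendingMaxVersion.remove(folderId);
                if (latest != null) fanOut.accept(folderId, latest); // one push per 500 ms window
            }, 500, TimeUnit.MILLISECONDS);
        }
    }
}
```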
9. Security & Compliance #
Authentication & Authorization (AuthN/AuthZ):
- OAuth 2.0 with PKCE (Proof Key for Code Exchange) for third-party apps. First-party clients use short-lived JWTs (JSON Web Tokens) signed with rotating RSA (Rivest–Shamir–Adleman) keys.
- Every API call is validated against an ACL (Access Control List) check in the Metadata Service before block URLs are issued. Sharing a folder grants read/write on all descendant files — evaluated lazily at request time, not materialised into every row.
Encryption:
- At rest: blocks in S3/GCS encrypted with AES-256 (Advanced Encryption Standard 256-bit); metadata DB encrypted with TDE (Transparent Data Encryption). Encryption keys managed by the cloud KMS (Key Management Service) with per-customer CMKs (Customer-Managed Keys) for enterprise tiers.
- In transit: TLS (Transport Layer Security) 1.3 everywhere; pre-signed block URLs expire in 15 minutes and are scoped to a single block hash.
Input Validation:
- Block SHA-256 hashes are validated server-side before issuing pre-signed URLs — a client cannot request a URL for an arbitrary key pattern.
- File names are Unicode-normalised and stripped of path traversal sequences (`../`) before storage.
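A minimal sketch of that normalisation step, assuming NFC as the canonical form (the exact policy is a product decision):

```java
import java.text.Normalizer;

/** Sketch: normalise and sanitise a client-supplied file name before storing it. */
class FileNameSanitizer {
    static String sanitize(String raw) {
        String normalized = Normalizer.normalize(raw, Normalizer.Form.NFC); // canonical Unicode form
        // Strip traversal sequences and separators so the display name can never act as a path.
        String stripped = normalized.replace("..", "").replace("/", "").replace("\\", "");
        if (stripped.isBlank()) throw new IllegalArgumentException("empty file name after sanitisation");
        if (stripped.length() > 1024) stripped = stripped.substring(0, 1024); // VARCHAR(1024) cap
        return stripped;
    }
}
```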
GDPR / Right to Erasure:
- Soft-delete moves files to trash; hard-delete after 30 days purges the file_versions rows. Because blocks are shared across users (dedup), a block is only physically deleted when its reference count drops to zero — tracked in a separate GC (Garbage Collection) job that runs nightly.
- Crypto-shredding for enterprise: encrypt each user’s blocks with a per-user DEK (Data Encryption Key); erasure = delete the DEK.
Audit Log:
- Immutable audit stream: every file operation (upload, download, share, delete, permission change) emitted to a WORM (Write Once Read Many) log (S3 Object Lock / BigQuery append-only table). Required for SOC 2 (System and Organization Controls 2) Type II and HIPAA (Health Insurance Portability and Accountability Act) enterprise customers.
Rate Limiting:
- Per-user upload/download throughput capped at the Metadata Service layer (token bucket, 100 MB/s default, configurable per tier). Prevents a single power user from monopolising Block Store bandwidth.
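A per-user token bucket of the kind described here fits in a few lines; the capacity and refill rate would come from the tier configuration.

```java
/** Sketch: per-user token bucket capping upload/download throughput (e.g. 100 MB/s sustained). */
class TokenBucket {
    private final double capacityBytes;      // burst size
    private final double refillBytesPerSec;  // sustained rate
    private double tokens;
    private long lastRefillNanos = System.nanoTime();

    TokenBucket(double capacityBytes, double refillBytesPerSec) {
        this.capacityBytes = capacityBytes;
        this.refillBytesPerSec = refillBytesPerSec;
        this.tokens = capacityBytes;
    }

    /** Returns true if the request may proceed; false means the caller should throttle (HTTP 429). */
    synchronized boolean tryConsume(long requestBytes) {
        long now = System.nanoTime();
        tokens = Math.min(capacityBytes, tokens + (now - lastRefillNanos) / 1e9 * refillBytesPerSec);
        lastRefillNanos = now;
        if (tokens < requestBytes) return false;
        tokens -= requestBytes;
        return true;
    }
}
```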
10. Observability #
RED Metrics (Rate / Errors / Duration) #
| Signal | Metric | Alert Threshold |
|---|---|---|
| Upload rate | `uploads_committed_total` (counter) | < 50% of 7-day baseline → PagerDuty |
| Upload error rate | `uploads_failed_total / uploads_attempted_total` | > 1% over 5 min |
| Commit latency (p99) | `metadata_commit_duration_seconds` | > 2 s |
| Download error rate | `block_download_errors_total` | > 0.1% over 5 min |
| Notification delivery lag | `notification_lag_seconds` (histogram) | p95 > 10 s |
Saturation Metrics #
| Resource | Metric | Alert Threshold |
|---|---|---|
| Block Store | S3 request rate vs service quota | > 80% of quota |
| Metadata DB | Replication lag | > 5 s |
| Notification Service | Connection count | > 90% of max |
Business Metrics #
- Sync success rate: fraction of file changes that reach all devices within 30 s.
- Dedup ratio: `bytes_skipped / bytes_attempted` — tracks storage efficiency.
- p99 end-to-end sync latency: from client commit to peer reassembly; target < 30 s.
Tracing #
Distributed traces (OpenTelemetry) span the full upload path: client SDK → Metadata Service → Block Store → Notification Service → peer client. The version_id is the trace correlation ID; every service logs it, enabling timeline reconstruction for any sync incident.
11. Scaling Path #
Phase 1 — MVP (< 10K DAU) #
Single-region deployment. PostgreSQL for metadata (single primary). S3 for blocks. No CDN — clients download directly from S3. Monolithic backend service. No notification push — clients poll every 30 s.
What breaks first: Polling at 10K DAU generates 333 QPS of metadata reads — manageable but noisy.
Phase 2 — Growth (10K → 500K DAU) #
Replace polling with long-poll / SSE (Server-Sent Events) connections. Add Redis for hot metadata cache. Add CDN for block downloads. Introduce the check-then-commit block dedup flow. Split Metadata Service from Block Upload API.
What breaks first: Metadata DB write throughput. A single Postgres primary tops out around 5 000 write TPS (Transactions Per Second). At 500K DAU with 5 file changes/day, average writes are only ~30 TPS — fine. But shared-folder fan-outs can spike this 100×.
Phase 3 — Scale (500K → 10M DAU) #
Introduce Spanner / CockroachDB for globally distributed metadata with strong consistency. Shard notification connections across a fleet of WebSocket servers, using Kafka as the fan-out backbone. Add a dedup bloom filter in the client to skip the server check for blocks the client has previously confirmed exist. Introduce async version garbage collection.
What breaks first: Notification fan-out for large shared folders (teams with thousands of members). Move to a hierarchical fan-out: folder-level subscription with coalescing.
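The client-side dedup Bloom filter from Phase 3 can be small and self-contained because block keys are already uniformly distributed SHA-256 digests; a false positive merely skips a `/blocks/check` for a block the server may not have, and the commit-time validation then catches it. A sketch (the class name and sizing are illustrative):

```java
import java.util.BitSet;

/** Sketch: client-side Bloom filter over block SHA-256 hex keys the server has already confirmed. */
class BlockBloomFilter {
    private final BitSet bits;
    private final int sizeBits;
    private final int numHashes;

    BlockBloomFilter(int sizeBits, int numHashes) {
        this.bits = new BitSet(sizeBits);
        this.sizeBits = sizeBits;
        this.numHashes = numHashes; // must be <= 8 so the slices below stay within 64 hex chars
    }

    // Block keys are SHA-256 hex digests, so successive 8-hex-char slices are
    // already uniformly distributed and can serve as independent hash values.
    private int index(String sha256Hex, int i) {
        long h = Long.parseLong(sha256Hex.substring(i * 8, i * 8 + 8), 16);
        return (int) (h % sizeBits);
    }

    void markConfirmed(String sha256Hex) {
        for (int i = 0; i < numHashes; i++) bits.set(index(sha256Hex, i));
    }

    /** True = the server has probably seen this block before (may be a false positive). */
    boolean probablySeen(String sha256Hex) {
        for (int i = 0; i < numHashes; i++) if (!bits.get(index(sha256Hex, i))) return false;
        return true;
    }
}
```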
Phase 4 — Hyperscale (10M → 500M DAU) #
Per-region block store with cross-region replication and intelligent routing (serve blocks from the region nearest to the device). Multi-cell metadata sharding by user_id range. Separate quota service. ML-driven prefetch: predict which files a user will open on their mobile device and pre-warm the CDN before they arrive.
What breaks first: Block store cold-start costs. Tiered storage (hot/warm/cold) with lifecycle policies moves infrequently-accessed blocks to cheaper tiers (S3 Glacier).
12. Enterprise Considerations #
Brownfield Integration:
- Large enterprises already run SharePoint, NFS (Network File System) shares, or on-prem NAS (Network-Attached Storage). Dropbox Business and Google Workspace offer on-prem sync agents that bridge the local file system to the cloud store without migrating all data at cutover.
Build vs Buy:
- Block Store: always buy (S3/GCS/Azure Blob). Building a durable, globally-replicated object store from scratch takes years.
- Metadata DB: Spanner for Google, CockroachDB or Aurora Global for others. Viable open-source option: PostgreSQL with Citus for sharding.
- CDN: CloudFront, Fastly, or Akamai. The immutable block cache key (SHA-256) means CDN hit rates can exceed 90% for popular content.
- Notification: build in-house on top of Kafka + a WebSocket gateway; off-the-shelf solutions (Pusher, Ably) work for early stages.
Multi-Tenancy:
- Enterprise customers (e.g., a hospital using Google Workspace) require data residency (blocks stored only in the EU). Implement per-tenant storage class with a region affinity tag on the metadata row. Block upload routing respects the tag.
- Noisy-neighbour risk: a single enterprise team with 10 000 members generates massive notification fan-out on every commit. Rate-limit notifications per shared folder, coalesce within a 1 s window.
TCO (Total Cost of Ownership) Ballpark:
- Block storage: ~$0.023/GB/month (S3 Standard). At the ~7.5 EB corpus estimated above, that is on the order of $172M/month at list price (one reason tiered storage and purpose-built block stores like Magic Pocket become attractive at this scale).
- Dedup savings: a 60% hit rate cuts the effective cost of newly uploaded data to roughly $0.009 per GB-equivalent.
- Egress: $0.09/GB from S3. CDN offloads ~85% → effective egress ~$0.013/GB-equivalent.
- Compute (Metadata Service + Notification): ~$50K/month at 50M DAU scale.
Conway’s Law Implication: The clean split between Metadata Service and Block Store almost always maps to two separate engineering teams. The API boundary (block manifests, pre-signed URLs) becomes the contract between those teams — keep it stable and versioned.
13. Interview Tips #
- Clarify scale first: “How many users, average file size, what’s the expected change rate per user per day?” These numbers drive every sizing decision. A 10K-user startup and a 500M-user consumer product have totally different bottlenecks.
- Lead with chunking and dedup: Most candidates jump to “store files in S3” — the 10× more interesting answer is why you chunk first, what CDC buys you, and how server-side dedup cuts bandwidth costs. This is the differentiating insight.
- Don’t forget the client sync engine: Interviewers often probe “how does the desktop client know what changed?” Cover file system watchers, the local manifest cache, and how the sync queue batches rapid successive saves.
- Nail conflict resolution: “What happens when two devices edit the same file offline?” This is the canonical follow-up. Know the three options (LWW, conflict fork, OT) and when each is appropriate.
- Vocabulary that signals fluency: content-addressable storage, CDC (Content-Defined Chunking), pre-signed URLs, idempotent block upload, watermark-based catch-up sync, crypto-shredding for GDPR erasure, WORM audit log.
14. Further Reading #
- Dropbox Magic Pocket (2016): Dropbox’s engineering blog post on building their own block store to replace S3 — covers erasure coding, rack-aware placement, and the economics of going on-prem at exabyte scale.
- Google’s Colossus: The successor to GFS (Google File System) that underpins Google Drive’s blob layer. The original GFS paper (Ghemawat et al., SOSP 2003) remains the canonical reference for distributed file system design.
- rsync algorithm (Andrew Tridgell, 1996): The rolling-checksum delta-sync algorithm that inspired modern chunking approaches. Short and readable — understanding it deeply answers 80% of “how do you sync efficiently over a slow link?” questions.
- CAP Theorem (Brewer, 2000): The theoretical foundation for the consistency trade-off between the Metadata Service (CP) and the block existence cache (AP).