
[{"content":"System design fundamentals\n","date":"18 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/","section":"Posts","summary":"System design fundamentals\n","title":"System Design Basics","type":"posts"},{"content":"Java language features, JVM internals, and platform evolution from Java 8 to 21.\n","date":"18 April 2026","externalUrl":null,"permalink":"/posts/java/","section":"Posts","summary":"Java language features, JVM internals, and platform evolution from Java 8 to 21.\n","title":"Java","type":"posts"},{"content":"Spring Boot and Spring Framework evolution, trade-offs, and migration guides.\n","date":"18 April 2026","externalUrl":null,"permalink":"/posts/spring/","section":"Posts","summary":"Spring Boot and Spring Framework evolution, trade-offs, and migration guides.\n","title":"Spring","type":"posts"},{"content":"All posts on engineering, system design, Java, Spring, and leadership.\n","date":"18 April 2026","externalUrl":null,"permalink":"/system-design/classic/","section":"System designs - 100+","summary":"All posts on engineering, system design, Java, Spring, and leadership.\n","title":"Classic","type":"system-design"},{"content":"All posts on engineering, system design, Java, Spring, and leadership.\n","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"All posts on engineering, system design, Java, Spring, and leadership.\n","title":"Posts","type":"posts"},{"content":"100 system design questions\n","externalUrl":null,"permalink":"/system-design/","section":"System designs - 100+","summary":"100 system design questions\n","title":"System designs - 100+","type":"system-design"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/base62/","section":"Tags","summary":"","title":"Base62","type":"tags"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/caching/","section":"Tags","summary":"","title":"Caching","type":"tags"},{"content":"","date":"18 April 
2026","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/categories/Classic/","section":"Categories","summary":"","title":"Classic","type":"categories"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/distributed-systems/","section":"Tags","summary":"","title":"Distributed-Systems","type":"tags"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/scalability/","section":"Tags","summary":"","title":"Scalability","type":"tags"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/categories/System-Design/","section":"Categories","summary":"","title":"System Design","type":"categories"},{"content":" Classic \u0026ldquo;Design X\u0026rdquo; Questions # # Topic Category Status Published URL Notes 1 URL Shortener (bit.ly) Classic published /system-design/classic/url-shortener Base62, hash collisions, redirect latency 2 Twitter / Social Media Feed Classic todo Fan-out on write vs read, timeline 3 Instagram Classic todo Photo storage, explore, hashtag index 4 WhatsApp / Chat Messaging System Classic todo WebSocket, message ordering, E2EE 5 Uber / Ride-Sharing System Classic todo Geohash, supply-demand matching 6 Netflix / Video Streaming Platform Classic todo CDN, adaptive bitrate, personalisation 7 YouTube Classic todo Upload pipeline, transcoding, comments 8 Dropbox / Google Drive (File Sync) Classic todo Block dedup, delta sync, chunking 9 Google Docs (Real-Time Collaborative Editing) Classic todo OT vs CRDT, conflict-free merges 10 Search Engine (Google-scale) Classic todo Crawl → index → rank → serve 11 Google Maps / Routing Engine Classic todo Dijkstra, A*, road graph sharding 12 Web Crawler Classic todo Politeness, dedup, frontier scheduling 13 Recommendation System (end-to-end) Classic todo Collaborative filtering, two-tower, serving 14 Notification Service 
(email, push, SMS at scale) Classic todo Fanout, deduplication, delivery tracking 15 Rate Limiter Classic todo Token bucket, sliding window, Redis 16 Distributed Cache Classic todo Eviction policies, clustering, consistency 17 Key-Value Store (Redis / DynamoDB internals) Classic todo LSM tree, WAL, consistent hashing 18 Distributed Message Queue (Kafka) Classic todo Partitions, offsets, consumer groups 19 Logging \u0026amp; Metrics System (Datadog / ELK) Classic todo Structured logs, TSDB, alerting 20 Distributed File System (HDFS / GFS) Classic todo NameNode, replication, rack awareness Fintech # # Topic Category Status Published URL Notes 21 Payment Processing System Fintech todo Idempotency, saga, PCI DSS 22 Digital Wallet Fintech todo Balance model, top-up, withdrawal 23 Money Transfer System (Venmo / Wise / Moniepoint) Fintech todo Cross-border, FX, settlement rails 24 Fraud Detection System Fintech todo Rule engine + ML, real-time scoring 25 Card Authorization System Fintech todo Issuer, network, sub-100ms auth 26 Ledger / Double-Entry Bookkeeping System Fintech todo Immutable entries, balance integrity 27 Reconciliation System (two financial systems) Fintech todo Eventual consistency, diff engine 28 KYC / AML Onboarding Flow Fintech todo Watchlist, PEP screening, risk scoring 29 High-Throughput Transaction Processing System Fintech todo LMAX Disruptor, mechanical sympathy 30 Currency Conversion System Fintech todo FX rate feed, rounding, audit trail 31 Banking Core — Account Balances at Scale Fintech todo ACID at scale, regulatory reporting E-Commerce \u0026amp; Marketplace # # Topic Category Status Published URL Notes 32 Amazon — Catalog, Search \u0026amp; Checkout E-Commerce todo Product graph, ranking, checkout saga 33 Flash Sale / High-Contention Inventory E-Commerce todo Thundering herd, queue, fairness 34 Shopping Cart (multi-session, multi-device) E-Commerce todo Merge strategies, guest → auth 35 Inventory Management System (multi-warehouse) 
E-Commerce todo Oversell prevention, reservation TTL 36 Price Comparison Engine E-Commerce todo Crawl-and-normalize, ranking, freshness 37 Coupon \u0026amp; Promotion Engine E-Commerce todo Rule DSL, stacking, abuse prevention 38 Airbnb — Search, Booking \u0026amp; Availability E-Commerce todo Geo search, calendar blocking, pricing 39 Order Management System E-Commerce todo State machine, fulfilment pipeline 40 Loyalty / Points \u0026amp; Rewards System E-Commerce todo Ledger, expiry, redemption Booking \u0026amp; Reservation # # Topic Category Status Published URL Notes 41 Movie Ticket Booking System (BookMyShow) Booking todo Seat lock, payment window, concurrency 42 Hotel Booking System Booking todo Availability calendar, overbooking policy 43 Airline Reservation System Booking todo PNR, seat classes, fare rules 44 Restaurant Reservation System (OpenTable) Booking todo Table inventory, waitlist, no-show 45 Calendar \u0026amp; Scheduling System (Calendly) Booking todo Availability slots, timezone, conflict Real-Time \u0026amp; Streaming # # Topic Category Status Published URL Notes 46 Live Video Streaming Platform (Twitch) Real-Time todo Ingest, transcode, CDN edge, chat 47 Live Sports Scores System Real-Time todo Push vs poll, SSE, fan-out 48 Online Multiplayer Game Backend Real-Time todo State sync, authoritative server, lag comp 49 Stock Trading Platform Real-Time todo Order matching, LMAX, market data feed 50 Real-Time Analytics Dashboard Real-Time todo Kafka + Flink, OLAP, query latency 51 Collaborative Whiteboard Real-Time todo CRDT, WebSocket, cursor presence Infrastructure \u0026amp; Developer Tools # # Topic Category Status Published URL Notes 52 CI/CD System Infra \u0026amp; DevTools todo Pipeline DAG, artifact store, rollback 53 Feature Flag Service (LaunchDarkly) Infra \u0026amp; DevTools todo Progressive rollout, targeting rules 54 Configuration Management System Infra \u0026amp; DevTools todo Hot reload, versioning, audit 55 Secrets Management System 
(Vault) Infra \u0026amp; DevTools todo Dynamic secrets, lease renewal, KMS 56 API Gateway Infra \u0026amp; DevTools todo Auth, routing, throttling, observability 57 Service Registry \u0026amp; Discovery Infra \u0026amp; DevTools todo Consul, Eureka, health checks 58 Distributed Job Scheduler (cron at scale) Infra \u0026amp; DevTools todo Exactly-once, leader election, sharding 59 Workflow Engine (Airflow / Temporal) Infra \u0026amp; DevTools todo DAG execution, retries, durable state Content \u0026amp; Media # # Topic Category Status Published URL Notes 60 Content Delivery Network (CDN) Content \u0026amp; Media todo PoP placement, cache hierarchy, purge 61 Image Hosting \u0026amp; Serving System Content \u0026amp; Media todo On-the-fly resize, WebP, CDN offload 62 Podcast Hosting Platform Content \u0026amp; Media todo Audio storage, RSS, analytics 63 News Feed Aggregator Content \u0026amp; Media todo RSS crawl, dedup, personalisation 64 Content Moderation System Content \u0026amp; Media todo ML classifier + human review pipeline 65 Comment System at Scale Content \u0026amp; Media todo Threading, voting, spam, hot content AI / ML Systems # # Topic Category Status Published URL Notes 66 Recommendation System — Full Pipeline AI/ML todo Candidate gen → ranking → serving 67 LLM-Powered Chatbot at Scale AI/ML todo Streaming tokens, session, cost control 68 RAG System over Enterprise Documents AI/ML todo Chunking, embeddings, retrieval, grounding 69 A/B Testing \u0026amp; Experimentation Platform AI/ML todo Assignment, metrics, stat significance 70 Feature Store for ML AI/ML todo Online vs offline, point-in-time correctness 71 ML Model Serving Infrastructure AI/ML todo Shadow mode, canary, latency SLO 72 Vector Database \u0026amp; Semantic Search AI/ML todo HNSW, ANN, embedding freshness Data \u0026amp; Storage Systems # # Topic Category Status Published URL Notes 73 Search Engine Internals (Elasticsearch) Data \u0026amp; Storage todo Inverted index, relevance scoring 74 
Time-Series Database Data \u0026amp; Storage todo InfluxDB, downsampling, retention 75 Graph Database \u0026amp; Social Network Queries Data \u0026amp; Storage todo Neo4j, shortest path, friend-of-friend 76 Data Warehouse \u0026amp; Lakehouse Architecture Data \u0026amp; Storage todo Iceberg, Parquet, partitioning 77 Change Data Capture (CDC) Data \u0026amp; Storage todo Debezium, Binlog tailing, event propagation 78 Consistent Hashing Deep Dive Data \u0026amp; Storage todo Virtual nodes, hot spots, rebalancing 79 Bloom Filter \u0026amp; Probabilistic Data Structures Data \u0026amp; Storage todo HyperLogLog, Count-Min Sketch 80 LRU / LFU Cache Implementation Data \u0026amp; Storage todo LinkedHashMap, Caffeine, eviction Reliability \u0026amp; Operations # # Topic Category Status Published URL Notes 81 Distributed Tracing System Reliability todo OpenTelemetry, sampling, tail-based 82 Circuit Breaker \u0026amp; Bulkhead Patterns Reliability todo Resilience4j, half-open, fallback 83 Disaster Recovery — RTO / RPO Planning Reliability todo Backup strategies, failover runbook 84 Chaos Engineering Framework Reliability todo Steady state, blast radius, game days 85 Zero-Downtime Deployments \u0026amp; Schema Migrations Reliability todo Blue-green, expand-contract, canary 86 Distributed Lock Service Reliability todo Redlock, fencing tokens, ZooKeeper 87 Leader Election \u0026amp; Consensus (Raft / Paxos) Reliability todo Split-brain, quorum, term numbers 88 Multi-Region Active-Active Design Reliability todo Conflict resolution, CRDT, global LB Security \u0026amp; Compliance # # Topic Category Status Published URL Notes 89 Identity \u0026amp; Access Management (IAM) Security todo RBAC vs ABAC, policy engine 90 OAuth2 \u0026amp; OpenID Connect Deep Dive Security todo Token lifecycle, PKCE, refresh rotation 91 Zero-Trust Network Architecture Security todo mTLS, BeyondCorp, SPIFFE/SPIRE 92 Audit Logging \u0026amp; Compliance Trail Security todo Immutable log, SOC2, GDPR 93 GDPR 
Right-to-Erasure Implementation Security todo Crypto-shredding, propagation 94 Data Masking \u0026amp; Tokenisation Service Security todo PCI DSS, PII vault 95 Healthcare — Patient Record System (EHR) Compliance todo HIPAA, consent management Architecture Patterns # # Topic Category Status Published URL Notes 96 Event Sourcing + CQRS Architecture todo Append-only log, projection rebuild 97 Saga Pattern (Distributed Transactions) Architecture todo Choreography vs orchestration 98 Strangler Fig \u0026amp; Anti-Corruption Layer Architecture todo Monolith migration, domain boundary 99 Multi-Tenant SaaS Platform Architecture Architecture todo Isolation models, noisy neighbour 100 Outbox Pattern + Transactional Messaging Architecture todo At-least-once, idempotent consumers Java Deep Dives # # Topic Category Status Published URL Notes 101 Virtual Threads vs Reactive (Loom vs WebFlux) Java Deep Dive todo Java 21, I/O bound, thread-per-request 102 JVM GC Tuning for Production Java Deep Dive todo G1 vs ZGC vs Shenandoah, Generational ZGC 103 Spring Boot 3 + GraalVM Native Image Java Deep Dive todo AOT, reflection hints, startup time 104 Structured Concurrency (Java 21) Java Deep Dive todo StructuredTaskScope, cancellation 105 CompletableFuture Pitfalls in Production Java Deep Dive todo Error propagation, thread pool starvation 106 Domain-Driven Design with Records \u0026amp; Sealed Classes Java Deep Dive todo Value objects, aggregates, exhaustive switch 107 Database Connection Pool Tuning (HikariCP) Java Deep Dive todo Pool sizing formula, leak detection 108 Reactive Streams \u0026amp; Backpressure (Project Reactor) Java Deep Dive todo Flux, Mono, scheduler selection Geospatial \u0026amp; Location # # Topic Category Status Published URL Notes 109 Ride-Hailing Pricing Engine (Surge) Geospatial todo Real-time demand model, elasticity 110 Location Tracking \u0026amp; Geo-Fencing Service Geospatial todo Moving objects, polygon queries, alerts 111 Food Delivery Dispatch System 
Geospatial todo Assignment optimisation, ETA, batching High Performance # # Topic Category Status Published URL Notes 112 High-Frequency Trading Infrastructure High Performance todo Kernel bypass, co-location, FPGA 113 Video Conferencing (WebRTC Infrastructure) High Performance todo SFU vs MCU, TURN/STUN, jitter buffer 114 IoT Device Management Platform High Performance todo MQTT, device shadow, OTA updates 115 Service Mesh + Observability (Istio / Envoy) High Performance todo mTLS, traffic policy, telemetry Bonus: Platform \u0026amp; FinOps # # Topic Category Status Published URL Notes 116 Internal Developer Platform (IDP) Architecture todo Golden paths, self-service, paved road 117 Cost Optimisation Framework (FinOps) Architecture todo Right-sizing, spot strategy, waste 118 gRPC vs REST vs GraphQL — Protocol Trade-offs Architecture todo When to pick which, streaming, contracts 119 Event-Driven Architecture Deep Dive Architecture todo Domain events, eventual consistency 120 Ad Click Aggregation \u0026amp; Attribution System Scalability todo Lambda arch, exactly-once, privacy Progress # Total topics: 120 Published: 1 (1%) In progress: 0 Todo: 119 Last updated: 2026-04-18\nCategory Breakdown # Category Count Classic \u0026ldquo;Design X\u0026rdquo; 20 Fintech 11 E-Commerce \u0026amp; Marketplace 9 Booking \u0026amp; Reservation 5 Real-Time \u0026amp; Streaming 6 Infra \u0026amp; Developer Tools 8 Content \u0026amp; Media 6 AI / ML Systems 7 Data \u0026amp; Storage 8 Reliability \u0026amp; Operations 8 Security \u0026amp; Compliance 7 Architecture Patterns 5 Java Deep Dives 8 Geospatial \u0026amp; Location 3 High Performance 4 Platform \u0026amp; FinOps 5 Total 120 Suggested Study Sequence # Weeks 1–4 (Foundation): #1–20 — Classic questions. 
Build the pattern muscle.\nWeeks 5–8 (Fintech focus): #21–31 — Moniepoint-relevant depth.\nWeeks 9–12 (E-Commerce + Booking): #32–45 — Transactional systems, contention.\nWeeks 13–16 (Real-Time + Infra): #46–59 — Operational maturity signals.\nWeeks 17–20 (Content + AI/ML): #60–72 — Modern system design vocabulary.\nWeeks 21–24 (Data + Reliability): #73–88 — Senior/staff-level depth.\nWeeks 25–28 (Security + Arch + Java): #89–120 — EM/architect differentiation.\nAt 1 topic / weekday: ~24 weeks to full coverage. At 1 topic / day: 4 months.\n","date":"18 April 2026","externalUrl":null,"permalink":"/quest-sheet/","section":"","summary":"","title":"System Design Quest Sheet","type":"page"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":" 1. Hook # Every time you click a bit.ly or t.co link, a distributed system silently resolves a 7-character code to a full URL and redirects you — in under 10 milliseconds — before your browser even renders the loading spinner. Behind that invisible handshake sits a deceptively rich design problem: how do you build a service that creates billions of short codes, never loses a mapping, and serves hundreds of thousands of reads per second with single-digit millisecond latency, all while preventing abuse, surviving data-centre failures, and staying profitable?\nThe URL Shortener is a canonical warm-up question in system design interviews precisely because it spans the full stack — hashing, storage, caching, CDN, security, and analytics — without overwhelming complexity. Master it and you have a reusable vocabulary for every \u0026ldquo;design at scale\u0026rdquo; discussion that follows.\n2. 
Problem Statement # Functional Requirements # Shorten: Given a long URL, return a unique short code (e.g., https://sho.rt/aB3xYz). Redirect: GET /\u0026lt;code\u0026gt; responds with HTTP 301/302 to the original URL. Custom aliases: Users may optionally specify a desired short code (subject to availability). Expiry: URLs may have an optional TTL after which the short link is invalidated. Analytics: Track click count, referrer, and geo per short code (async, non-blocking on redirect). Non-Functional Requirements # Property Target Redirect latency (p99) \u0026lt; 10 ms Write latency (shorten) \u0026lt; 200 ms Availability 99.99% (\u0026lt; 53 min downtime/year) Durability Zero mapping loss Read:Write ratio ~200:1 Short code length 7 characters (Base62) Out of Scope # Rich link preview / Open Graph metadata generation A/B split redirects QR code generation Browser extensions or mobile SDKs 3. Scale Estimation # Assumptions\n100 M new URLs shortened per day (write-heavy by internet standards, but still dwarfed by reads) 20 B redirects per day (200:1 read:write) Average long URL: 200 bytes; short URL record: ~500 bytes total (with metadata) Retention: 5 years Metric Calculation Result Write QPS 100 M / 86 400 s ~1 160 writes/s Read QPS (avg) 20 B / 86 400 s ~231 000 reads/s Read QPS (peak, 10×) 231 K × 10 ~2.3 M reads/s Storage/day 100 M × 500 B ~50 GB/day Storage/5 years 50 GB × 365 × 5 ~91 TB Redirect bandwidth 231 K × 500 B ~115 MB/s avg Cache size (20% hot) 20 B × 20% × 500 B ~2 TB working set Key insight: The system is overwhelmingly read-dominated. The primary design challenge is serving 2.3 M reads/second at sub-10 ms latency — not the write path.\n4. 
High-Level Design # graph TD Client[\"Browser / Mobile App\"] DNS[\"DNS / Anycast\"] CDN[\"CDN Edge PoP\\n(Cloudflare / Fastly)\"] LB[\"L7 Load Balancer\\n(NGINX / Envoy)\"] WriteAPI[\"Write API Cluster\\n(Shorten Service)\"] ReadAPI[\"Redirect Service Cluster\\n(Read-heavy)\"] Cache[\"Redis Cluster\\n(code → long_url)\"] DB[\"Primary DB\\n(PostgreSQL / Cassandra)\"] DBReplica[\"Read Replicas × N\"] Analytics[\"Analytics Kafka Topic\"] AnalyticsConsumer[\"Flink / Spark\\nStreaming Consumer\"] AnalyticsStore[\"ClickHouse\\n(Analytics OLAP)\"] Client --\u003e|\"GET /aB3xYz\"| DNS DNS --\u003e CDN CDN --\u003e|\"Cache miss\"| LB LB --\u003e ReadAPI ReadAPI --\u003e|\"L1 miss\"| Cache Cache --\u003e|\"L2 miss\"| DBReplica ReadAPI --\u003e|\"fire-and-forget\"| Analytics Client --\u003e|\"POST /api/shorten\"| LB LB --\u003e WriteAPI WriteAPI --\u003e DB DB --\u003e|\"replication\"| DBReplica WriteAPI --\u003e|\"prime cache\"| Cache Analytics --\u003e AnalyticsConsumer --\u003e AnalyticsStore Read path: Browser → CDN (HTTP cache, TTL ~60 s for 302) → Redirect Service → Redis L1 (hit rate ~95%) → DB read replica (cache miss). Response is a single 302 HTTP redirect.\nWrite path: Client → Write API → generate code → persist to primary DB → prime Redis → return short URL. Entirely off the critical redirect path.\n5. Deep Dive # 5.1 Short Code Generation — Base62 + Counter vs. Hashing # This is the crux of the design. There are three viable strategies:\nStrategy A: MD5/SHA-256 hash of the long URL, take first 7 chars\nHash the URL, encode as Base62, truncate to 7 characters. Simple, but collision probability is non-trivial: with 7 Base62 characters you have 62⁷ ≈ 3.5 trillion slots. For 100 M URLs/day over 5 years that is ~182 B entries — about 5% of the keyspace. 
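A minimal sketch of Strategy A, assuming SHA-256 via the JDK's MessageDigest: one common way to realise "hash, Base62-encode, truncate" is to fold the first eight digest bytes into a long and emit exactly 7 Base62 digits. HashShortener is an illustrative name, not part of the original design:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Strategy A sketch: hash the long URL, fold the digest into a long,
// and emit exactly 7 Base62 characters. Deterministic: the same URL
// always yields the same code — which is also why collisions between
// different URLs must be handled separately.
public final class HashShortener {
    private static final char[] ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();

    public static String shorten(String longUrl) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(longUrl.getBytes(StandardCharsets.UTF_8));
            long n = 0;
            for (int i = 0; i < 8; i++) {   // fold the first 8 digest bytes into a long
                n = (n << 8) | (digest[i] & 0xFF);
            }
            n &= Long.MAX_VALUE;            // keep it non-negative
            char[] code = new char[7];
            for (int i = 6; i >= 0; i--) {  // exactly 7 Base62 digits (zero-padded)
                code[i] = ALPHABET[(int) (n % 62)];
                n /= 62;
            }
            return new String(code);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is a mandatory JDK algorithm", e);
        }
    }
}
```

Note the determinism: two calls with the same URL always collide on purpose, and two different URLs can collide by accident.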
The birthday paradox means you will start seeing collisions well before saturation; you need a retry loop with an incremented salt.

Worse, two users shortening the same URL get the same code — which is a feature for deduplication but a bug if user A's URL expires and user B's doesn't.

Strategy B: Auto-increment counter + Base62 encode (chosen)

Maintain a globally unique, monotonically increasing counter. Encode it in Base62 ([0-9A-Za-z]). A 7-character Base62 number gives ~3.5 T unique codes — enough for 96 years at 100 M/day.

The counter can live in a dedicated Counter Service backed by Redis INCR (atomic, single-threaded in Redis). To avoid a hot single Redis node and the SPOF it creates, pre-allocate ranges to each Write API node: node 1 owns [1..1000], node 2 owns [1001..2000], and so on. Each node burns through its range in memory before requesting a new batch — similar to Flickr's ticket servers or Twitter Snowflake.

```java
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicLong;

// Java 17 record for a pre-allocated counter range
public record CounterRange(long start, long end, AtomicLong current) {
    public static CounterRange of(long start, long end) {
        return new CounterRange(start, end, new AtomicLong(start));
    }

    // Empty when the range is exhausted — the caller then requests a fresh batch.
    public OptionalLong next() {
        long val = current.getAndIncrement();
        return val <= end ? OptionalLong.of(val) : OptionalLong.empty();
    }
}

public final class Base62Encoder {
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    public static String encode(long n) {
        if (n == 0) return "0";
        var sb = new StringBuilder();
        while (n > 0) {
            sb.append(ALPHABET.charAt((int) (n % 62)));
            n /= 62;
        }
        return sb.reverse().toString();
    }
}
```

Strategy C: UUID / random

128-bit random, truncated. No coordination needed, but high collision risk and no natural ordering for range scans.

Verdict: Counter + Base62 wins.
It's collision-free by construction, produces compact codes, and the batch-range trick eliminates coordination on the hot path.

5.2 Redirect Service #

The Redirect Service is a thin, stateless HTTP layer. Its sole job is:

1. Parse /{code} from the path.
2. Look up the long URL in the local L1 cache (an in-process Caffeine cache, 10 K entries, 5 s TTL).
3. On miss, look up Redis (sub-millisecond over a private network).
4. On Redis miss, query a DB read replica and backfill Redis (TTL: 24 h).
5. If the code is expired or unknown, return 404.
6. Emit a click event to Kafka (fire-and-forget, async, non-blocking).
7. Return HTTP 302 to the long URL.

301 vs 302: A 301 (permanent) is cached by the browser indefinitely — great for bandwidth, terrible for analytics since subsequent clicks never reach your servers. Bit.ly uses 301 for bandwidth savings but loses analytics fidelity on repeat visitors. Most enterprise shorteners use 302 (temporary) so every click is trackable. Use 302 unless bandwidth is the dominant cost.

5.3 CDN Layer #

For popular short codes (viral links, marketing campaigns), push the 302 redirect to CDN edge nodes. CDN Cache-Control: max-age=60 means the edge serves the redirect without touching origin for 60 seconds. At 2.3 M peak RPS, even a 70% CDN hit rate offloads 1.6 M RPS from the origin fleet.

Custom aliases and codes with imminent expiry should be tagged Cache-Control: no-store to avoid serving stale 404s from CDN.

6. Data Model #

Primary URL Table (PostgreSQL or Cassandra) #

| Column | Type | Notes |
|---|---|---|
| code | VARCHAR(10) | Primary key, Base62-encoded counter |
| long_url | TEXT | Up to 8 KB |
| user_id | BIGINT | FK to users; nullable for anonymous |
| created_at | TIMESTAMPTZ | Creation time |
| expires_at | TIMESTAMPTZ | Nullable; NULL = never expires |
| is_custom | BOOLEAN | True if user-specified alias |
| click_count | BIGINT | Approximate; updated async |

Indexes:

Primary key on code — covers all redirect lookups.
(user_id, created_at DESC) — covers "show my links" dashboard queries.
Partial index on expires_at WHERE expires_at IS NOT NULL — efficient TTL sweep job.

Partitioning: At 91 TB over 5 years, partition by created_at month in PostgreSQL. Old partitions (> 5 years) are detached and archived to object storage (S3 Glacier).

Why not Cassandra? For pure key-value redirect lookups, Cassandra's wide-column store is a natural fit and scales writes horizontally without a leader. However, Cassandra sacrifices ad-hoc querying and strong consistency. If analytics and user dashboards are important (they are), PostgreSQL with read replicas and a Redis cache layer is simpler to operate. At truly massive scale (>10 B codes), migrate the hot redirect table to Cassandra while keeping the analytics in PostgreSQL.

Redis Cache Schema #

```
SET url:{code} "{long_url}" EX 86400
```

A single string key per code. At 500 bytes per entry and 95% hit rate, a 3-node Redis cluster (128 GB each) comfortably holds the working set.

7. Trade-offs #

Counter Service: Centralised vs. Distributed Range Allocation #

| Option | Pros | Cons | When to Use |
|---|---|---|---|
| Single Redis INCR | Simple, no coordination | SPOF; Redis goes down = no writes | Prototype, < 1 K writes/s |
| Batch range allocation (chosen) | No coordination on hot path; each node is autonomous per range | Small gap in counter sequence if a node crashes mid-range (harmless) | Production; > 1 K writes/s |
| Snowflake-style (timestamp + worker ID + sequence) | Fully decentralised; no shared state | Clock skew risk; requires worker ID assignment | Ultra-high scale; multi-region writes |

Conclusion: Batch range allocation balances simplicity and scalability. Gaps of up to 1000 codes on a node crash are invisible to users and don't affect correctness.

301 vs. 302 Redirect #

| Option | Pros | Cons | When to Use |
|---|---|---|---|
| 301 Permanent | Browser caches; zero repeat traffic to origin | Analytics blind on repeat visits; cannot revoke | Static content links where analytics don't matter |
| 302 Temporary (chosen) | Every click tracked; supports expiry and revocation | Slightly higher origin traffic | Any use-case needing analytics or TTL |

SQL vs. NoSQL for URL Store #

| Option | Pros | Cons |
|---|---|---|
| PostgreSQL | ACID, rich queries, familiar ops tooling | Vertical scaling limit; write-heavy workloads need sharding |
| Cassandra | Horizontal write scale; tunable consistency | No ad-hoc queries; eventual consistency by default |

Conclusion: Start with PostgreSQL + read replicas + Redis cache. Migrate redirect lookups to Cassandra only when writes exceed 50 K/s sustained.

CAP Trade-off #

The system leans AP on the redirect path (availability + partition tolerance). A Redis replica can serve slightly stale data — an expired URL might redirect for a few seconds after expiry. This is acceptable. The write path is CP: counter allocation and URL persistence are strongly consistent so no duplicate codes are ever issued.

8.
Failure Modes #

| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Redis cache | Node crash | Cache miss spike; DB read replicas overwhelmed | Redis Cluster (3 primaries, 3 replicas); circuit breaker on DB fan-out |
| Counter Service | Redis INCR unavailable | Write API cannot generate new codes | Fallback to UUID-based random code; alert on-call |
| DB primary | Crash | Writes fail; reads from replicas only | Automated failover via Patroni (PostgreSQL HA); RPO < 1 s with synchronous replica |
| Redirect Service pod | OOM / crash | Subset of requests 502 | k8s liveness probe + readiness probe; HPA scales out on latency |
| Thundering herd on viral URL | Cache stampede after TTL expiry | Thousands of requests hit DB simultaneously | Probabilistic early expiration (PER); Redis SET NX mutex per code during refresh |
| Analytics Kafka | Broker failure | Click events lost | min.insync.replicas=2; acks=all on producer; DLQ for failed events |
| CDN misconfiguration | Stale 302 cached past TTL | Users redirected to wrong/expired URL | Short max-age (60 s); purge API on URL update/expiry |

9. Security & Compliance #

Authentication & Authorisation: Anonymous shortening is permitted (rate-limited). Authenticated users (OAuth2 / JWT) can manage their own links. Admins can take down any link. RBAC: anonymous, user, admin.

Input Validation: Long URLs are validated against RFC 3986 before storage. Block known malicious domains via a real-time threat-intelligence feed (Google Safe Browsing API). Reject URLs with non-HTTP/HTTPS schemes to prevent javascript:, file:, and data: injection.

Rate Limiting: Anonymous shortening is rate-limited to 10 requests/hour per IP (token bucket in Redis). Authenticated users get 1000/hour. Prevents bulk abuse and link-spam campaigns.

Encryption: All data in transit via TLS 1.3. Long URLs at rest are stored in plaintext (they're already public) but the database volume is encrypted (AES-256).
User PII (email, IP) is hashed or pseudonymised per GDPR.\nAudit Log: Every create, update, and delete of a short code is written to an immutable append-only audit log (write to Kafka, consume into ClickHouse with no delete capability). Supports GDPR Right-to-Erasure: mark code as deleted and null out the long URL; the audit event retains the pseudonymised user ID.\nPII / GDPR: Click events store hashed IP (SHA-256 + rotating salt per 24 h) rather than raw IP. Referrer headers are stripped to the domain only. Geo is inferred from IP at collection time and the raw IP is discarded.\n10. Observability # RED Metrics (per service) # Metric Alert Threshold Redirect request rate (RPS) Baseline ± 30% — sudden drop = traffic black-hole Redirect error rate (4xx/5xx) \u0026gt; 0.1% sustained over 1 min Redirect p99 latency \u0026gt; 10 ms for \u0026gt; 2 min Cache hit rate (Redis) \u0026lt; 90% — signals cache eviction or miss storm Write error rate \u0026gt; 0.5% Saturation Metrics # Redis memory utilisation: alert at 75% — time to add a shard. DB replica replication lag: alert at \u0026gt; 5 s — reads may become stale. Counter range exhaustion rate: alert when a node requests a new range more than once per minute (means range size is too small). Business Metrics (ClickHouse dashboard) # Clicks per short code per hour (viral detection) Geographic distribution of clicks Top referrer domains DAU/MAU of shortening feature Tracing # Distributed traces via OpenTelemetry (OTLP → Jaeger / Tempo). Every redirect request carries a trace-id header. Sampling strategy: 1% baseline + 100% on error. Tail-based sampling in the collector keeps storage costs manageable.\n11. Scaling Path # Phase 1 — MVP (\u0026lt; 100 RPS) # Single PostgreSQL instance, no Redis, single Redirect Service pod. Deploy on a managed PaaS (Railway, Render, or a single EC2 instance). Total infrastructure: $50/month. 
Focus: correctness, not scale.\nPhase 2 — Growth (100 RPS → 10 K RPS) # Add Redis (ElastiCache, 1 primary + 1 replica). Add 3 read replicas to PostgreSQL (RDS Multi-AZ). Redirect Service scales horizontally behind an ALB. CDN in front (Cloudflare free tier). What breaks first: PostgreSQL primary on write storms — add connection pooling (PgBouncer).\nPhase 3 — Scale (10 K → 100 K RPS) # Redis Cluster (6 nodes: 3 primary + 3 replica). Write API uses batch counter ranges. Separate read and write DB roles. Add a CDN Purge API workflow for expiring URLs. Kafka for analytics decoupling. Add a geo-distributed cache (Redis at edge via Cloudflare Workers KV). What breaks first: Redis cluster hot-slot on viral codes — enable read from replicas (READONLY on replica nodes).\nPhase 4 — Hyper-scale (100 K → 1 M+ RPS) # Multi-region active-active. Cassandra replaces PostgreSQL for the redirect table (partition key: code). Counter generation moves to Snowflake-style local generation per region. CDN handles 80%+ of traffic. Redirect Service deployed in 20+ PoPs globally. Analytics becomes a separate service owned by a separate team. What breaks first: cross-region replication lag for newly created codes — accept eventual consistency with a 1–2 s replication window (most new codes are not shared immediately).\n12. Enterprise Considerations # Brownfield Integration: Enterprises often need to integrate a URL shortener into an existing marketing platform or CMS. The Write API should expose a REST and gRPC interface. The redirect domain should be white-label (custom domains like go.acme.com), requiring a wildcard TLS certificate and CNAME delegation — solved with Cloudflare\u0026rsquo;s SSL for SaaS product or cert-manager in k8s.\nBuild vs. Buy: Managed options (Bitly Enterprise, Rebrandly, short.io) cost $300–$2000/month for high-volume plans but remove operational burden. 
Build when: custom analytics integration, data sovereignty requirements, or \u0026gt; 1 B redirects/month (where managed pricing becomes punitive). Typical TCO for a self-hosted solution at 100 K RPS: ~$8 K/month cloud spend + 1 SRE FTE.\nMulti-Tenancy: SaaS teams need namespace isolation — each tenant gets a subdomain (tenant.sho.rt) and their codes are namespaced ({tenant_id}:{code}). The Redis key becomes url:{tenant_id}:{code}. DB partitioning by tenant_id prevents noisy-neighbour query storms.\nVendor Lock-In: Redis is the highest lock-in risk. Design the cache layer behind an interface (UrlCache) so you can swap Redis for Memcached, DynamoDB DAX, or an in-process Caffeine cache without changing the Redirect Service.\nConway\u0026rsquo;s Law: The system naturally splits into three teams: Platform (counter service, storage, DB), Product (shorten API, custom domains, expiry), and Data (analytics, ClickHouse, dashboards). Microservice boundaries should mirror these team boundaries to avoid cross-team coupling on deployments.\n13. Interview Tips # Start with clarifying questions: \u0026ldquo;Do we need analytics?\u0026rdquo; and \u0026ldquo;Is 7 characters fixed?\u0026rdquo; change the design significantly. Anchoring to requirements before drawing boxes shows seniority.\nLead with the read path: Interviewers expect you to notice the 200:1 read:write skew immediately. Open with \u0026ldquo;this is a read-heavy system — my primary concern is redirect latency, not write throughput\u0026rdquo; and you signal the right mental model.\nCommon mistake — hashing without collision handling: Candidates propose MD5 truncation and stop there. Always acknowledge the birthday problem and describe your retry or deduplication strategy.\nDeep-dive bait: The counter service is a rich rabbit hole. Know Snowflake IDs, Flickr-style ticket servers, and the batch-range pattern. 
Expect the interviewer to ask \u0026ldquo;what happens if the counter service node crashes mid-range?\u0026rdquo;\nVocabulary that signals fluency: \u0026ldquo;probabilistic early expiration\u0026rdquo;, \u0026ldquo;cache stampede\u0026rdquo;, \u0026ldquo;fan-out on write\u0026rdquo;, \u0026ldquo;Base62 keyspace\u0026rdquo;, \u0026ldquo;HTTP 301 vs 302 analytics trade-off\u0026rdquo;, \u0026ldquo;Anycast DNS for geo-routing\u0026rdquo;. Drop two or three naturally and don\u0026rsquo;t over-explain them.\n14. Further Reading # Designing Data-Intensive Applications — Martin Kleppmann, Chapters 5–6 (Replication \u0026amp; Partitioning) — the canonical primer on the distributed storage concepts underlying this system. Bitly Engineering Blog — \u0026ldquo;Building a reliable URL shortener\u0026rdquo; — real-world lessons on Redis cluster sharding and CDN cache invalidation at scale. RFC 3986 — Uniform Resource Identifier (URI): Generic Syntax — defines what a valid URL is; essential for input validation logic. Google Safe Browsing API documentation — for integrating real-time malicious URL detection into the write path. ","date":"18 April 2026","externalUrl":null,"permalink":"/system-design/classic/url-shortener/","section":"System designs - 100+","summary":"1. Hook # Every time you click a bit.ly or t.co link, a distributed system silently resolves a 7-character code to a full URL and redirects you — in under 10 milliseconds — before your browser even renders the loading spinner. 
Behind that invisible handshake sits a deceptively rich design problem: how do you build a service that creates billions of short codes, never loses a mapping, and serves hundreds of thousands of reads per second with single-digit millisecond latency, all while preventing abuse, surviving data-centre failures, and staying profitable?\n","title":"URL Shortener (bit.ly)","type":"system-design"},{"content":"","date":"18 April 2026","externalUrl":null,"permalink":"/tags/url-shortener/","section":"Tags","summary":"","title":"Url-Shortener","type":"tags"},{"content":"Microservices patterns are the vocabulary of distributed systems design. Knowing when to apply each one — and when not to — separates an architect who reads pattern books from one who\u0026rsquo;s shipped production systems.\nSaga Pattern # Problem: A business transaction spans multiple services, each with its own database. You can\u0026rsquo;t use a distributed ACID transaction.\nSolution: A saga is a sequence of local transactions. Each step publishes an event or triggers the next step. If a step fails, compensating transactions undo previous steps.\nChoreography-based saga: Services react to events — no central coordinator.\n1. OrderService: creates order → publishes OrderCreated 2. InventoryService: listens → reserves stock → publishes StockReserved 3. PaymentService: listens → charges card → publishes PaymentCompleted 4. OrderService: listens → confirms order Failure at step 3: 3. PaymentService: charge fails → publishes PaymentFailed 2. InventoryService: listens → releases reservation → publishes StockReleased 1. 
OrderService: listens → cancels order

Orchestration-based saga: A saga orchestrator (a service or workflow engine) explicitly coordinates each step.

```
SagaOrchestrator:
  step 1: call InventoryService.reserve()  → success
  step 2: call PaymentService.charge()     → fails
  step 3: call InventoryService.release()  (compensate)
  → return failure
```

When to use which:

Choreography: fewer services, loose coupling desired, simple failure paths
Orchestration: many services, complex failure compensation, need visibility into saga state

Real pitfalls:

Compensating transactions must be idempotent. The network might redeliver a compensation event.
Partial failures are hard to reason about. What if the compensation itself fails?
Visibility: Where is the saga in its lifecycle? Orchestration is much easier to observe.
Saga state must be persisted — if the orchestrator crashes mid-saga, it must be resumable.

Tooling: Temporal.io, AWS Step Functions, Axon Framework (Java), saga state machines in your DB.

Outbox Pattern #

Problem: Service A writes to its database AND publishes an event to Kafka. If the DB write succeeds but Kafka publish fails (or vice versa), you have inconsistency.

Solution: Write the event to an outbox table in the same database transaction as the business data. A separate relay process reads unprocessed outbox rows and publishes them.

```sql
BEGIN;
INSERT INTO orders (id, status) VALUES (123, 'PLACED');
INSERT INTO outbox (event_type, payload, processed)
VALUES ('ORDER_CREATED', '{"id": 123}', false);
COMMIT;
-- Both committed atomically, or neither committed

-- Separate process (or Debezium via CDC):
SELECT * FROM outbox WHERE processed = false ORDER BY created_at;
-- For each row: publish to Kafka, then mark processed = true
```

Key properties:

The business write and event publication are atomic.
At-least-once delivery — if the relay crashes after publishing but before marking processed, it publishes again.
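Because the relay gives at-least-once delivery, a consumer typically guards against redelivery by recording processed event ids. A minimal in-memory sketch (class and method names are hypothetical; a production consumer would persist the seen-id set in its own database, ideally in the same transaction as its side effects):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical idempotent wrapper: handles each event id at most once,
// even if the outbox relay redelivers it.
public final class IdempotentConsumer {
    private final Map<String, Boolean> processed = new ConcurrentHashMap<>();
    private final Consumer<String> handler;

    public IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns true if the event was handled, false if it was a duplicate. */
    public boolean onEvent(String eventId, String payload) {
        // putIfAbsent is atomic: only the first delivery of an id wins.
        if (processed.putIfAbsent(eventId, Boolean.TRUE) != null) return false;
        handler.accept(payload);
        return true;
    }
}
```

The dedupe key is the event id written to the outbox row, not the payload — two distinct events may legitimately carry identical payloads.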
Consumers must be idempotent. CDC (Debezium) reading the outbox table eliminates the polling relay process — Debezium reacts to the DB change immediately When to use: Any time you need to reliably publish events that correspond to database changes. Critical for event sourcing, notification systems, and service integration.\nCQRS (Command Query Responsibility Segregation) # Problem: The data model optimized for writes (normalized, transactional) is not optimal for reads (denormalized, pre-aggregated). Complex reporting queries are slow on the write model.\nSolution: Separate the write model (command side) from the read model (query side). They can use different data stores, different schemas, even different technologies.\nWrite side: Read side: Commands → Events from write side → OrderService → OrderReadModel (projected view) (Postgres) (Elasticsearch or separate Postgres table) Query: \u0026#34;All orders for user X with product details\u0026#34; → hits denormalized read model → fast, no joins CQRS doesn\u0026rsquo;t require event sourcing, though they\u0026rsquo;re often used together. CQRS just means: the model you write to is different from the model you read from.\nWhen to use:\nComplex domain with significantly different read and write patterns Read performance requirements can\u0026rsquo;t be met with the write model Multiple read representations needed (same data, different views for different consumers) Audit/history requirements (pair with event sourcing) The cost: Eventual consistency between write and read models. When you write, the read model is updated asynchronously — reads may see slightly stale data. Also: two models to maintain, synchronization logic to build and monitor.\nCQRS is not the default. Most CRUD applications don\u0026rsquo;t need it. Introduce it when the read/write impedance mismatch is causing real problems.\nEvent Sourcing # Problem: Traditional systems store current state. 
You lose history — \u0026ldquo;how did we get here?\u0026rdquo; can\u0026rsquo;t be answered.\nSolution: Store the sequence of events that led to the current state. Current state is derived by replaying events.\nEvents (the source of truth): 1. OrderCreated { id: 1, items: [...] } 2. ItemAdded { item: \u0026#34;SKU-999\u0026#34; } 3. Coupon Applied { code: \u0026#34;SAVE20\u0026#34; } 4. OrderPlaced { total: 80.00 } Current state (derived by replaying events 1–4): Order { id: 1, status: PLACED, total: 80.00, coupon: \u0026#34;SAVE20\u0026#34;, ... } What event sourcing gives you:\nComplete audit trail — not just current state, but every change and why Time travel — replay to any point in time Event replay for new consumers — add a new read model (analytics, cache) by replaying history Debugging — reproduce any production issue by replaying events Decoupling — consumers subscribe to events, not state changes The costs:\nComplexity. Querying current state requires event replay or maintaining snapshots. Simple \u0026ldquo;SELECT * FROM orders\u0026rdquo; doesn\u0026rsquo;t work. Snapshots needed for large event histories — replaying 100,000 events to get current state is slow. Snapshots checkpoint state at intervals. Schema evolution is hard. An event in the log from 3 years ago must still be interpretable today. Event upcasting required. Not for everything. Most services don\u0026rsquo;t need this. Use it for domains where history, auditability, and replayability are first-class requirements (financial ledgers, order management, healthcare records). API Gateway Pattern # Problem: Clients need to call multiple backend services. 
Logic for auth, rate limiting, routing, and request aggregation is duplicated across services.\nSolution: A single entry point that handles cross-cutting concerns and routes to backend services.\nResponsibilities:\nAuthentication and authorization (validate JWT, check scopes) Rate limiting per client/API key SSL termination Request routing and load balancing Response caching for GET requests Protocol translation (REST to gRPC) Request/response transformation Observability (access logs, metrics per endpoint) Tools: AWS API Gateway, Kong, Nginx, Envoy, Spring Cloud Gateway, Traefik.\nGotcha: Don\u0026rsquo;t put business logic in the API Gateway. It should be routing + cross-cutting concerns. If you\u0026rsquo;re writing conditional logic based on request body content in the gateway, that logic belongs in a service.\nBFF (Backend for Frontend) # Problem: A mobile app and a web app have different data needs. The web app needs rich data; the mobile app needs lightweight responses. Building one API that serves both leads to over-fetching on mobile or under-fetching on web.\nSolution: A dedicated backend service per frontend type — a BFF. Each BFF aggregates and shapes data from downstream services specifically for its frontend.\nMobile App → Mobile BFF → UserService, OrderService (aggregated, optimized for mobile) Web App → Web BFF → UserService, OrderService, RecommendationService (rich, desktop-optimized) The BFF is owned by the frontend team. They understand their data needs and can evolve their BFF independently. 
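A mobile BFF endpoint might look like the following hedged sketch (all service, method, and field names are hypothetical): it fans out to two downstream services and returns only the fields the mobile home screen needs.

```java
import java.util.Map;

// Hypothetical mobile BFF: aggregates downstream calls and returns a
// trimmed, mobile-friendly view. Downstream clients are stubbed as interfaces.
public final class MobileBff {
    interface UserService  { Map<String, String> user(String id); }
    interface OrderService { int openOrderCount(String userId); }

    private final UserService users;
    private final OrderService orders;

    public MobileBff(UserService users, OrderService orders) {
        this.users = users;
        this.orders = orders;
    }

    /** Small payload: just what the mobile home screen renders. */
    public Map<String, Object> homeScreen(String userId) {
        Map<String, String> u = users.user(userId);
        return Map.of(
            "name", u.getOrDefault("name", "?"),
            "openOrders", orders.openOrderCount(userId));
    }
}
```

A web BFF over the same downstream services would return a richer view — the point is that each frontend team shapes its own aggregate.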
The backend services remain stable.

When BFF makes sense:

Meaningfully different data requirements across client types
Mobile performance is critical (minimize payload, reduce round trips)
Frontend team velocity is blocked by backend team changes

When it's overkill:

The clients have nearly identical data needs
You lack the team budget to own N BFF services (each BFF is an additional service to maintain)

Strangler Fig Pattern #

Problem: You need to replace a legacy system (the "monolith") but can't do a big-bang rewrite.

Solution: Progressively route traffic for specific features from the old system to the new one. The old system is "strangled" as more functionality moves out.

```
Phase 1: All traffic → Monolith
Phase 2: User auth traffic → New Auth Service; rest → Monolith
Phase 3: Order creation → New Order Service; rest → Monolith
...
Phase N: Monolith retired
```

Implementation: A facade layer (proxy, API gateway, or feature-flag router) sits in front of both systems and routes based on the path, header, or user cohort.

Why it works: Each piece is a small, bounded migration. Each piece can be tested and validated independently. Rollback is as simple as flipping the router back. No big-bang cutover risk.

Sidecar / Service Mesh #

Problem: Cross-cutting concerns (service discovery, mTLS, retries, metrics) are implemented in every service, in every language. Changing the retry policy requires updating 50 services.

Solution: A sidecar proxy runs alongside each service container. The proxy intercepts all network traffic and handles cross-cutting concerns transparently.

```
[Service Pod]
├── App container (your code)
└── Envoy sidecar (handles mTLS, retries, circuit breaking, telemetry)
```

Service mesh (Istio, Linkerd): Orchestrates all sidecars with a control plane.
Policy changes propagate to all sidecars without application deployments.\nWhat services gain: mTLS, distributed tracing, circuit breaking, load balancing — all without a single line of application code.\nThe cost: Sidecar adds latency (~5ms per hop), memory (~50MB per pod), and operational complexity. Worth it at scale; may not be worth it for 3 services.\nBulkhead Pattern # Problem: A slow downstream dependency consumes all your threads or connections, starving other downstream calls.\nSolution: Isolate each dependency into its own resource pool (thread pool or connection pool). A slow dependency only affects its own pool.\nWithout bulkhead: All 200 threads shared → SlowService consumes all 200 → FastService gets none → everything fails With bulkhead: 50 threads for SlowService → 150 threads for FastService SlowService degrades → FastService unaffected In Java/Spring: Resilience4j @Bulkhead — configure semaphore or thread pool bulkhead per downstream service. Hystrix (deprecated) called these \u0026ldquo;thread pools.\u0026rdquo;\nCombined with circuit breaker: Bulkhead limits concurrent calls; circuit breaker stops calls when failure rate is high. Used together, they prevent a failing dependency from cascading.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/microservices-patterns/","section":"Posts","summary":"Microservices patterns are the vocabulary of distributed systems design. Knowing when to apply each one — and when not to — separates an architect who reads pattern books from one who’s shipped production systems.\n","title":"Microservices Patterns: Saga, CQRS, Event Sourcing, BFF, and More","type":"posts"},{"content":"EM interviews often end with \u0026ldquo;the harder framing\u0026rdquo; — questions about judgment, decision-making under pressure, and how you navigate disagreement. These don\u0026rsquo;t have right answers; they have reasoned answers that demonstrate how you think. 
Here\u0026rsquo;s a framework for the most common ones.\nBuild vs Buy # The question sounds simple; the answer has layers.\nThe framework:\nBuild when:\nThis is a core differentiator — it\u0026rsquo;s what your product does, and doing it better than a vendor is a competitive advantage The off-the-shelf solution is a poor fit (you\u0026rsquo;d spend more customizing than building) Data or security requirements make a third-party solution unacceptable (regulated industries, data residency) The vendor is a single point of failure for your core business Buy when:\nThis is undifferentiated infrastructure — logging, payments, email delivery, search, identity The vendor has years of reliability data you can\u0026rsquo;t replicate quickly The total cost of ownership (build + maintain + evolve) exceeds vendor cost It moves you faster to your actual differentiating work The hidden cost of build: Build has ongoing maintenance — every feature, every bug, every on-call incident, every security patch is yours. The \u0026ldquo;2 weeks to build\u0026rdquo; becomes \u0026ldquo;2 weeks to build + 2 years to maintain.\u0026rdquo;\nThe hidden cost of buy: Vendor lock-in, pricing changes, feature gaps that force workarounds, API changes that break your integration, vendor going out of business.\nThe EM answer: \u0026ldquo;My default is buy for commodity concerns — payments (Stripe), auth (Auth0/Cognito), observability (Datadog), email (SendGrid). Build when it\u0026rsquo;s genuinely core and when buy doesn\u0026rsquo;t meet the bar. The question I ask is: \u0026lsquo;Five years from now, do we want to be maintaining this or building the thing that\u0026rsquo;s actually our product?\u0026rsquo;\u0026rdquo;\nEvaluating a New Technology Proposal # A senior engineer wants to introduce a new technology. How do you evaluate it?\nThe questions to ask:\nWhat specific problem does it solve that we don\u0026rsquo;t already solve? 
If the answer is \u0026ldquo;it\u0026rsquo;s newer\u0026rdquo; or \u0026ldquo;more engineers are using it,\u0026rdquo; that\u0026rsquo;s not a problem definition — it\u0026rsquo;s trend-following.\nWhat\u0026rsquo;s the total cost of adoption? Migration of existing code, new expertise required, CI/CD changes, monitoring, on-call runbooks, licensing.\nWhat\u0026rsquo;s the blast radius if it doesn\u0026rsquo;t work? Can we roll it back? Is it isolated to one service or does it require system-wide changes?\nWho will own it? Every new technology needs an owner — someone who stays current, makes upgrade decisions, and is accountable when it breaks.\nWhat\u0026rsquo;s the reversibility? Technologies that are hard to remove (becomes the primary DB) deserve more scrutiny than ones that are easy to swap out.\nWhat\u0026rsquo;s the community and ecosystem trajectory? Betting on a declining technology is worse than using a \u0026ldquo;less cool\u0026rdquo; stable one.\nThe EM posture: Take proposals seriously — senior engineers are closest to the technical problems. But distinguish between solving a real problem and technical novelty. Run a time-boxed proof of concept with explicit success criteria before committing.\nTech Debt: Measuring, Prioritizing, and Selling It # What tech debt actually is: A deliberate or accidental decision to ship faster now at the cost of more work later. Not all tech debt is bad — some is intentional (MVP shortcuts to validate before investing). 
The problem is unintentional debt (code that was written fast and never cleaned up) and ignored debt (known issues never prioritized).\nMeasuring it: You can\u0026rsquo;t put an exact dollar figure on it, but you can measure proxies:\nCycle time for changes in the debt area (slow → high debt) Bug rate in the debt area (high → quality debt) Developer sentiment in retrospectives (\u0026ldquo;every sprint we fight the same fire\u0026rdquo;) Time spent on unplanned work Prioritizing it: Not all debt needs to be paid. Pay down debt that:\nIs in the critical path — touched every sprint, high blast radius when it fails Slows delivery measurably — engineers say \u0026ldquo;this would be easy if not for X\u0026rdquo; Has reliability implications — known instability, poor error handling, missing monitoring Is security debt — vulnerabilities that have been deferred Don\u0026rsquo;t pay down debt that:\nIs in rarely-touched code (stable legacy that works) Costs more to fix than to tolerate Will be replaced by a planned initiative anyway Selling it to the business:\nTranslate to business impact: \u0026ldquo;This component slows every feature by 2 sprints. In 6 months, we\u0026rsquo;ll ship 3 fewer features per quarter than we could. Fixing it takes 4 weeks and unlocks this pace permanently.\u0026rdquo; Don\u0026rsquo;t say \u0026ldquo;it\u0026rsquo;s the right thing to do.\u0026rdquo; Say \u0026ldquo;here\u0026rsquo;s what it\u0026rsquo;s costing us and here\u0026rsquo;s what we get back.\u0026rdquo; Propose a cadence: 20% of each sprint for reliability/debt, rather than a \u0026ldquo;debt sprint\u0026rdquo; that the business sees as a sprint with no value. Velocity vs Quality: The Tension # The business is pushing hard and wants features faster. You\u0026rsquo;re concerned about quality. How do you navigate?\nThe honest framing: Velocity and quality are in tension in the short term, but they\u0026rsquo;re correlated in the long term. Technical debt compounds. 
A team that ships 20% more features this quarter by cutting corners may ship 40% fewer features next quarter because of the bugs and slowdowns those corners created.\nThe data argument: \u0026ldquo;Our test coverage has dropped from 75% to 50% in the last quarter. Our production incident rate has tripled. Here\u0026rsquo;s the trend. If we continue at this pace, we\u0026rsquo;ll spend more time fighting fires than shipping features in 6 months.\u0026rdquo;\nThe practical negotiation:\nAgree on explicit quality gates — a feature is done when it has tests, monitoring, and a runbook. Non-negotiable. Make technical health a quarterly OKR, not just velocity. Push back on scope, not quality — \u0026ldquo;We can do features X and Y at quality, or X, Y, Z at lower quality. I recommend X and Y.\u0026rdquo; Team Disagreements: How to Resolve Without Losing the Dissenting Side # When the team is split between two technical approaches:\n1. Surface the actual disagreement. Often teams think they\u0026rsquo;re disagreeing about the solution when they\u0026rsquo;re actually disagreeing about the problem, the constraints, or the criteria for success. Get these explicit.\n2. Define decision criteria together. \u0026ldquo;We should choose the option that minimizes time-to-market, fits our team\u0026rsquo;s expertise, and is reversible within 6 months.\u0026rdquo; Now evaluate both options against the criteria.\n3. Time-box the discussion. Endless debate is worse than a suboptimal decision. \u0026ldquo;We\u0026rsquo;ll discuss this for one more meeting, then decide.\u0026rdquo;\n4. Make it reversible if possible. Start with the lower-stakes option. If it fails, course-correct. Avoid commitments that lock you in.\n5. Separate the decision from the person. \u0026ldquo;Your proposal lost\u0026rdquo; feels personal. \u0026ldquo;We chose option B and here\u0026rsquo;s why\u0026rdquo; is professional. Acknowledge the merits of the losing option explicitly.\n6. 
Give the dissenting side ownership. \u0026ldquo;You raised the strongest concerns about option B. I\u0026rsquo;d like you to own the monitoring strategy so we catch the failure mode you\u0026rsquo;re worried about early.\u0026rdquo; Converts skeptics into invested participants.\nRewrite vs Refactor vs Leave Alone # The most fraught decision in software. The rule of thumb attributed to Joel Spolsky: \u0026ldquo;Never rewrite from scratch. It\u0026rsquo;s the single worst mistake a software company can make.\u0026rdquo;\nWhy rewrites fail:\nThe existing system has encoded years of business rules, edge cases, and bug fixes that aren\u0026rsquo;t documented. The rewrite loses them. Rewrites take 2-3x longer than estimated. The business expects \u0026ldquo;6 months\u0026rdquo; and gets \u0026ldquo;18 months.\u0026rdquo; By the time the rewrite is done, requirements have changed. The rewrite team writes code that will eventually become the legacy system the next team wants to rewrite. When rewrite is legitimate:\nThe technology stack is genuinely end-of-life and unsupportable The architecture is fundamentally incompatible with current requirements (can\u0026rsquo;t add features without breaking everything) The cost of maintaining the existing system exceeds the cost of replacement You\u0026rsquo;re doing a Strangler Fig (incremental rewrite, not big bang) Strangler Fig pattern: Route traffic for individual features to the new system progressively. The old system shrinks; the new system grows. No big bang cutover. 
Much safer than \u0026ldquo;we go live on day X.\u0026rdquo;\nRefactor when:\nSpecific modules are painful and well-understood The overall architecture is sound but the implementation is messy You can refactor incrementally with tests as safety net Leave alone when:\nThe code works, nobody touches it, and the risk of introducing bugs exceeds the aesthetic cost of messy code \u0026ldquo;If it ain\u0026rsquo;t broke\u0026rdquo; is a valid engineering principle for stable code The Wrong Technical Decision Retrospective # \u0026ldquo;Tell me about a technical decision you made that turned out to be wrong. What did you learn?\u0026rdquo;\nWhat interviewers are looking for:\nSelf-awareness and intellectual honesty A structured understanding of why it was wrong (not just \u0026ldquo;it didn\u0026rsquo;t work\u0026rdquo;) What you changed in your decision-making process afterward That you don\u0026rsquo;t repeat the same class of mistake Framework for the answer:\nThe context and the decision you made What signals you had that it might be wrong (admit you had some) Why you made it anyway (time pressure, confidence bias, missing information) What happened (how did it fail, what was the impact) What you\u0026rsquo;d do differently (specific, not \u0026ldquo;I\u0026rsquo;d be more careful\u0026rdquo;) What process change or heuristic you now apply The worst answer: \u0026ldquo;I don\u0026rsquo;t make wrong technical decisions.\u0026rdquo; The second worst: \u0026ldquo;We moved fast and broke things, that\u0026rsquo;s how you learn.\u0026rdquo; The best answer demonstrates genuine reflection and a specific change in behavior.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/engineering-leadership-tradeoffs/","section":"Posts","summary":"EM interviews often end with “the harder framing” — questions about judgment, decision-making under pressure, and how you navigate disagreement. 
These don’t have right answers; they have reasoned answers that demonstrate how you think. Here’s a framework for the most common ones.\n","title":"Engineering Leadership Trade-offs: Build vs Buy, Tech Debt, and Rewrite vs Refactor","type":"posts"},{"content":"As systems grow, the gap between operational data (what your application uses to run) and analytical data (what your business uses to make decisions) becomes significant. Understanding how to design data pipelines that bridge this gap is an EM-level concern.\nOLTP vs OLAP: Fundamentally Different Read Patterns # OLTP (Online Transaction Processing):\nHandles operational workload — your application\u0026rsquo;s reads and writes Optimized for: fast, low-latency reads and writes on individual rows or small sets Schema design: normalized (3NF) to minimize write anomalies Example queries: \u0026ldquo;Get user #12345\u0026rdquo;, \u0026ldquo;Insert new order\u0026rdquo;, \u0026ldquo;Update inventory for SKU ABC\u0026rdquo; Database: PostgreSQL, MySQL, DynamoDB OLAP (Online Analytical Processing):\nHandles analytical workload — reporting, BI dashboards, data science Optimized for: fast reads across large datasets (millions/billions of rows), aggregations, GROUP BY, JOINs across large tables Schema design: denormalized (star schema, wide tables) to minimize JOIN cost at query time Example queries: \u0026ldquo;Revenue by country by week for the last 2 years\u0026rdquo;, \u0026ldquo;Cohort retention analysis\u0026rdquo;, \u0026ldquo;Funnel conversion rates\u0026rdquo; Database: BigQuery, Snowflake, Redshift, Databricks, ClickHouse Why they don\u0026rsquo;t mix: A complex analytics query (SELECT country, SUM(revenue) FROM orders JOIN users ... GROUP BY country) running on your OLTP database will hold locks, saturate I/O, and compete with your transactional workload. 
Running analytical queries on your production DB is a common early-stage pattern that breaks as the system scales.\nWhen You Need a Data Warehouse vs Querying Production Replicas # Production read replica — acceptable when:\nTeam is small, data volume is manageable (\u0026lt; tens of millions of rows) Analytical queries are infrequent and run off-hours The replica runs on a separate instance from the primary (doesn\u0026rsquo;t affect production reads) Query complexity is moderate — no multi-minute scans Data warehouse needed when:\nAnalytical queries take minutes and are run frequently (by multiple analysts/BI tools) You need to join data from multiple systems (orders from Postgres + events from Kafka + CRM from Salesforce) Historical data exceeds what fits efficiently in the OLTP database You need isolation — analytics should never touch production infrastructure Data must be transformed before use (cleansing, enrichment, aggregation) The data warehouse as a separate system: Data is extracted from operational systems, transformed, and loaded (ETL) or loaded then transformed (ELT). The warehouse has its own schema optimized for analytics. Analysts and BI tools query the warehouse, never production.\nBatch vs Streaming: The Decision # Batch processing: Process a large dataset in bulk, on a schedule. ETL jobs that run nightly, weekly aggregations, end-of-day reports.\nWhen batch is right:\nThe business insight doesn\u0026rsquo;t require real-time freshness (daily reports, weekly metrics) Processing is too expensive to run continuously (complex ML feature computation) The data volume is too large to process incrementally without windowing Idempotent: easy to re-run if it fails Tools: Spark, Flink (batch mode), dbt (SQL transforms), Airflow/Prefect for orchestration.\nStreaming processing: Process events as they arrive. 
A Kafka consumer reads events, applies logic, outputs results — latency measured in seconds, not hours.\nWhen streaming is right:\nReal-time dashboards (fraud alerts, system monitoring, live metrics) Event-driven business logic that must react quickly (inventory reservation, fraud detection, real-time recommendations) Continuous aggregations (rolling window metrics: \u0026ldquo;orders in the last 5 minutes\u0026rdquo;) Notification/alerting systems Tools: Apache Flink, Kafka Streams, Spark Structured Streaming, Apache Samza.\nThe streaming complexity cost: Exactly-once semantics, stateful stream processing, out-of-order event handling, watermarking for late events, checkpoint/state management — streaming requires expertise that batch doesn\u0026rsquo;t. Don\u0026rsquo;t use streaming \u0026ldquo;because it\u0026rsquo;s modern.\u0026rdquo; Use it when freshness requirements genuinely justify the complexity.\nLambda architecture (batch + streaming): Run both a batch layer (high accuracy, complete historical data) and a speed layer (real-time approximation). Results are merged. The goal: accuracy of batch, freshness of streaming. The cost: you maintain two systems, two code paths. Kappa architecture (streaming only) reduces this by making streaming the sole layer, reprocessing historical data via replay.\nChange Data Capture (CDC) # CDC captures the changes in a database (INSERT, UPDATE, DELETE) and publishes them as a stream of events. 
Instead of polling the database for changes, you receive them in real-time via the transaction log.\nHow it works (Postgres example):\nPostgres writes every change to its Write-Ahead Log (WAL) Debezium (the most popular CDC tool) reads the WAL via a replication slot Changes are published as events to Kafka Consumers read from Kafka and react to the changes Postgres transaction → WAL → Debezium → Kafka Topic → Consumer INSERT into orders → → → {\u0026#34;op\u0026#34;:\u0026#34;c\u0026#34;, \u0026#34;after\u0026#34;: {\u0026#34;id\u0026#34;:1, \u0026#34;status\u0026#34;:\u0026#34;PLACED\u0026#34;}} Why CDC instead of dual-write (writing to both DB and Kafka)? Dual-write has a race condition — the DB write and the Kafka publish are not atomic. If the app crashes between them, you get inconsistency. CDC derives the event from the committed DB change — it only fires after the transaction commits. Guaranteed consistency.\nCDC use cases:\nEvent-driven microservices: Service B reacts to changes in Service A\u0026rsquo;s database without Service A sending explicit events. Reduces coupling. Data replication: Sync data from Postgres to Elasticsearch for search, Redis for cache, BigQuery for analytics — all via CDC pipeline. Audit trail: Every change to important entities captured without modifying application code. Cache invalidation: When a DB row changes, publish an event → cache consumer invalidates or updates the cache entry. Solves the dual-write invalidation problem. Operational considerations:\nReplication slots have backlog risk — if Debezium is down, WAL accumulates behind the replication slot, because Postgres must retain WAL until the slot\u0026rsquo;s consumer catches up. Large backlogs can fill the disk. Schema evolution — when you add a column to Postgres, the CDC schema must adapt. Avro schema registry handles this well. Ordering guarantees — within a partition, events are ordered. Across partitions, they\u0026rsquo;re not. Design consumers to tolerate out-of-order events across different entities. 
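The cache-invalidation use case can be sketched in a few lines. This is a hedged illustration, not Debezium\u0026rsquo;s actual consumer API: the `ChangeEvent` record is a simplified stand-in for Debezium\u0026rsquo;s event envelope (which also carries schema, source metadata, and before/after row images), the op codes `c`/`u`/`d` follow Debezium\u0026rsquo;s create/update/delete convention, and a plain map stands in for a real cache such as Redis.

```java
import java.util.HashMap;
import java.util.Map;

public class CdcCacheInvalidator {
    // Simplified change event; Debezium's real envelope also carries schema,
    // source metadata, before/after row images, and timestamps.
    public record ChangeEvent(String op, String table, String key) {}

    private final Map<String, String> cache = new HashMap<>();

    public void cachePut(String table, String key, String value) {
        cache.put(table + ":" + key, value);
    }

    public boolean isCached(String table, String key) {
        return cache.containsKey(table + ":" + key);
    }

    // React to a committed change: drop the stale entry and let the next read
    // repopulate it. Because the event fires only after the DB transaction
    // commits, there is no dual-write race.
    public void onEvent(ChangeEvent e) {
        switch (e.op()) {
            case "u", "d" -> cache.remove(e.table() + ":" + e.key());
            case "c" -> { /* new row — nothing cached yet */ }
            default -> throw new IllegalArgumentException("unknown op: " + e.op());
        }
    }

    public static void main(String[] args) {
        CdcCacheInvalidator inv = new CdcCacheInvalidator();
        inv.cachePut("orders", "1", "{\"status\":\"PLACED\"}");
        inv.onEvent(new ChangeEvent("u", "orders", "1"));
        System.out.println(inv.isCached("orders", "1")); // prints false
    }
}
```

In a real pipeline this handler would be a consumer on the Kafka topic Debezium writes to; the point of the sketch is only the dispatch-on-committed-change shape.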
Alternatives to Debezium: AWS DMS (for RDS to Kafka/Kinesis), Google Datastream (GCP), Striim.\nThe Modern Data Stack # For context, the modern data engineering stack looks like:\nOperational DBs (Postgres, MySQL, DynamoDB) ↓ CDC (Debezium) or batch extract (Airbyte, Fivetran) Kafka / Event Stream ↓ Data Warehouse (BigQuery, Snowflake, Redshift) ↓ Transform (dbt — SQL-based transformations) BI Layer (Looker, Metabase, Mode) ↓ Dashboards / Reports EM-level framing: When a product manager asks \u0026ldquo;why don\u0026rsquo;t we have this analytics report?\u0026rdquo; the answer often involves one of these layers. Was the data never captured? Is it in the OLTP DB but not the warehouse? Is it in the warehouse but not transformed? Is it transformed but not surfaced in the BI tool? Understanding the stack helps you diagnose data availability problems and have informed conversations with data engineering teams.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/data-pipeline-analytics/","section":"Posts","summary":"As systems grow, the gap between operational data (what your application uses to run) and analytical data (what your business uses to make decisions) becomes significant. Understanding how to design data pipelines that bridge this gap is an EM-level concern.\n","title":"Data Pipeline and Analytics: OLTP vs OLAP, Batch vs Streaming, CDC","type":"posts"},{"content":"Testing strategy is an EM-level concern because it directly affects delivery velocity, production reliability, and onboarding speed. Too little testing = production incidents. Too much ceremony = slow CI and frustrated engineers. The goal is the right tests in the right places.\nThe Test Pyramid for Microservices # The classic test pyramid has unit tests at the base, integration tests in the middle, and end-to-end tests at the top. 
In microservices, the pyramid shifts slightly because the \u0026ldquo;integration\u0026rdquo; layer is where most of the real risk lives.\nTop — E2E: few, slow, high confidence for critical paths\nMiddle — Service integration: test one service with real dependencies\nBase — Unit tests: many, fast, test logic in isolation\nUnit Tests (Base) # Test a single class or function in isolation. Fast (milliseconds), no I/O, no database, no HTTP.\nWhat belongs in unit tests:\nPure business logic — validation rules, calculations, transformations Complex conditional logic — branch coverage Edge cases and error paths Utility functions What doesn\u0026rsquo;t belong in unit tests:\n\u0026ldquo;Does this Spring bean wire correctly?\u0026rdquo; — that\u0026rsquo;s not a unit test, it\u0026rsquo;s integration \u0026ldquo;Does this SQL query return the right rows?\u0026rdquo; — needs a real database Anything that requires mocking more than 2 collaborators — usually a design smell Mocking: Use sparingly. Heavy mocking creates tests that are coupled to implementation rather than behavior. If you\u0026rsquo;re mocking 5 dependencies to test a single method, the method probably does too much.\nIntegration Tests (Middle) # Test a component with its real dependencies. In microservices, this typically means testing one service with a real database, real cache, but mocked or stubbed external services.\nTools:\nTestcontainers: Spin up real Postgres, Redis, Kafka in Docker for tests. Tests run against real infrastructure, same version as production. Eliminates the \u0026ldquo;it works locally but not in prod\u0026rdquo; class of bugs. Spring Boot Slice Tests: @DataJpaTest spins up only JPA components + in-memory DB. @WebMvcTest tests controllers without the full context. Faster than @SpringBootTest. @SpringBootTest with Testcontainers: Full integration test — the whole application + real DB/cache. 
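To make the \u0026ldquo;pure business logic belongs in unit tests\u0026rdquo; point concrete: a validation rule needs no Spring context, container, or mock. The `OrderValidator` below is hypothetical (its names, limits, and error messages are illustrative, not from the post) — the point is that a plain function\u0026rsquo;s edge cases can be enumerated in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator: pure logic, no I/O, no framework — the kind of
// code that belongs at the base of the test pyramid.
public class OrderValidator {
    static final int MAX_QUANTITY = 1000; // illustrative per-order limit

    public static List<String> validate(int quantity, long unitPriceCents) {
        List<String> errors = new ArrayList<>();
        if (quantity <= 0) errors.add("quantity must be positive");
        if (quantity > MAX_QUANTITY) errors.add("quantity exceeds per-order limit");
        if (unitPriceCents < 0) errors.add("unit price must not be negative");
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate(5, 999)); // prints []
        System.out.println(validate(0, -1));  // both errors reported
    }
}
```

Because the function is deterministic and free of collaborators, its unit test asserts behavior directly — no mock verifies that a method was merely called.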
When integration tests are worth more than unit tests:\nRepository/DAO layer — the actual SQL query behavior is what matters, not the Java code Database migrations — does the migration run without errors? Does the ORM still work after it? Configuration — does the Spring context load correctly with the production config? Request/response mapping — does the HTTP layer serialize/deserialize correctly? End-to-End Tests (Top) # Test complete user workflows across multiple services. Simulate a real user: create order → process payment → send confirmation.\nThe cost: Slow (minutes), flaky (dependent on all services being up), expensive to maintain (any service change may break unrelated E2E tests).\nWhen to use: Critical user journeys only. Checkout flow. Login/auth. Core CRUD for your primary entity. Not for every feature.\nAlternative: Component tests — test one service from its HTTP boundary with all dependencies (Testcontainers), treating it as a black box. This gives high confidence without cross-service fragility.\nIntegration vs Unit Tests: When Integration Tests Win # The temptation to mock everything results in a large unit test suite that passes while production is on fire. Integration tests catch the issues unit tests miss:\nORM mapping issues — your Java entity doesn\u0026rsquo;t match the DB schema SQL query correctness — the query you wrote doesn\u0026rsquo;t return what you think Transaction boundaries — two operations that should be atomic aren\u0026rsquo;t Serialization/deserialization — JSON fields don\u0026rsquo;t map correctly Database migration behavior — migration runs in prod but your unit tests use an in-memory H2 DB Connection pool exhaustion — tests that don\u0026rsquo;t clean up connections cause mysterious failures Rule of thumb: Anything that talks to a database should have an integration test, not a unit test with a mocked repository. 
The repository mock tests that you called save() — the integration test tests that the data was actually saved correctly.\nContract Testing with Pact # In a microservices system, service A calls service B\u0026rsquo;s API. When service B\u0026rsquo;s team changes the API, service A breaks. How do you catch this before it reaches production?\nConsumer-driven contract testing (Pact):\nConsumer writes a contract. Service A defines what it uses from service B\u0026rsquo;s API: the endpoint, request format, response fields it cares about. Contract is published to a Pact Broker (or Pactflow). Provider (service B) validates the contract. Service B\u0026rsquo;s CI runs the consumer\u0026rsquo;s contract against the real service. If the contract passes, service B can deploy safely. If it breaks, CI fails. Service A team writes: \u0026#34;I call GET /orders/{id} and expect { id, status, total }\u0026#34; Service B CI runs: \u0026#34;Does GET /orders/{id} still return { id, status, total }? YES → ok to deploy\u0026#34; \u0026#34;Did we rename \u0026#39;total\u0026#39; to \u0026#39;amount\u0026#39;? NO → contract broken → CI fails\u0026#34; When Pact is worth introducing:\nMultiple teams where the consumer and provider teams are different APIs change frequently and cross-team coordination is a bottleneck You can\u0026rsquo;t easily run all services together for integration tests When Pact is overkill:\nSmall team where you own all services — coordinate the change directly The API is very stable — overhead of maintaining contracts exceeds the bug-catching value You already have reliable E2E tests covering the integrations The EM conversation: \u0026ldquo;Pact is valuable when \u0026lsquo;did I break someone?\u0026rsquo; is a real question. If the answer is always \u0026lsquo;ask the team in Slack,\u0026rsquo; Pact adds process that manual coordination can handle. At scale, it replaces manual coordination.\u0026rdquo;\nCoverage: How Much Is Enough? 
# The honest answer: 100% code coverage mandates are often counterproductive. Coverage measures lines executed, not behavior validated. You can have 100% coverage with tests that assert nothing meaningful.\nWhat coverage does tell you:\nAreas of the codebase with zero tests — genuine risk Paths that are never executed in tests — good candidates for review What coverage doesn\u0026rsquo;t tell you:\nWhether the tests are testing the right behavior Whether the tested behavior is correct Whether edge cases are handled The pragmatic threshold:\nNew code should have tests for its intended behavior and error paths A coverage drop on a PR should trigger a review, not a hard failure Business-critical paths (checkout, payments, auth) should have higher coverage than admin utilities Legacy code: don\u0026rsquo;t mandate coverage; add tests when you touch a file (Boy Scout Rule) Pushing back on \u0026ldquo;100% coverage\u0026rdquo; mandates: \u0026ldquo;Coverage is a proxy metric, not a goal. We should be asking \u0026lsquo;are the critical behaviors tested?\u0026rsquo; not \u0026lsquo;is every line executed?\u0026rsquo; I\u0026rsquo;d rather have 70% coverage with tests that actually validate correctness than 100% coverage with tests that check implementation details.\u0026rdquo;\nTesting Distributed Systems # Testing a distributed system is qualitatively harder than testing a monolith. The failure modes you need to test don\u0026rsquo;t show up in unit tests: network partitions, timeouts, duplicate messages, out-of-order events.\nTestcontainers for realistic integration: Real Kafka, real Postgres, real Redis. Tests reflect what actually runs in production, not in-memory mocks that behave differently.\nChaos testing: Randomly inject failures in a controlled environment — kill a pod, add latency, drop network packets. Chaos Monkey, Chaos Mesh, AWS Fault Injection Simulator. The goal: discover failure modes before users do. 
Run in pre-prod, not in prod (until you\u0026rsquo;re mature).\nContract tests for service boundaries: Pact for API contracts. Reduces E2E test dependency.\nConsumer-side stub servers: WireMock or MockServer — run a stub that returns pre-recorded responses from the real service. Useful for testing a consumer in isolation without the real service.\nThe hardest thing to test: \u0026ldquo;What happens when message X arrives twice?\u0026rdquo; \u0026ldquo;What happens when the DB is down for 30 seconds mid-operation?\u0026rdquo; These scenarios require intentional fault injection in tests.\nThe EM stance on test investment: The most valuable tests are the ones that catch bugs before production and run fast enough to not be skipped. A 30-minute CI pipeline that fails intermittently 20% of the time is worse than a 5-minute pipeline with 80% coverage that everyone trusts. Invest in test stability and speed before coverage percentage.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/testing-strategy/","section":"Posts","summary":"Testing strategy is an EM-level concern because it directly affects delivery velocity, production reliability, and onboarding speed. Too little testing = production incidents. Too much ceremony = slow CI and frustrated engineers. The goal is the right tests in the right places.\n","title":"Testing Strategy: Test Pyramid, Contract Testing, and Coverage Pragmatics","type":"posts"},{"content":"How you deploy code is as important as how you write it. The gap between writing a feature and it running in production reliably is where most engineering organizations lose velocity. This post covers the decisions that shape that gap.\nTrunk-Based Development vs GitFlow # GitFlow # Long-lived branches: main, develop, feature branches, release branches, hotfix branches. 
Features are developed on branches, merged to develop, periodically merged to release branches, then to main.\nGitFlow was designed for versioned software releases — desktop applications, mobile apps with app store releases, libraries with semantic versioning. The release branch model makes sense when you control when customers get updates.\nGitFlow is wrong for continuously deployed web services. Long-lived feature branches create integration debt. The further a branch diverges from main, the more painful the merge. Release branches add ceremony without adding value when you deploy continuously.\nTrunk-Based Development (TBD) # All engineers work on short-lived branches (\u0026lt; 1 day ideally, max 2 days) and merge to main frequently. Main is always deployable. CI runs on every merge. Deploy from main.\nWhy TBD works:\nContinuous integration — conflicts surfaced when they\u0026rsquo;re small, not after 2 weeks of divergence Always-releasable main branch — deployment is an operational decision, not a coordination event Forces small, incremental changes that are easier to review, test, and roll back Matches the Git design intention — frequent small merges, not large infrequent ones The prerequisite: Strong CI. Every merge to main must pass tests automatically. If CI is slow or unreliable, engineers avoid merging frequently — which defeats TBD.\nFeature flags enable TBD at scale: Incomplete features are merged to main behind a flag. The code ships but is invisible to users until the flag is enabled.\nThe EM stance: For web services with continuous deployment, trunk-based development is the right default. GitFlow is appropriate for versioned software. Enforce short-lived feature branches by policy (auto-delete merged branches, flag any branch \u0026gt; 3 days old).\nBlue-Green vs Canary vs Rolling Deployments # Rolling Deployment # Gradually replace old instances with new ones. 
At any moment, some instances run the old version and some run the new.\nStart: [v1, v1, v1, v1] Step 1: [v2, v1, v1, v1] Step 2: [v2, v2, v1, v1] Step 3: [v2, v2, v2, v1] Done: [v2, v2, v2, v2] Advantages: No extra infrastructure cost (no idle environment). Simple in Kubernetes (default strategy).\nDisadvantages: Old and new versions run simultaneously — any API contract changes must be backwards compatible. Rollback requires rolling back all instances (takes time). Not suitable for migrations that break old code.\nBlue-Green Deployment # Two identical environments: blue (current) and green (new). Switch traffic from blue to green atomically via load balancer update.\nBefore: traffic → Blue (v1) Deploy: Green (v2) warmed up, tested Switch: traffic → Green (v2) Blue: stands by for instant rollback Advantages: Instant rollback (flip back to blue). No version mixing — all traffic goes to one version at a time. The green environment can be smoke-tested before cutover.\nDisadvantages: Double infrastructure cost during deployment. Database migrations must be compatible with both blue and green simultaneously (if blue is in standby, rollback means old code runs against the new schema).\nCanary Deployment # Send a small percentage of traffic to the new version, gradually increase if metrics are good.\nStart: 100% v1 Canary: 95% v1, 5% v2 → observe metrics Expand: 75% v1, 25% v2 → observe Continue: 0% v1, 100% v2 Advantages: Real production traffic validates the new version. Failure impact is limited to the canary percentage. Automatic rollback when error rate exceeds threshold.\nDisadvantages: Complex to implement (requires traffic splitting at ingress/load balancer level, or feature flags). Observability needed to compare v1 vs v2 metrics side by side. 
Not suitable for high-blast-radius changes.\nTools: Argo Rollouts (Kubernetes), Flagger, AWS CodeDeploy canary, LaunchDarkly.\nWhen to Use Which #\nLow-risk changes, simple rollback acceptable → Rolling\nHigh-risk change, need instant rollback → Blue-green\nGradual confidence building in prod → Canary\nDB schema change, backwards-compat required → Rolling + expand compatibility first\nFull replacement with smoke testing → Blue-green\nFeature Flags vs Branch-Based Releases # Feature flags decouple code deployment from feature activation. The code is deployed to production but the feature is inactive until the flag is enabled.\nFeature flags solve:\nTrunk-based development for incomplete features A/B testing (enable for 50% of users) Targeted rollout (enable for internal users first, then by country, then globally) Kill switch — instantly disable a misbehaving feature without deployment Separation of deployment (engineering event) from release (business event) The trade-off: Flags accumulate. A codebase with 200 stale feature flags is hard to reason about. Establish a lifecycle: every flag has an owner and a removal date. Launch flags should be short-lived (days to weeks); kill switches and operational toggles are legitimately long-lived.\nFeature flag services: LaunchDarkly, Optimizely, AWS AppConfig, Unleash (open source), or a simple database/config table for basic use cases.\nZero-Downtime Database Migrations # Database migrations are the hardest part of zero-downtime deployments. The standard approach is the expand-contract pattern (also called \u0026ldquo;parallel change\u0026rdquo;).\nThe problem: If you rename a column, the new code needs the new name, the old code needs the old name. 
During a rolling deployment, both versions run simultaneously — both must work against the same DB.\nThe Expand-Contract Pattern # Phase 1: Expand (backwards-compatible addition)\nAdd the new column (nullable, with default) Start writing to both old and new columns Deploy — old code reads old column, new code reads new column Both coexist; the database has both columns Phase 2: Migrate data\nBackfill the new column from the old column (use batched migration, not a single UPDATE that locks the table) Verify data integrity Phase 3: Contract (removal)\nDeploy code that only uses the new column Once all old-code instances are gone (rolling deployment complete), drop the old column in a separate migration Total time: 2–3 deployments over days/weeks. Slower than a simple ALTER TABLE RENAME COLUMN, but zero downtime and instant rollback at every step.\nAdditive-Only Schema Changes (Safe in Rolling Deploys) # Adding a nullable column Adding a new table Adding an index (CONCURRENTLY in Postgres — no table lock) Adding a new enum value (be careful — some ORMs break on unknown enum values) Dangerous Schema Changes (Require Expand-Contract or Maintenance Window) # Renaming a column or table Removing a column (old code still references it) Changing a column type Making a nullable column NOT NULL (without a default or backfill) Tooling # Flyway / Liquibase: Version-controlled migration scripts. Run as part of deployment. Good for most teams — migrations are in source control alongside the code.\nBest practice: Never run migrations as part of application startup. Run them as a separate init container or pre-deployment step. Application startup should be fast and deterministic; migrations can be slow and irreversible.\nMonorepo vs Polyrepo # Monorepo # All services in one repository. Google, Meta, and Twitter (X) use large monorepos.\nAdvantages:\nAtomic cross-service changes. Change the API contract and update all consumers in one commit. Unified tooling, standards, and dependency management. 
One version of a library used everywhere. Easier code sharing and refactoring across service boundaries. Simpler discovery — one place to search all code. Single CI/CD pipeline (with build graph optimization — only build changed services). Disadvantages:\nScale challenges. Naive monorepo tooling (running all tests on every commit) breaks at scale. Need Bazel, Nx, Turborepo, or similar build graph tools. Clone and IDE performance. A 10GB repository is slow to clone and index. Access control is harder. Restricting who can modify what requires CODEOWNERS or custom checks. Polyrepo # Each service in its own repository.\nAdvantages:\nSimpler per-service tooling and CI. Clear ownership boundaries (repo = team). No build system complexity for incremental builds. Disadvantages:\nCross-service changes require PRs across multiple repos — coordination overhead. Dependency management is hard — keeping library versions consistent across repos. Code discovery is harder — where does this function live? Duplication of boilerplate and configuration (CI templates, linting config, etc.). The EM take: The choice often depends on team scale and discipline. A small, cohesive team in a monorepo moves fast. A large organization with independent team ownership often works better with polyrepo (or a hybrid: monorepo per domain, polyrepo across domains). Don\u0026rsquo;t choose monorepo unless you\u0026rsquo;re prepared to invest in build tooling (Bazel, Nx, Turborepo).\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/build-deploy-release/","section":"Posts","summary":"How you deploy code is as important as how you write it. The gap between writing a feature and it running in production reliably is where most engineering organizations lose velocity. 
This post covers the decisions that shape that gap.\n","title":"Build, Deploy, and Release: Trunk-Based Dev, Deployment Strategies, Zero-Downtime DB Migrations","type":"posts"},{"content":"Cloud infrastructure decisions are often more political than technical. The right answer depends on where your team\u0026rsquo;s expertise is, what your customers require, and what you\u0026rsquo;re willing to operate. Here\u0026rsquo;s how to frame these decisions at the EM level.\nAWS vs GCP vs Azure: Does It Actually Matter? # For most workloads, the difference between the big three is smaller than the cloud marketing suggests. Compute (VMs, containers, managed Kubernetes) is broadly equivalent. Managed databases, object storage, networking — table stakes at all three.\nWhere the differences are real:\nAWS:\nLargest ecosystem of managed services — if it exists as a managed service, AWS probably has it Largest community, most third-party tooling, most engineers with AWS experience Most mature Kubernetes managed service (EKS) in terms of enterprise features Best track record for exotic instance types (GPU, FPGA, high-memory, ARM) The default choice when there\u0026rsquo;s no other constraint GCP:\nBigQuery is genuinely differentiated — serverless data warehouse at massive scale with simple pricing Kubernetes is Google\u0026rsquo;s technology — GKE is polished and often ahead of EKS/AKS on new features Strong ML/AI infrastructure (TPUs, Vertex AI) if you\u0026rsquo;re building AI workloads Often less expensive than AWS at scale (especially for networking and egress) Less enterprise market share = fewer engineers to hire with GCP experience Azure:\nThe enterprise default — if your customers are Microsoft shops, Azure Active Directory integration alone drives this choice Best for .NET / Windows workloads, SQL Server, Active Directory integration Deep GitHub, DevOps, Visual Studio integrations Often the winner in regulated industries and government (FedRAMP, compliance 
certifications) The EM answer: \u0026ldquo;Which cloud depends on your team\u0026rsquo;s existing expertise, your customers\u0026rsquo; requirements, and any compliance constraints. For a greenfield startup with no constraints, I\u0026rsquo;d lean AWS for ecosystem breadth. For an enterprise software company, Azure integrates best with customer environments. For data-heavy or ML-heavy workloads, GCP\u0026rsquo;s tooling is strong.\u0026rdquo;\nKubernetes vs ECS vs Serverless vs VMs # Kubernetes # The industry-standard container orchestration platform. Self-healing, auto-scaling, declarative config.\nKubernetes wins when:\nYou have multiple services that benefit from unified orchestration (deployment, scaling, service discovery, configuration) Your team has or can build Kubernetes operational expertise You need advanced deployment strategies (canary, blue-green via Argo Rollouts) You want workload portability (run locally, on-prem, or any cloud) You want to add a service mesh, advanced networking, or custom admission controllers The cost: Kubernetes is complex. The control plane (managed on EKS/GKE/AKS), worker nodes, networking (CNI), storage (CSI), secrets management, ingress, monitoring — each layer requires understanding and maintenance. Managed Kubernetes reduces but doesn\u0026rsquo;t eliminate this.\nThe honest guideline: If you have fewer than 5–10 services or a small team, Kubernetes is likely overkill. It pays off at scale or when you have multiple teams deploying independently.\nAWS ECS (Elastic Container Service) # Simpler container orchestration, AWS-proprietary. 
Run containers on EC2 (ECS on EC2) or fully serverless (AWS Fargate).\nECS + Fargate wins when:\nYou\u0026rsquo;re AWS-native and want the simplest container hosting You don\u0026rsquo;t need Kubernetes features (advanced scheduling, custom CRDs, service mesh) You want truly serverless container hosting (Fargate handles infrastructure) Your team doesn\u0026rsquo;t want to manage Kubernetes Limitation: AWS-only. No portability. Less ecosystem than Kubernetes (no Helm charts, Argo, Tekton, etc.).\nServerless Functions (Lambda, Cloud Functions, Cloud Run) # Code runs on-demand. No servers to manage, pay per invocation.\nLambda wins when:\nEvent-driven processing — process S3 events, SQS messages, DynamoDB streams, API calls Infrequent or highly variable workloads — scales to zero (pay nothing when idle), scales to thousands of concurrent executions instantly CLI tools, scheduled jobs — no need for an always-on process Startup time is acceptable — typical cold starts are 100–500ms, mitigated by provisioned concurrency Stateless operations — functions are ephemeral; no local state between invocations Lambda\u0026rsquo;s limitations:\nMax execution time: 15 minutes per invocation — not for long-running jobs Cold start latency: The first invocation (or after a period of inactivity) takes longer. Provisioned concurrency eliminates this but adds cost. VPC networking: Placing a Lambda in a VPC (e.g., for private DB access) means outbound internet traffic needs a NAT Gateway — adds cost and latency Observability is harder — function logs are per-invocation; distributed tracing requires explicit instrumentation Not for always-on services — if your service has constant traffic, an always-on container is cheaper Cloud Run (GCP): HTTP-based container hosting that scales to zero. A middle ground — you bring your container, Cloud Run handles scaling, including scale-to-zero. Lower cold-start latency than Lambda for containerized workloads.\nVMs (EC2, Compute Engine) # Still valid. 
For stateful workloads, databases, workloads requiring specific kernel configuration, or when you need maximum control.\nWhen VMs make sense:\nRunning databases self-hosted (you need I/O tuning, kernel parameters) Workloads requiring low-level performance tuning (huge pages, NUMA awareness, specific kernel versions) Legacy applications that can\u0026rsquo;t be containerized When containerization overhead matters (extreme performance workloads) Service Mesh: What It Solves and When It\u0026rsquo;s Overkill # A service mesh (Istio, Linkerd, Consul Connect) moves cross-cutting concerns out of application code and into the infrastructure layer.\nWhat a service mesh gives you:\nmTLS automatically — every service-to-service call is encrypted and authenticated Traffic management — canary deployments, traffic splitting, retries, timeouts at the mesh level (no code changes) Observability — automatic metrics and traces for every service-to-service call without application instrumentation Circuit breaking and load balancing — at the sidecar level, not in your code Authorization policies — \u0026ldquo;service A is allowed to call service B; service C is not\u0026rdquo; The cost:\nOperational complexity — Istio especially is known for being complex to operate. Misconfigured Istio has caused more production incidents than it has prevented for teams without the expertise. Sidecar overhead — each pod gets a sidecar container (Envoy). Small CPU/memory overhead per pod (~50MB, ~5ms per request). Debugging complexity — when traffic doesn\u0026rsquo;t flow correctly, diagnosing mesh config vs app config vs network is non-trivial. 
When it\u0026rsquo;s worth it:\nYou have 10+ services with serious cross-cutting concerns (mTLS, traffic management, observability) You have a dedicated platform engineering team to operate the mesh Compliance requires service-level identity and encryption You want canary/blue-green deployments without application code changes When it\u0026rsquo;s overkill:\nSmall team, few services You don\u0026rsquo;t need all the features — if you just want mTLS, Linkerd is much simpler than Istio Your team will spend more time debugging the mesh than building features The lightweight alternative: Linkerd (much simpler to operate than Istio), or just network policies + mutual TLS at the application level for critical paths.\nMulti-Cloud: Smart Hedge or Expensive Distraction? # The case for multi-cloud:\nAvoid vendor lock-in Regulatory requirements to use multiple clouds Different clouds have genuinely better services for different workloads (GCP for ML + AWS for primary) Negotiating leverage with cloud providers The reality:\nRunning workloads across multiple clouds requires abstraction layers (Terraform, Kubernetes) that add complexity Managed services (S3, RDS, BigQuery) are cloud-specific — true portability means avoiding them, which means missing significant managed service value Most teams who commit to multi-cloud spend significant engineering time on the portability layer, not the product Cloud vendor lock-in is real but overestimated — the cost of migration is high but so is the cost of operating two cloud environments The honest EM answer: \u0026ldquo;Multi-cloud sounds strategic but is operationally expensive. I\u0026rsquo;d choose the right cloud for our workload, invest in infrastructure-as-code (Terraform/Pulumi) so we could migrate if forced, and avoid proprietary managed services only where the lock-in risk outweighs the operational simplicity. 
Using two clouds for genuinely different purposes (e.g., AWS for the product + GCP BigQuery for analytics) is reasonable and different from \u0026rsquo;everything runs on both clouds.'\u0026rdquo;\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/cloud-infrastructure/","section":"Posts","summary":"Cloud infrastructure decisions are often more political than technical. The right answer depends on where your team’s expertise is, what your customers require, and what you’re willing to operate. Here’s how to frame these decisions at the EM level.\n","title":"Cloud and Infrastructure: AWS vs GCP vs Azure, Kubernetes vs Serverless","type":"posts"},{"content":"Security architecture decisions have higher stakes than most — the cost of getting them wrong is a data breach, not a performance degradation. This post covers the trade-offs that come up in EM-level interviews: authentication approaches, identity protocols, and secrets management.\nSession-Based vs JWT: The Real Trade-offs # Both are valid. The choice depends on your consistency requirements and architecture.\nSession-Based Authentication # The server stores session state. On login, the server creates a session record (in DB or Redis) and sends a session cookie. On each request, the cookie is sent and the server looks up the session.\nAdvantages:\nImmediate revocation. Delete the session record → user is logged out globally, right now. This is the critical advantage. Simple invalidation on security events — password change, suspicious activity detection → delete all sessions. Server controls session lifetime — extend, shorten, or terminate based on server-side logic. No sensitive data on the client. The session cookie is just a random ID. Disadvantages:\nRequires shared session store for stateless/multi-instance deployments. Redis is the standard solution. Every request hits the session store — adds ~1ms latency per request (Redis round trip). This is usually fine. 
Doesn\u0026rsquo;t work well for mobile apps or non-browser clients where cookie management is manual. JWT (JSON Web Token) # Tokens are self-contained: they carry claims (user ID, roles, expiry) and are cryptographically signed by the server. No server-side state is needed to validate — just verify the signature.\nAdvantages:\nStateless validation — no session store lookup. Validates purely from the token + signing key. Works naturally across domains — mobile apps, SPAs, third-party integrations. Embeds user context — downstream services can extract claims without calling an auth service. Disadvantages:\nRevocation is hard. There\u0026rsquo;s no registry of valid tokens — a signed token is valid until expiry. If you need to log out a user mid-session (compromised account, password reset), you either: Wait for expiry (can be hours) Maintain a blocklist (now you have state again, defeating the purpose) Use short expiry (5–15 minutes) with refresh tokens (complexity) Token size — JWTs in cookies or headers can be large (especially with many claims). Not a problem in practice, but worth knowing. Algorithm confusion attacks — if JWT validation is implemented incorrectly (accepting alg: none, not validating the algorithm), it\u0026rsquo;s exploitable. Use a well-tested library, never roll your own JWT validation. When to use which:\nWeb apps with server-rendered content or same-domain SPA: Session cookies. Revocation is clean, CSRF protection is straightforward with SameSite=Strict. Mobile apps, SPAs calling third-party APIs, OAuth2 flows: JWTs. Stateless, portable. Microservices: JWTs for service-to-service claims propagation. API gateway validates the JWT once; downstream services trust the claims without calling auth. Revocation requirement is hard: Lean toward sessions or short-lived JWTs (\u0026lt; 5 minutes) with refresh token rotation. 
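The stateless validation described above boils down to recomputing and comparing a signature. Here is a toy HS256 verifier using only the JDK; class and method names are illustrative, and in production you should use a vetted library (jjwt, nimbus-jose-jwt) that also pins the algorithm, rejects alg: none, and checks expiry:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Toy HS256 JWT signature check. Illustration only: a real validator must
// also pin the expected algorithm from the header, validate exp/nbf claims,
// and should come from a well-tested library.
public class JwtSketch {

    // HMAC-SHA256 over "header.payload", base64url-encoded without padding
    static String sign(String headerDotPayload, byte[] key) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        byte[] sig = mac.doFinal(headerDotPayload.getBytes(StandardCharsets.US_ASCII));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
    }

    static boolean hasValidSignature(String jwt, byte[] key) throws Exception {
        String[] parts = jwt.split("\\.");
        if (parts.length != 3) return false;          // header.payload.signature
        String expected = sign(parts[0] + "." + parts[1], key);
        // Constant-time comparison guards against timing side channels
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.US_ASCII),
                parts[2].getBytes(StandardCharsets.US_ASCII));
    }
}
```

Note that nothing here consults server-side state: possession of the signing key is all that is needed to validate, which is exactly why revocation is hard.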
OAuth2 vs OIDC vs SAML # These are three different protocols solving overlapping but distinct problems.\nOAuth2: Authorization # OAuth2 is an authorization framework — it answers \u0026ldquo;can application X access resource Y on behalf of user Z?\u0026rdquo;\nThe flows:\nAuthorization Code flow (+ PKCE for public clients): The standard flow for web apps and mobile apps. User authenticates with the auth server, gets an authorization code, app exchanges it for tokens. Client Credentials flow: Machine-to-machine. No user involved. Service A gets a token to call Service B. Device flow: For devices without browsers (CLI tools, smart TVs) — user authenticates on a separate device. Key tokens:\naccess_token: Short-lived (minutes to hours), used to call APIs. Should be opaque to clients or a JWT. refresh_token: Long-lived, used to get new access tokens without re-authentication. OIDC (OpenID Connect): Authentication on Top of OAuth2 # OIDC adds an identity layer to OAuth2. It answers \u0026ldquo;who is this user?\u0026rdquo; by introducing the id_token (a JWT with user claims: sub, email, name).\nUse OIDC when you need to: authenticate users via a third-party identity provider (Google, GitHub, Azure AD), implement SSO across your applications, or get standard user profile information.\nOIDC vs OAuth2: OAuth2 alone tells you an app can access a resource. OIDC tells you who the user is. For user authentication (login), use OIDC. For API access delegation, use OAuth2.\nSAML: Enterprise SSO # SAML (Security Assertion Markup Language) is the older enterprise SSO standard. XML-based, stateful, tightly coupled to browser-redirect flows.\nWhen you encounter SAML: Enterprise customers requiring SSO integration with their corporate identity providers (Active Directory, Okta, PingIdentity). You don\u0026rsquo;t choose SAML — your enterprise customer requires it.\nSAML vs OIDC: OIDC is the modern alternative. If your enterprise customers support OIDC, use it. 
SAML is harder to implement correctly, XML is verbose, and the tooling is older. Many identity providers now support both.\nWhere Authentication Belongs in Microservices # Three options, each with different trade-offs:\nOption 1: Each service validates tokens independently Every service has the JWT signing key and validates tokens itself. Simple, no single point of failure. But: every service needs the signing key (key distribution problem), every service reimplements the same validation logic (risk of inconsistency), and adding a claim or changing validation logic requires updating every service.\nOption 2: API Gateway handles authentication The gateway validates the JWT. Downstream services receive a trusted header (X-User-ID, X-User-Roles) and trust it without revalidation. Centralizes auth concern, simplifies services.\nThe risk: If a service is reachable without going through the gateway (direct internal calls, misconfigured networking), it\u0026rsquo;ll accept requests without auth. Mitigation: network policy restricts direct access; mTLS between services.\nOption 3: Service Mesh handles authentication (mTLS + SPIFFE/SPIRE) The mesh enforces mTLS between all services. Services only accept connections from other services with a valid mesh certificate. Identity is proven at the transport layer, not the application layer. Combine with JWT validation at the gateway for user identity.\nThe recommendation for most teams: API gateway handles JWT validation + extracts user context. Pass user identity as trusted headers to downstream services. Add mTLS via service mesh if you need service-to-service authentication beyond network policy.\nSecrets Management # Secrets (API keys, DB passwords, signing keys, certificates) are the most sensitive assets in your system. 
Where they live determines your blast radius when they\u0026rsquo;re compromised.\nAntipattern: Secrets in code / environment variables hardcoded in deployment manifests\nCommitted to Git → permanent exposure in history Visible to anyone with repo access No rotation without redeployment The tiers of secrets management:\nTier 1: Cloud-native secret stores\nAWS Secrets Manager, Azure Key Vault, GCP Secret Manager Secrets stored encrypted, access controlled via IAM Automatic rotation for supported services (RDS passwords, for example) Audit trail of all access Injected into workloads at runtime via SDK or container init Tier 2: HashiCorp Vault\nSelf-hosted, cloud-agnostic secret store Dynamic secrets — generate short-lived DB credentials on demand instead of shared long-lived passwords Kubernetes integration — applications authenticate to Vault using their K8s service account token Sophisticated policy engine, full audit log Operational overhead of running Vault itself (though Vault\u0026rsquo;s HA mode and HCP Vault reduce this) Tier 3: Kubernetes Secrets\nBase64-encoded (not encrypted by default) ConfigMaps with tighter access control Must enable etcd encryption at rest to actually secure them External Secrets Operator: sync from AWS Secrets Manager / Vault into Kubernetes Secrets — best of both worlds Rotation strategy: Every secret should have a rotation plan. Database passwords rotated quarterly minimum (monthly for high-value systems). API keys with rotation support should be rotated regularly. Certificates should auto-rotate (cert-manager + Let\u0026rsquo;s Encrypt or internal CA).\nmTLS Between Services: Worth It? # Mutual TLS authenticates both client and server. 
Each service presents a certificate; both sides verify.\nWhat mTLS gives you:\nService identity — you know who\u0026rsquo;s calling, not just that the call came from within the cluster Encryption of inter-service traffic (important if network isn\u0026rsquo;t fully trusted) Defense against a compromised pod injecting traffic — without a valid certificate, connections are rejected The implementation path:\nManual: Generate CAs, issue certificates per service, rotate them — operationally expensive, error-prone Service mesh (Istio, Linkerd): mTLS is automatic. The mesh issues certificates (SPIFFE SVIDs) to each service and handles rotation. Zero application code change. When mTLS is worth it:\nRegulated industries (PCI, HIPAA) where network-level encryption and service identity are required Large microservice architectures where zero-trust networking matters When you\u0026rsquo;re already running a service mesh (the cost is nearly zero — mTLS is built in) When it\u0026rsquo;s overkill:\nSmall teams, few services, already have network policies restricting access The operational overhead of managing certificates or running a service mesh isn\u0026rsquo;t justified The honest take: if you\u0026rsquo;re running Kubernetes and have the resources to operate Istio, turn on mTLS. It\u0026rsquo;s free security. If you\u0026rsquo;re a small team running 5 services, network policies + JWT validation at the gateway is probably enough.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/security-authentication/","section":"Posts","summary":"Security architecture decisions have higher stakes than most — the cost of getting them wrong is a data breach, not a performance degradation. 
This post covers the trade-offs that come up in EM-level interviews: authentication approaches, identity protocols, and secrets management.\n","title":"Security and Authentication: JWT, OAuth2, and Secrets Management","type":"posts"},{"content":"Observability is the ability to understand what\u0026rsquo;s happening inside your system from the outside — from its outputs. The three pillars (logs, metrics, traces) are complementary tools, each answering different questions. Getting the combination right is what separates systems that you can reason about from systems that require tribal knowledge to debug.\nLogs vs Metrics vs Traces: What Each Gives You # Logs # Logs are the raw record of events — timestamped, structured or unstructured, per-request or system-level.\nWhat logs answer: \u0026ldquo;What exactly happened at time T in service S?\u0026rdquo; Detailed, contextual, narrative.\nStructured logs: JSON-formatted logs (vs plain text) make logs queryable and filterable at scale. With plain text, you need regex. With structured logs, you query fields: service=checkout AND user_id=123 AND level=ERROR.\nThe ideal log statement includes:\nTimestamp (ISO 8601, UTC) Service name, instance ID Trace ID and Span ID (for correlation with traces) Log level (DEBUG/INFO/WARN/ERROR) Message Contextual fields (user_id, order_id, request_id) What logs don\u0026rsquo;t give you: Aggregated views, trends, performance over time. Searching logs at scale is slow and expensive.\nMetrics # Metrics are numeric measurements over time — counters, gauges, histograms. 
Designed for aggregation and trending.\nWhat metrics answer: \u0026ldquo;How is the system performing right now, and how does it compare to yesterday?\u0026rdquo; Quantitative, aggregatable, cheap to store (numbers, not text).\nThe four golden signals (Google SRE):\nLatency: Time to serve a request (differentiate successful vs error latency) Traffic: Volume of requests (rps, tps) Errors: Rate of failed requests Saturation: How \u0026ldquo;full\u0026rdquo; the service is (CPU %, queue depth, connection pool usage) Histograms vs averages: Average latency hides the tail. P95 and P99 tell the real story. A system with p50 latency of 10ms and p99 of 2000ms has a serious problem the average doesn\u0026rsquo;t reveal. Always alert on and discuss percentiles.\nMicrometer: The standard metrics facade for Java/Spring Boot. Code emits metrics once; you plug in any backend (Prometheus, Datadog, CloudWatch) via a dependency. Never write System.out.println(\u0026quot;count: \u0026quot; + count) for metrics — use a proper metrics library.\nTraces # Traces follow a request across multiple services — a single logical operation broken into spans, each representing work in one service or component.\nWhat traces answer: \u0026ldquo;Where in this multi-service chain did my request spend its time, and which service caused the latency?\u0026rdquo;\nRequest (total: 450ms) ├── API Gateway (5ms) ├── UserService (15ms) ├── OrderService (300ms) │ ├── DB query (280ms) ← the bottleneck │ └── Cache lookup (20ms) └── NotificationService (130ms) ← async, not in critical path Without traces, you\u0026rsquo;d know the overall request was slow (from metrics) but not which service or operation caused it.\nImplementation: OpenTelemetry is the standard — vendor-neutral instrumentation. Spring Boot 3 auto-instruments common operations (HTTP requests, JDBC queries, Redis calls). Export to Jaeger, Tempo, Zipkin, or commercial APMs (Datadog APM, New Relic).\nWhen is tracing worth the cost? 
Almost always in production microservices. The instrumentation overhead is \u0026lt; 1% CPU/memory for typical workloads. The debugging time saved on the first production incident more than pays for the setup cost. The question isn\u0026rsquo;t whether to trace — it\u0026rsquo;s which backend to use.\nDebugging a Slow Service When No Alerts Are Firing # This is a common interview question. Your systematic approach:\n1. Is this a p50, p95, or p99 problem? Check latency percentiles. If p50 is fine but p99 is bad, it\u0026rsquo;s intermittent — probably GC pause, lock contention, or specific request patterns. If p50 is bad, it\u0026rsquo;s systematic.\n2. Check the four golden signals for the service itself and its dependencies:\nIs traffic volume normal? Is error rate elevated (even slightly)? Is saturation high (thread pool, DB connection pool, CPU)? 3. Look at traces for slow requests. Where is the time going? Which span is long?\n4. Check downstream dependencies. Service is slow because the DB is slow? Check DB query time, lock waits, replication lag. Cache is slow? Check Redis latency and hit rate.\n5. Correlate with deployments. Did someone deploy in the last hour? Check the diff.\n6. Infrastructure-level signals. Is this one pod or all pods? (One pod = instance-specific issue — maybe a JVM GC issue). Is there a correlation with time of day or traffic pattern?\n7. JVM-specific for Java services. GC logs — are there long pauses? Thread dump — are threads blocked on something? Heap profiler — is memory pressure causing thrashing?\nWhat to Alert On: Good vs Bad Alerts # The alert quality test: If the alert fires at 3am, should a human wake up to handle it? If yes, it\u0026rsquo;s a good alert. 
If it can wait until morning or is often a false positive, it shouldn\u0026rsquo;t page.\nGood alerts:\nError rate \u0026gt; 1% for \u0026gt; 5 minutes (user-visible impact) P99 latency \u0026gt; SLO for \u0026gt; 5 minutes Availability check fails (the service returns errors or is unreachable) Queue consumer lag growing for \u0026gt; 10 minutes (work is backing up) DLQ depth \u0026gt; 0 (poison messages need investigation) Certificate expiry \u0026lt; 14 days (proactive, not reactive) Bad alerts:\nCPU \u0026gt; 80% (resource metrics without user impact — just because CPU is high doesn\u0026rsquo;t mean users are affected) \u0026ldquo;Server restarted\u0026rdquo; (if autoscaling or Kubernetes restarts are expected, this is noise) Alerts without a clear remediation action (\u0026ldquo;what do I do if this fires?\u0026rdquo;) Alerts that fire constantly and get ignored (alert fatigue — worse than no alerts) Very tight thresholds that fire on minor blips Symptom-based vs cause-based alerts:\nSymptom-based (recommended): \u0026ldquo;Users can\u0026rsquo;t complete checkout\u0026rdquo; — fires when the user-observable outcome is broken Cause-based: \u0026ldquo;DB connection pool \u0026gt; 90%\u0026rdquo; — may or may not mean users are affected Alert on symptoms. Use cause-based metrics as diagnostic tools to investigate why the symptom alert fired.\nDistributed Tracing: When Is It Worth It? # It\u0026rsquo;s almost always worth it for microservices. 
The specific scenarios where it\u0026rsquo;s indispensable:\nLatency debugging — identifying which service in a 10-service chain caused a slowdown Error propagation — understanding how an error in a downstream service surfaces to the user Dependency mapping — discovering which services actually call which (as opposed to what the architecture diagram says) SLO breakdown — attributing latency budget to specific services/operations The cost:\nInstrumentation time (~1 sprint to set up, less for Spring Boot 3 which auto-instruments) Sampling strategy needed at scale — tracing every request is expensive. Sample 10% normally, 100% for errors and slow requests (tail-based sampling). Storage cost for traces — traces are large compared to metrics. Retention is typically 7–30 days. OpenTelemetry collector: The standard deployment pattern is to run an OTel Collector sidecar or DaemonSet in Kubernetes. Services emit spans to the collector; the collector batches and forwards to your backend. This decouples your application from the specific tracing backend.\nPII in Logs # This is a compliance and security issue that every EM should have a clear stance on.\nNever log:\nPasswords, tokens, API keys (even hashed — logging a hash of a password is still bad practice) Full payment card numbers, CVVs SSNs, government IDs Health information Be careful with:\nEmail addresses (PII in GDPR, CCPA, HIPAA contexts) IP addresses (PII in some jurisdictions) User IDs (if linked to a real person, they\u0026rsquo;re PII — but generally safer to log as a reference) Full request/response bodies (may contain any of the above) Practical patterns:\nLog field masking: Middleware that strips or masks known PII fields (password, creditCard, ssn) from structured logs Log level control: Don\u0026rsquo;t log request bodies at INFO — only at DEBUG, which should be disabled in production Data classification: Tag log fields by sensitivity. Only certain teams can access logs with PII-tagged fields. 
Correlation IDs, not user data: Log the user ID reference (a UUID), not the email or name. Join to user data only when necessary for debugging. Log retention limits: Keep DEBUG/INFO logs for 30 days, ERROR logs for 90 days. Don\u0026rsquo;t retain indefinitely. The accidental logging of PII in a publicly accessible logging system has caused multiple high-profile security incidents. Make PII log hygiene a code review requirement.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/observability/","section":"Posts","summary":"Observability is the ability to understand what’s happening inside your system from the outside — from its outputs. The three pillars (logs, metrics, traces) are complementary tools, each answering different questions. Getting the combination right is what separates systems that you can reason about from systems that require tribal knowledge to debug.\n","title":"Observability: Logs, Metrics, Traces, and Alerting","type":"posts"},{"content":"Reliability isn\u0026rsquo;t about preventing failures — it\u0026rsquo;s about building systems that fail gracefully, recover quickly, and maintain user trust even when things go wrong. This post covers the patterns that keep systems running under degraded conditions.\nThe Resilience Toolkit # Timeout # Set a maximum time to wait for any external call. Without timeouts, a slow dependency causes your threads to pile up waiting, eventually exhausting your thread pool.\nConnection timeout: how long to wait to establish a connection Read timeout: how long to wait for data once connected Overall timeout: max end-to-end time (often the most important) Common mistake: Setting timeouts too conservatively (tight) causes spurious failures. Too loose defeats the purpose. Start with p99 latency of the dependency × 2, then tune based on observed behavior.\nRetry # Automatically retry failed requests. 
Handles transient failures (network glitch, brief overload) without user visibility.\nRetry only:\nIdempotent operations — retrying GET /users/123 is safe; retrying POST /payments is not (unless you have idempotency keys) Transient failures (500, 503, timeouts) — not client errors (400, 401, 404) Retry with exponential backoff + jitter:\nattempt 1: fail → wait 100ms attempt 2: fail → wait 200ms + random(0-50ms) attempt 3: fail → wait 400ms + random(0-100ms) attempt 4: give up Jitter prevents the \u0026ldquo;thundering herd\u0026rdquo; — all failed requests retrying simultaneously and hammering the recovering service.\nCircuit Breaker # Tracks the failure rate of calls to a dependency. When failures exceed a threshold, \u0026ldquo;opens\u0026rdquo; the circuit — subsequent calls fail fast without hitting the dependency. After a cooldown period, allows a probe request through. If it succeeds, the circuit \u0026ldquo;closes.\u0026rdquo;\nCLOSED (normal): calls pass through, failure rate tracked ↓ failure rate \u0026gt; threshold (e.g., 50% in 10s window) OPEN (degraded): calls fail immediately, no network I/O ↓ after cooldown (e.g., 30s) HALF-OPEN: one probe request allowed through ↓ probe succeeds → CLOSED ↓ probe fails → OPEN (reset cooldown) Why it matters: Without a circuit breaker, calls to a failed dependency keep trying, consuming threads and resources. The circuit breaker provides fast failure, which allows the calling service to handle the failure gracefully (fallback, error to user) rather than hanging.\nResilience4j is the standard Java implementation. Configurable via Spring Boot starters.\nBulkhead # Isolates failures to a limited scope. Named after ship compartments that contain flooding.\nThread pool bulkhead: Each external dependency gets its own thread pool. If calls to the inventory service hang and fill its thread pool, calls to the user service still have their threads available.\nSemaphore bulkhead: Limits concurrent calls to a dependency. 
Simpler than thread pools; less isolation but lower overhead.\nKubernetes resource limits: At the infrastructure level, setting resource requests/limits per service ensures one service\u0026rsquo;s memory leak doesn\u0026rsquo;t starve others.\nRate Limiting # Limit how many requests a caller can make within a time window. Protects services from being overwhelmed.\nApply at:\nAPI gateway: Rate limit per API key, per IP, per user Service level: Rate limit incoming requests before processing Client level: The calling service respects rate limits from dependencies How Retries Make Outages Worse # This is the most important resilience failure mode to understand.\nScenario: Service B is slow (taking 5s per request instead of 100ms). Service A calls B with a 1s timeout and 3 retries.\nA\u0026rsquo;s request takes 1s → timeout → retry → 1s → timeout → retry → 1s → final timeout Each of A\u0026rsquo;s requests consumes 3 seconds of B\u0026rsquo;s capacity instead of 1 B is now receiving 3x its normal request volume B gets slower (overloaded), A retries more, B gets slower\u0026hellip; This is a retry storm (or retry amplification). The retry behavior of clients under load amplifies the overload rather than relieving it.\nPrevention:\nExponential backoff + jitter — spread retry timing, reduce simultaneous retry bursts Circuit breaker — once failure rate is high, stop retrying and fail fast Max concurrency limits — Bulkhead prevents retry storms from consuming all available threads Retry budgets — at the system level, bound total retry volume (10% of calls may be retries; beyond that, fail) Idempotency + deduplication at the server — retries are safe because the server handles duplicates SLOs and Error Budgets # SLO (Service Level Objective): A target reliability level for your service. 
\u0026ldquo;99.9% of requests complete in \u0026lt; 200ms\u0026rdquo; or \u0026ldquo;99.5% availability per month.\u0026rdquo;\nSLI (Service Level Indicator): The measurement that tracks whether you\u0026rsquo;re meeting the SLO. The actual latency or error rate.\nSLA (Service Level Agreement): A contractual commitment, usually with financial consequences. SLOs are internal targets; SLAs are external commitments.\nError budget: The inverse of the SLO. If SLO is 99.9%, the error budget is 0.1% — the amount of \u0026ldquo;bad\u0026rdquo; time or requests you\u0026rsquo;re allowed per period.\nWhy error budgets change behavior:\nWhen the error budget is healthy → teams can ship faster (spending budget on experiments) When the error budget is depleted → reliability work takes priority over features This creates an automatic, objective-driven conversation between product and engineering. The SLO is the shared goal; the error budget is the operational dashboard. Setting SLOs: Start with user-observable outcomes. \u0026ldquo;Can the user complete checkout?\u0026rdquo; is a meaningful SLO. \u0026ldquo;Is the recommendation service responding?\u0026rdquo; is a component metric, not a user-facing SLO. Aggregate from user journeys down to components.\nGraceful Degradation # When a dependency fails, the system should degrade gracefully rather than fail completely.\nPattern: For each dependency, define what \u0026ldquo;no dependency\u0026rdquo; behavior looks like:\nRecommendations service is down → show popular items (static fallback) Personalization service is down → show generic content Inventory service is slow → proceed with order, validate inventory async (accept the risk) Auth cache is unavailable → route to auth service directly (slower, not broken) Feature flags for dependencies: If a dependency is unreliable, wrap its calls in a feature flag. 
When it degrades, disable the flag — users don\u0026rsquo;t see the feature, but the core system stays up.\nPoison Message Handling # A \u0026ldquo;poison message\u0026rdquo; is a message in a queue that causes the consumer to fail every time it processes it. Without handling, the consumer retries indefinitely, blocking all subsequent messages.\nSolution: Dead Letter Queue (DLQ)\nConfigure a maximum number of delivery attempts (e.g., 5). After 5 failures, move the message to a DLQ. The main consumer processes normally; the DLQ holds messages for investigation.\nRequired practices:\nAlert on DLQ depth — a non-empty DLQ is always worth investigating Inspect and replay or discard from DLQ deliberately Include correlation IDs and error context in the DLQ message Audit — \u0026ldquo;what messages have we failed to process?\u0026rdquo; has compliance implications What to investigate when a message is in the DLQ:\nBug in the consumer code (most common — schema change broke deserialization) Invalid data in the message (upstream published a malformed event) Transient dependency failure that became permanent (the DB it needed is gone) Active-Active vs Active-Passive Multi-Region # Active-Passive:\nOne region handles all traffic (active) Second region is on standby, ready to take over Failover requires: detecting failure, promoting the passive region (DNS change, routing update), warming up caches Simpler to operate, but failover takes minutes Stale data in passive region if replication lag exists Active-Active:\nBoth regions handle traffic simultaneously Users routed to their nearest region Writes must replicate between regions — consistency challenge Any write in region A must be visible to region B readers within an acceptable window Conflict resolution needed if both regions write the same record simultaneously When active-active is worth it:\nGlobal user base where cross-region latency hurts (US + EU + APAC) Zero-downtime requirement — any single region failure is 
instantly absorbed by others Compliance: data residency requirements may require certain users\u0026rsquo; data to stay in a region When it\u0026rsquo;s not:\nYou have predominantly single-region users The consistency complexity (conflict resolution, replication lag handling) outweighs the availability benefit Most teams overestimate the availability gap between a well-run active-passive and active-active Middle ground: Active-passive with pre-warmed standby (cache primed, DB replica ready, smoke tests running) and automated failover \u0026lt; 2 minutes. This handles 95% of DR requirements without the consistency complexity of active-active.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/reliability-resilience/","section":"Posts","summary":"Reliability isn’t about preventing failures — it’s about building systems that fail gracefully, recover quickly, and maintain user trust even when things go wrong. This post covers the patterns that keep systems running under degraded conditions.\n","title":"Reliability and Resilience: Circuit Breakers, Retries, SLOs, and Failure Modes","type":"posts"},{"content":"Scaling is not a synonym for \u0026ldquo;add more servers.\u0026rdquo; Each scaling lever has different costs, trade-offs, and appropriate circumstances. 
Reaching for the wrong one wastes money, adds complexity, or misses the actual bottleneck.\nVertical vs Horizontal: When Each Makes Sense # Vertical Scaling (Scale Up) # Add more CPU, RAM, or faster storage to the existing instance.\nVertical wins when:\nYou\u0026rsquo;re early stage and operational simplicity matters — one big instance is dramatically easier to operate than a distributed cluster The workload is hard to parallelize (stateful, requires shared memory, complex coordination) You have a single-node database that can\u0026rsquo;t shard easily — scaling vertically is often faster and safer than sharding The cost per unit of performance is better vertically than horizontally at your current scale You have a resource bottleneck (CPU-bound → more cores; memory-bound → more RAM) that\u0026rsquo;s clearly addressable vertically Modern cloud instances are powerful. An r7g.16xlarge on AWS has 64 vCPUs and 512GB RAM. Many \u0026ldquo;distributed systems\u0026rdquo; projects are premature — a single well-specced Postgres instance handles more than teams think.\nVertical ceiling: Every instance has a maximum size. When you hit it, horizontal is the only option. Also, vertical scaling usually requires downtime (resize the instance).\nHorizontal Scaling (Scale Out) # Add more instances behind a load balancer. The application must be stateless (or state must be externalized — Redis for sessions, S3 for uploads, DB for everything else).\nHorizontal wins when:\nThe workload is parallelizable and stateless You need high availability (if one instance dies, others serve traffic) You\u0026rsquo;ve exhausted or are close to the vertical ceiling You have autoscaling requirements (scale in/out dynamically with traffic) Different components need to scale independently (API tier vs worker tier) The Scaling Order: What to Reach for First # Given a scaling bottleneck, apply in this order. Each step costs less in complexity than the next.\n1. 
Optimize first — profiling often reveals the real bottleneck. Missing index? N+1 query? Over-fetching? Fix it. 2. Vertical scaling — upgrade the instance. No code changes. 3. Caching — eliminate the bottleneck entirely for reads. A cache hit costs ~1ms vs 50ms DB query. 4. Read replicas — distribute read traffic. Works for read-heavy workloads (most are). 5. Connection pooling — PgBouncer/Hikari tuning. Often the bottleneck before the DB itself. 6. Asynchronous processing — offload work. Non-critical writes → queue → worker → DB. 7. Horizontal scaling of the application tier. Stateless services scale easily. Add pods. 8. Database sharding or distributed DB. Last resort. High complexity, high operational cost. Don\u0026rsquo;t skip to step 8 because you\u0026rsquo;ve heard \u0026ldquo;at scale we\u0026rsquo;ll need sharding.\u0026rdquo; Most systems never reach that scale. Over-engineering for 10x-100x future load is the most common scaling mistake.\nRead Replicas vs Caching vs Sharding # Read Replicas:\nCopies of the database that serve reads. Primary handles writes. Eventually consistent — replicas lag behind the primary (usually milliseconds, can be more under heavy load) Works well when: most queries are reads, you don\u0026rsquo;t need read-after-write consistency on all reads Cost: you pay for the replica instance. With Aurora you pay per read replica. Limitation: writes still bottleneck at the primary Caching:\nEliminates DB reads entirely for frequently accessed, cacheable data Hit rate is the key metric — aim for \u0026gt; 90% for it to be worth it Works well for: lookup data, computed results, session data, anything where the same query is repeated Cost: Redis instance + cache invalidation complexity Caching before read replicas often makes more sense — a cache hit is faster than a replica query, and the operational complexity is similar Sharding:\nHorizontal partitioning of the database. 
Data for user IDs 0-999999 goes to shard 1, 1000000-1999999 to shard 2. Enables write scale-out — each shard handles a fraction of the write load Massive operational complexity: cross-shard queries don\u0026rsquo;t exist (or require scatter-gather), resharding is painful, hot shards require rebalancing Alternatives to hand-rolled sharding: Citus (Postgres extension), CockroachDB, PlanetScale (MySQL), Vitess You probably don\u0026rsquo;t need this unless you have hundreds of thousands of writes per second The Hot Partition Problem # In Kafka, DynamoDB, Cassandra, or any partitioned system: a \u0026ldquo;hot\u0026rdquo; partition receives disproportionate traffic while others are idle. This creates a bottleneck on a single node regardless of how many nodes you have.\nCauses:\nPartitioning by a low-cardinality key (partitioning an events table by event_type when 95% of events are PAGE_VIEW) Celebrity / power user effect: one user\u0026rsquo;s data getting 1000x more traffic than average Temporal patterns: partitioning by date and every write goes to today\u0026rsquo;s partition Solutions:\nSalting: Add a random suffix to the partition key (user_id_0, user_id_1, \u0026hellip;, user_id_N). Distributes writes across N partitions. Reads must query all N and merge. Write sharding with read-time aggregation: Write counters to multiple shards, sum at read time. Application-level rate limiting: Limit writes to a hot user/entity at the application layer before they hit the data store. Adaptive partitioning: Some systems (DynamoDB, Cosmos DB) auto-split hot partitions. Know whether your system supports this. 
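The salting approach above can be sketched in a few lines of Java. This is an in-memory, single-threaded illustration: the bucket count of 8 and the `key_salt` naming convention are assumptions for the sketch, not details from any particular data store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Sketch of partition-key salting: writes to a hot key are spread
 * across N sub-partitions; reads must fan out to all N and merge.
 * The HashMap stands in for the partitioned store and is not thread-safe.
 */
public class SaltedCounter {
    private static final int SALT_BUCKETS = 8; // N, an assumed value
    private final Map<String, Long> store = new HashMap<>();

    /** Write path: append a random salt so writes land on different partitions. */
    public void increment(String key) {
        int salt = ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
        store.merge(key + "_" + salt, 1L, Long::sum);
    }

    /** Read path: scatter-gather — query all N salted keys and sum. */
    public long total(String key) {
        long sum = 0;
        for (int salt = 0; salt < SALT_BUCKETS; salt++) {
            sum += store.getOrDefault(key + "_" + salt, 0L);
        }
        return sum;
    }
}
```

The trade-off is visible in the read path: one logical read becomes N point reads. That is why salting suits write-heavy counters far better than read-heavy lookups.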
CPU-Bound vs I/O-Bound Scaling # I/O-bound services (waiting for DB, HTTP calls, disk):\nThreads spend most time waiting, not executing Horizontal scaling (more instances) helps — each instance handles more requests Virtual threads (Java 21) or async I/O reduces the thread count needed Read replicas and caching reduce the wait time per request CPU-bound services (image processing, ML inference, cryptography, complex computation):\nThreads are executing, not waiting More cores = more throughput (vertical scale or more instances) Virtual threads don\u0026rsquo;t help — CPU is the constraint, not thread scheduling Consider: offload CPU-intensive work to dedicated workers, GPU instances for ML workloads, precomputation and caching of results Autoscaling: What Metric to Scale On # The choice of scaling metric determines how well autoscaling responds to load.\nCPU utilization (most common):\nWorks for CPU-bound services Lags for I/O-bound services — threads are waiting, CPU is low, but latency is high Scale trigger: CPU \u0026gt; 70% → add instances Request queue depth / pending messages:\nBetter for queue consumer workers \u0026ldquo;When the queue has \u0026gt; 1000 messages, add consumers\u0026rdquo; Direct signal that work is backing up Custom business metrics:\nScale on \u0026ldquo;requests in flight\u0026rdquo; or \u0026ldquo;P95 latency \u0026gt; 200ms\u0026rdquo; Requires custom metrics export (Prometheus → KEDA, CloudWatch → ASG) Most accurate but requires instrumentation Memory utilization:\nRarely the right primary scaling metric (memory doesn\u0026rsquo;t correlate with load the same way) Useful as a ceiling alarm (OOM prevention), not a scale trigger Best practice: For API services, scale on CPU + request rate. For async workers, scale on queue depth. Set minimum instances high enough to handle baseline load without cold-start latency on scale-out. 
Test autoscaling behavior with load tests — not just at steady state but at scale-up and scale-down transitions.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/scaling-strategies/","section":"Posts","summary":"Scaling is not a synonym for “add more servers.” Each scaling lever has different costs, trade-offs, and appropriate circumstances. Reaching for the wrong one wastes money, adds complexity, or misses the actual bottleneck.\n","title":"Scaling Strategies: A Decision Framework","type":"posts"},{"content":"Consistency and availability trade-offs show up in nearly every system design discussion. The theory (CAP, PACELC) is well-known; the practical application — knowing which choice to make for a specific use case — is what separates a design-literate engineer from one who just quotes theorems.\nCAP Theorem: The Actual Claim # CAP states that in the presence of a network partition, a distributed system must choose between Consistency (all nodes see the same data at the same time) and Availability (every request receives a response, though it may be stale).\nWhat CAP doesn\u0026rsquo;t mean:\nIt\u0026rsquo;s not a binary permanent choice — modern systems tune per-operation consistency \u0026ldquo;Consistent\u0026rdquo; in CAP means linearizable consistency (strongest form) — not just \u0026ldquo;data is sometimes accurate\u0026rdquo; Network partitions are rare but inevitable. The real question is \u0026ldquo;what do you do when they happen?\u0026rdquo; The practical framing: Most distributed systems are not in a constant state of partition. The everyday trade-off isn\u0026rsquo;t about partitions — it\u0026rsquo;s about consistency vs latency, which is what PACELC addresses.\nPACELC: The More Useful Model # PACELC: During a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.\nThe \u0026ldquo;Else\u0026rdquo; clause is what matters day-to-day. 
In normal operation:\nConsistent reads require coordinating with enough replicas to guarantee the latest write is seen. This takes time. Low-latency reads can return from the nearest replica, which may be slightly behind. This is the everyday trade-off: do you pay latency for consistency, or accept some staleness for speed?\nDatabase examples:\nPostgres (single node): PC/EC — consistent, not distributed Cassandra: PA/EL — prefers availability during partition, low latency over consistency in normal operation. Tunable. DynamoDB: PA/EL by default, PA/EC with strong consistent reads option Spanner/CockroachDB: PC/EC — global strong consistency via TrueTime / HLC. You pay the latency. ZooKeeper: PC/EC — consistency over availability Eventual Consistency: When It\u0026rsquo;s the Right Choice # Eventual consistency means: if no new updates are made, all replicas will eventually converge to the same value. There\u0026rsquo;s a window during which replicas may return different values.\nWhere eventual consistency is fine:\nSocial media feed (10ms of lag between user posts is imperceptible) Product catalog (price changes propagate within seconds — acceptable) User preferences / settings (slight delay in reflecting saved settings is fine) Shopping cart read (showing a slightly stale version on render is fine; write always goes to the authoritative store) View counts, like counts, recommendations Where eventual consistency is dangerous:\nBank balance (two concurrent reads could both show sufficient balance, leading to double-spend) Inventory reservation (two requests could both see 1 item available and both succeed) Authentication tokens (revoked token should not be usable after revocation) Order fulfillment (committing to fulfill an order requires accurate inventory state) The pattern: eventual consistency is fine for reads of data that isn\u0026rsquo;t used as a gate on a consequential write. 
As soon as the read determines whether to allow a write (inventory check → place order), you need a stronger guarantee.\nRead-After-Write Consistency # A specific consistency requirement that comes up constantly: after a user writes data, they should see their own write when they read.\nThe failure mode: User updates their profile picture. They refresh — and see the old picture. The read went to a replica that hasn\u0026rsquo;t caught up yet. User thinks the save failed; they click save again. Race conditions ensue.\nHow to achieve it:\nRoute reads after write to the primary. Simple. Adds latency (primary may be farther away). Track the write\u0026rsquo;s replication token and only serve the read from a replica that has caught up to that token. DynamoDB and some Postgres drivers support this. Read your own writes via the cache. After writing, update the cache. Reads go to cache first. TTL ensures eventual fallback to replica. Client-side state. Don\u0026rsquo;t re-fetch after write — update the local state optimistically. User sees their write immediately because the client renders it; the replica discrepancy is irrelevant. Strong Consistency: When to Pay for It # Strong (linearizable) consistency means a read always returns the most recent committed write. Every reader sees a consistent, global ordering of operations.\nWhen it\u0026rsquo;s worth the latency and complexity:\nFinancial transactions — account balance, ledger entries Inventory management — decrement stock only if available Distributed locking — only one holder at a time Seat reservations, ticket booking — no double-booking Authentication / authorization state — revoked tokens must not grant access The implementation question: How do you achieve it? 
Options:\nRoute to primary — simplest, the primary is authoritative Quorum reads — read from majority of replicas (Cassandra QUORUM, DynamoDB strong reads) Serializable isolation — full serializable transaction isolation in Postgres Optimistic locking — read a version number, write only if version matches, retry on conflict Bank Balance vs Social Feed: A Contrast # Bank Balance:\nReads must be strongly consistent — you\u0026rsquo;re making a decision (can I withdraw?) based on this read Writes must be atomic and durable Consistency model: serializable transactions on the ledger Availability trade-off: it\u0026rsquo;s acceptable to return an error rather than a stale balance Implementation: transactions against a single authoritative database; replicas for reporting only Social Feed:\nReads can be eventually consistent — 50ms of lag in feed updates is imperceptible High write throughput (millions of posts/second globally) Consistency model: eventual, with monotonic reads (you don\u0026rsquo;t see posts disappear after you\u0026rsquo;ve seen them) Availability trade-off: it\u0026rsquo;s better to show a slightly stale feed than to return an error Implementation: fan-out on write (push to follower timelines) or fan-out on read (pull and merge), Cassandra or Redis for timeline storage, CDN caching for popular feeds Explaining CAP to a Product Manager # The honest, non-technical explanation:\n\u0026ldquo;When our database servers can\u0026rsquo;t talk to each other (a network split), we have a choice: do we keep accepting writes and reads (availability), or do we refuse operations until we know all servers agree on the current data (consistency)?\nFor most of our features — feed, search, recommendations — it\u0026rsquo;s fine if different users see slightly different results for a few seconds. We prioritize availability.\nFor payments and inventory, we cannot show you a balance that\u0026rsquo;s even 1 cent wrong. 
We prioritize consistency, and we\u0026rsquo;ll return an error rather than give you incorrect data.\u0026rdquo;\nThen anchor it to the product: \u0026ldquo;This is why the checkout flow sometimes shows an \u0026lsquo;out of stock\u0026rsquo; error even after you saw 1 item available — the inventory check happened at a different moment, and we\u0026rsquo;d rather give you a correct error than charge you for something we can\u0026rsquo;t fulfill.\u0026rdquo;\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/consistency-availability-cap/","section":"Posts","summary":"Consistency and availability trade-offs show up in nearly every system design discussion. The theory (CAP, PACELC) is well-known; the practical application — knowing which choice to make for a specific use case — is what separates a design-literate engineer from one who just quotes theorems.\n","title":"Consistency, Availability, and the CAP/PACELC Trade-off","type":"posts"},{"content":"The microservices vs monolith debate is one of the most over-indexed topics in software architecture — teams decompose too early, pay operational costs they\u0026rsquo;re not ready for, and spend months untangling the mess. The decision framework is simpler than the discourse suggests.\nStart With the Questions, Not the Conclusion # When a team says \u0026ldquo;we want to break our monolith into microservices,\u0026rdquo; the right response isn\u0026rsquo;t to approve or reject — it\u0026rsquo;s to ask:\n1. What problem are you trying to solve?\nDeployment independence? (\u0026ldquo;The payments team is blocked waiting for the user team to release\u0026rdquo;) Scale independence? (\u0026ldquo;Search needs to scale to 100x but billing doesn\u0026rsquo;t\u0026rdquo;) Team autonomy? (\u0026ldquo;12 teams working in one codebase is causing constant conflicts\u0026rdquo;) Technology heterogeneity? 
(\u0026ldquo;We need to use Python for ML but Java for the API\u0026rdquo;) Reliability isolation? (\u0026ldquo;A bug in the recommendation engine shouldn\u0026rsquo;t take down checkout\u0026rdquo;) If you can\u0026rsquo;t answer this specifically, the motivation is likely \u0026ldquo;microservices are modern\u0026rdquo; — which is not a reason.\n2. What\u0026rsquo;s the team\u0026rsquo;s operational maturity? Microservices require: distributed tracing, per-service monitoring, independent CI/CD pipelines, service discovery, centralized logging, network policies, and on-call runbooks for N services instead of 1. Most teams underestimate this by 10x.\n3. What\u0026rsquo;s the team size? Conway\u0026rsquo;s Law is real: your system architecture mirrors your communication structure. The rough heuristic: one service per team (or per two-pizza team). If you have 5 engineers, you don\u0026rsquo;t need 15 services.\nThe Modular Monolith: The Middle Ground You\u0026rsquo;re Not Considering # Before jumping to microservices, ask: \u0026ldquo;Have we tried making our monolith modular first?\u0026rdquo;\nA modular monolith has:\nClear module boundaries enforced by the package structure or module system Well-defined interfaces between modules (no direct cross-module field access) Independent test suites per module The ability to extract a module into a service later if needed The modular monolith gives you most of the domain separation benefits without the operational overhead. 
It\u0026rsquo;s dramatically underrated.\nWhen does the modular monolith break down?\nDifferent scaling requirements that can\u0026rsquo;t be addressed by horizontally scaling the whole app True deployment independence is needed (different teams, different release cycles) Different reliability requirements (one component fails constantly, and you don\u0026rsquo;t want it taking down everything) Genuine technology heterogeneity needs The Right Size for a Microservice # \u0026ldquo;What\u0026rsquo;s the right size?\u0026rdquo; is the wrong framing. The right framing: what are the right boundaries?\nGood service boundaries:\nAlign with a bounded context (DDD) — the service owns a coherent domain concept and its data Own their data — no shared database between services Have minimal coordination requirements — calling another service for every operation signals a misaligned boundary Have independent deployability — can be deployed without coordinating with other services The seam question: \u0026ldquo;If I change this service, do I always have to change that other service at the same time?\u0026rdquo; If yes, they\u0026rsquo;re too coupled and should probably be one service.\nSigns your service is too small (nanoservices):\nEvery business operation requires calling 5+ services Most services are essentially pass-throughs with no logic Network hops dominate your latency A \u0026ldquo;simple\u0026rdquo; feature requires deploying 4 services Signs your service is too large:\nMultiple teams are working on the same service and blocking each other The service has clear internal sub-domains that have different scaling or reliability requirements Deployments take hours and are risky because the blast radius is huge Shared Code Across Services: The Coupling Trap # When multiple services share a library, that library becomes a coordination point. 
The failure mode:\ncommon-lib contains the User model, Order model, validation logic Service A updates common-lib to add a field to User Service B, C, D, E all must update their common-lib dependency or the build breaks You\u0026rsquo;ve recreated the monolith as a distributed dependency graph What to share vs what not to:\nShare: Logging libraries, telemetry instrumentation, security token parsing utilities, internal HTTP client wrappers. These are infrastructure concerns, not domain concerns. Don\u0026rsquo;t share: Domain models, business validation logic, data transfer objects that represent domain concepts. Each service should own its domain types. Prefer duplication over wrong abstraction. Two services having their own User class with slightly different fields is usually better than a shared class that satisfies neither cleanly. The Operational Costs People Underestimate # Microservices don\u0026rsquo;t reduce complexity — they trade one kind of complexity for another.\nWhat you gain: Deployment independence, scale independence, team autonomy, technology heterogeneity, fault isolation.\nWhat you pay:\nDistributed system problems. Every service call can fail, timeout, return stale data, or experience network partition. You need timeouts, circuit breakers, retries, and idempotency everywhere. Observability complexity. A single request now touches 5 services. Without distributed tracing (Jaeger, Zipkin, Tempo), debugging is nearly impossible. Testing complexity. Integration testing a distributed system requires either mocks (fragile) or a real environment (expensive). Contract testing helps but adds process overhead. Data consistency. No cross-service transactions. Saga patterns, eventual consistency, and compensation logic must be designed and tested. Operational overhead. N services means N deployment pipelines, N monitoring dashboards, N on-call runbooks, N certificate renewals, N infrastructure configs. 
The rule of thumb: Each new service needs someone to own it. If no one has the bandwidth to own it — to monitor it, to be on-call for it, to maintain its runbook — don\u0026rsquo;t extract it yet.\nDistributed Transactions: Saga, Outbox, and 2PC # When an operation spans multiple services, you can\u0026rsquo;t use a database transaction. The patterns:\nSaga Pattern # Break the distributed operation into a sequence of local transactions. If a step fails, execute compensating transactions to undo previous steps.\nChoreography-based Saga: Each service publishes events and listens for events from other services. Loosely coupled, but the overall business flow is implicit — hard to see, hard to debug.\nOrderService: ORDER_CREATED event → InventoryService: INVENTORY_RESERVED event → PaymentService: PAYMENT_CHARGED event → OrderService: ORDER_CONFIRMED Failure: PaymentService publishes PAYMENT_FAILED → InventoryService: INVENTORY_RELEASED Orchestration-based Saga: A central orchestrator (or saga coordinator) explicitly tells each service what to do and handles failures.\nSagaOrchestrator: 1. Call InventoryService.reserve() → success 2. Call PaymentService.charge() → fails 3. Call InventoryService.release() (compensate) 4. Return failure to caller Orchestration is more visible and debuggable; choreography is more decoupled. For complex multi-step sagas, orchestration is often more maintainable.\nOutbox Pattern # Guarantees that a database write and a message publication are atomic, without two-phase commit.\nBEGIN TRANSACTION INSERT INTO orders(id, ...) VALUES (...) INSERT INTO outbox(event_type, payload) VALUES (\u0026#39;ORDER_CREATED\u0026#39;, {...}) COMMIT -- Separate process: SELECT * FROM outbox WHERE published = false Publish to Kafka UPDATE outbox SET published = true The outbox and the business data are in the same database, so they\u0026rsquo;re committed atomically. The publisher reads from the outbox and delivers to the message broker. 
At-least-once delivery — consumers must be idempotent.\n2PC (Two-Phase Commit) # Theoretically guarantees atomic commit across multiple systems. In practice: the coordinator becomes a single point of failure, blocking locks are held during the prepare phase, and failure scenarios are complex and hard to test. Almost never the right answer in microservices.\nThe EM stance: Design service boundaries to minimize distributed transactions. If you\u0026rsquo;re writing a saga for every operation, your service boundaries are wrong.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/microservices-vs-monolith/","section":"Posts","summary":"The microservices vs monolith debate is one of the most over-indexed topics in software architecture — teams decompose too early, pay operational costs they’re not ready for, and spend months untangling the mess. The decision framework is simpler than the discourse suggests.\n","title":"Microservices vs Monolith: Making the Right Architecture Call","type":"posts"},{"content":"API design decisions have long tails — once you publish an API and clients integrate with it, changing it is expensive. The choice of protocol, versioning strategy, and backwards compatibility approach should be deliberate, not defaults.\nREST: The Default Choice and Why It\u0026rsquo;s Usually Right # REST is HTTP-native — it uses standard verbs (GET, POST, PUT, PATCH, DELETE), status codes, headers, and content negotiation. 
It\u0026rsquo;s stateless, cacheable, and every HTTP client in existence can call it.\nREST wins when:\nYour consumers are diverse (mobile apps, third-party developers, browsers, other services) You need HTTP caching (GET responses with Cache-Control) The access patterns map naturally to resources and CRUD Team familiarity matters — REST is the most widely understood API style You need public or partner APIs where simplicity and documentation matter REST\u0026rsquo;s weaknesses:\nOver-fetching: API returns a User object with 30 fields; client needed 3. Wastes bandwidth and parsing time, especially on mobile. Under-fetching: Client needs user + orders + profile. Three round trips unless you build a custom endpoint. Versioning drift: Over time, APIs accumulate versions and deprecated fields, and the surface area becomes unwieldy. For most internal and external APIs, these weaknesses are manageable with thoughtful design (field selection, composite endpoints for common patterns) and don\u0026rsquo;t justify the complexity of an alternative.\nGraphQL: When It\u0026rsquo;s Worth the Complexity # GraphQL is a query language — clients specify exactly what data they need in the shape they need it.\nquery { user(id: \u0026#34;123\u0026#34;) { name email orders(last: 5) { id status total } } } GraphQL wins when:\nMultiple clients with different data needs. Mobile app needs fewer fields; web app needs more. With REST, you build multiple endpoints or bloat the response. With GraphQL, each client requests exactly what it needs. BFF (Backend for Frontend) aggregation. A single GraphQL layer aggregates data from multiple backend services. The client doesn\u0026rsquo;t need to know about backend service topology. Rapidly evolving data model. Adding new fields doesn\u0026rsquo;t break existing queries. Deprecating fields is visible in the schema. Complex, nested data relationships. GraphQL resolvers compose naturally for graph-shaped data. 
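Much of GraphQL's implementation burden centers on batching: a naive resolver fetches each item's related data individually. The standard mitigation is DataLoader-style batching — collect the keys requested during one resolution pass, then issue a single batched fetch. A toy sketch in Python (the `BatchLoader` class and `fetch_users` data source are illustrative, not a real library API):

```python
class BatchLoader:
    """Toy DataLoader: queue keys, then resolve them all in one batched call."""

    def __init__(self, batch_fetch):
        self.batch_fetch = batch_fetch  # fn: list of keys -> {key: value}
        self.queue = []

    def load(self, key):
        # Resolvers call load() per item; nothing is fetched yet.
        self.queue.append(key)
        return lambda: self.cache[key]  # deferred read

    def dispatch(self):
        # One batched query instead of len(queue) individual ones.
        self.cache = self.batch_fetch(sorted(set(self.queue)))

calls = []

def fetch_users(ids):   # hypothetical data source
    calls.append(ids)   # record how many round trips actually happened
    return {i: f"user-{i}" for i in ids}

loader = BatchLoader(fetch_users)
deferred = [loader.load(i) for i in (3, 1, 3, 2)]  # naive code: 4 queries
loader.dispatch()                                  # batched: 1 query
names = [d() for d in deferred]
```

Real DataLoader implementations also deduplicate within a request and schedule the dispatch automatically; the mechanism is the same.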
GraphQL\u0026rsquo;s real costs:\nCaching is harder. REST GET requests are trivially cacheable by URL. GraphQL queries are POST requests with a body — HTTP caching doesn\u0026rsquo;t apply by default. You need application-level caching (persisted queries, DataLoader for N+1 batching). N+1 queries are easy to introduce. A naive GraphQL resolver fetches each item\u0026rsquo;s related data in a loop. DataLoader batches these, but it must be implemented correctly. Error handling is non-standard. GraphQL returns HTTP 200 even when the query partially fails (errors in the errors array). This breaks conventional monitoring that keys on HTTP status codes. Security surface: Clients can write arbitrarily complex queries. Depth limiting, query complexity budgets, and persisted queries are necessary to prevent abuse. Tooling and expertise: The ecosystem is good but smaller than REST. Debugging, federation (Apollo Federation), schema stitching — all add complexity. The honest EM take: GraphQL is genuinely valuable for consumer-facing APIs where multiple clients (iOS, Android, web) have divergent data needs, or for a BFF aggregation layer. For internal service-to-service communication, it\u0026rsquo;s rarely the right choice — gRPC or REST is simpler.\ngRPC: Internal Service-to-Service Communication # gRPC uses Protocol Buffers (binary serialization) over HTTP/2. It\u0026rsquo;s contract-first — the .proto file defines the API, and code is generated for both client and server.\nservice UserService { rpc GetUser (UserRequest) returns (UserResponse); rpc StreamUserEvents (UserRequest) returns (stream UserEvent); } gRPC wins when:\nInternal service-to-service communication where performance matters Strongly typed contracts between services reduce integration bugs You want auto-generated client libraries in multiple languages You need streaming (server streaming, client streaming, bidirectional streaming) Polyglot microservices — generated clients work in Go, Java, Python, etc. 
gRPC\u0026rsquo;s costs:\nNot browser-native — gRPC-Web proxy needed for browser clients (adds complexity) Binary protocol means you can\u0026rsquo;t curl it without tooling (grpcurl, Postman with gRPC support) HTTP/2 can be problematic through certain proxies, load balancers, and firewalls Protobuf schema evolution requires discipline (don\u0026rsquo;t reuse field numbers) Steeper learning curve than REST for teams new to it REST vs gRPC for internal services:\nSmall team, REST expertise, simple request/response: REST is fine Performance-critical inter-service calls, polyglot environment, strict typing: gRPC The performance difference (binary vs JSON, HTTP/2 multiplexing) is real but usually not the bottleneck — don\u0026rsquo;t over-optimize.\nAPI Versioning # Versioning is a commitment to support multiple API behaviors simultaneously. Choose your strategy upfront because changing it later is painful.\nURL Versioning (/v1/users, /v2/users) # Explicit, discoverable Easy to route at API gateway Clients know exactly what version they\u0026rsquo;re using Version proliferation: /v1, /v2, /v3 requires parallel maintenance Header Versioning (Accept: application/vnd.api+json;version=2) # Clean URLs Harder to test (can\u0026rsquo;t just change the URL) Less discoverable Often used for content negotiation-style versioning No Versioning (Evolution instead) # Only add fields, never remove them Use @deprecated annotation in schemas and documentation Set a sunset date and enforce client migration Requires disciplined schema evolution (additive-only changes) Works well for mature APIs with trusted consumers Recommendation: URL versioning for public APIs (clarity over elegance). 
No versioning with additive-only evolution for internal APIs with internal consumers where you can coordinate migrations.\nBackwards Compatibility # When changing an API used by many clients, the risks are:\nRemoving a field a client depends on Changing a field\u0026rsquo;s type Changing behavior of an existing operation Safe changes (backwards compatible):\nAdding optional fields to requests Adding fields to responses (clients must ignore unknown fields — enforce this) Adding new endpoints Adding new enum values (with care — some clients break on unknown enums) Breaking changes:\nRemoving or renaming fields Changing field types Changing error codes or response structure Changing required/optional semantics Consumer-driven contract testing (Pact): Publish a contract describing what each consumer uses. CI checks that new API versions don\u0026rsquo;t violate any published contracts. This is the most rigorous approach for a large consumer base.\nSunset headers: Deprecation: true, Sunset: Fri, 01 Jan 2027 00:00:00 GMT. Programmatic signal to clients to migrate. Monitor usage of deprecated endpoints before removal.\nWebSockets and Server-Sent Events vs Polling # Polling: Client calls /status?id=123 every N seconds. Simple, stateless, easy to scale. The server is bombarded with unnecessary requests, most of which return nothing new. Acceptable for low-frequency status checks (job status, slow-changing data).\nLong Polling: Client makes a request; server holds it open until there\u0026rsquo;s data to send (or timeout). Reduces unnecessary requests but complicates server-side connection management. Largely superseded by SSE and WebSockets.\nServer-Sent Events (SSE): HTTP-based unidirectional push from server to client. Standard EventSource API in browsers. Automatic reconnection. Works through most proxies. Good for: live dashboards, news feeds, notification pushes, progress updates.\nWebSockets: Full-duplex, bidirectional. Client and server both push and receive. 
More complex to scale (stateful connections, sticky sessions or pub/sub fan-out layer). Good for: chat applications, real-time collaborative editing, live gaming, trading platforms.\nThe decision:\nOne-way server-to-client push, browser client: SSE Bidirectional real-time communication: WebSocket Infrequent updates, simple implementation: polling Never use WebSockets just because \u0026ldquo;it\u0026rsquo;s faster\u0026rdquo; for standard request/response — the overhead of connection management outweighs the benefit ","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/api-design/","section":"Posts","summary":"API design decisions have long tails — once you publish an API and clients integrate with it, changing it is expensive. The choice of protocol, versioning strategy, and backwards compatibility approach should be deliberate, not defaults.\n","title":"API Design: REST vs GraphQL vs gRPC","type":"posts"},{"content":"The choice between a message queue and an event streaming platform shapes your architecture more than almost any other infrastructure decision. Getting it wrong means rebuilding — not reconfiguring. Here\u0026rsquo;s how to think through it.\nMessage Queue vs Event Streaming: The Fundamental Distinction # This distinction matters before you pick a product.\nMessage queue (RabbitMQ, SQS, ActiveMQ):\nA message is a task or command for a consumer Typically consumed once — it\u0026rsquo;s deleted after successful processing Consumer drives the pace — pull or push, but once processed, it\u0026rsquo;s gone Good for: work distribution, background job processing, decoupled command execution Event streaming (Kafka, Kinesis, Google Pub/Sub):\nAn event is a fact — something that happened. It\u0026rsquo;s retained on the log. 
Multiple independent consumers can read the same events at their own pace The log is append-only and retained (configurable, but can be days/weeks/forever) Good for: audit trail, replayability, multiple consumers with different read positions, event sourcing, CDC The test question: \u0026ldquo;Do you need to replay events? Do multiple independent consumers need to process the same event for different purposes?\u0026rdquo; If yes, you need event streaming. If it\u0026rsquo;s just task distribution, a queue is simpler and sufficient.\nKafka: When to Use It # Kafka is the dominant event streaming platform. It\u0026rsquo;s designed for high-throughput, ordered, durable, replayable event logs.\nKafka wins when:\nYou have high write volume (millions of events/second) Multiple consumers need to process the same events independently (analytics + order processing + fraud scoring all from the same order event) You need replay — re-process historical events for a new consumer, replay after bug fix, backfill a new data store You need exactly-ordered processing within a partition Event sourcing — your system\u0026rsquo;s state is derived from the event log CDC pipeline — database changes published as events Kafka\u0026rsquo;s costs:\nOperational complexity — Zookeeper (pre-3.3) or KRaft, broker sizing, partition count decisions, consumer group management, rebalancing, lag monitoring Not a queue — consumer state (offset) is managed by the consumer. At-least-once delivery is the norm. Exactly-once is possible but requires transactional producers and idempotent consumers. Partition count is set at topic creation — scaling partitions later requires rebalancing Latency floor is ~5ms; not designed for ultra-low-latency use cases \u0026ldquo;Your team wants to introduce Kafka — what questions do you ask?\u0026rdquo;\nWhat problem is Kafka solving that a simple queue or synchronous call doesn\u0026rsquo;t solve? Who will operate it? 
Do we have Kafka expertise or budget for managed Kafka (Confluent Cloud, MSK)? Do we need replayability / multiple consumers / high throughput, or just decoupling? What\u0026rsquo;s the schema evolution strategy for event payloads? (Avro + Schema Registry, Protobuf, JSON with versioning?) How will we monitor consumer lag and set alerts? What\u0026rsquo;s the data retention requirement? RabbitMQ: When It\u0026rsquo;s the Right Tool # RabbitMQ is a traditional message broker: AMQP protocol, exchanges, queues, routing. Simpler to operate than Kafka, well-suited for work distribution.\nRabbitMQ wins when:\nYou need sophisticated message routing (topic exchanges, header-based routing, dead letter queues) You need per-message TTL and priority queues Consumer-driven acknowledgement model is important (consume → process → ack/nack) Lower throughput requirements (thousands/second, not millions) You need complex queuing topologies Work distribution where each message goes to exactly one consumer (competing consumers pattern) RabbitMQ vs Kafka:\nRabbitMQ Kafka Model Message queue Event log Consumers One consumer per message Multiple independent consumers Replay No Yes Throughput Thousands/sec Millions/sec Retention Until consumed Configurable (time or size) Routing Flexible (exchanges) Partition-based Ops complexity Lower Higher Best for Task distribution, work queues Event streaming, CDC, audit SQS and SNS: The AWS Default # If you\u0026rsquo;re on AWS and don\u0026rsquo;t have strong reasons for self-hosted Kafka or RabbitMQ, SQS + SNS is the path of least resistance.\nSQS Standard: At-least-once delivery, best-effort ordering. Simplest, highest throughput.\nSQS FIFO: Exactly-once processing, strict ordering (within a message group). Max 3,000 messages/second per queue (with batching). Use when order matters (financial transactions, user command sequences).\nSNS + SQS fan-out: SNS topic → multiple SQS queues. One event, multiple independent consumers. 
Approximates Kafka\u0026rsquo;s multi-consumer model for lower throughput cases.\nLimitations vs Kafka:\nNo replay — messages are deleted after consumption (even in FIFO) Max retention 14 days No consumer offset management Fan-out requires SNS topic + queue per consumer (more infrastructure) When SQS is enough: Your use case is background jobs, async processing, simple work distribution, and you don\u0026rsquo;t need replay or multiple consumers reading the same event history.\nExactly-Once Semantics: Do You Actually Need It? # \u0026ldquo;Exactly-once\u0026rdquo; is often misunderstood. There are two levels:\nExactly-once delivery: The message is delivered exactly once from the broker to the consumer. Kafka supports this with enable.idempotence=true + transactional.id.\nExactly-once processing (end-to-end): The downstream effect of the message happens exactly once. This requires idempotent consumers — the same message processed twice produces the same result.\nThe honest answer: Exactly-once delivery is achievable. Exactly-once end-to-end semantics require idempotent consumers, which is a design requirement on your business logic. You cannot guarantee exactly-once without idempotent processing on the consumer side.\nPractical approach: Design consumers to be idempotent (deduplicate by event ID), accept at-least-once delivery, and handle duplicates gracefully. This is simpler and more reliable than relying on transactional exactly-once, which has significant throughput overhead and operational complexity.\nSynchronous REST vs Async Messaging: The Decision # This comes up for every service interaction. 
The framework:\nUse synchronous REST/gRPC when:\nThe caller needs an immediate response with the result The operation is quick (\u0026lt; a few hundred ms) Failure should be surfaced immediately to the caller The client needs to know if the operation succeeded before continuing Example: \u0026ldquo;Is this user authorized?\u0026rdquo; — you need the answer now Use async messaging when:\nThe operation is long-running or the caller doesn\u0026rsquo;t need immediate confirmation of completion You want to decouple services so a downstream slowdown doesn\u0026rsquo;t propagate upstream Multiple services need to react to the same event The operation can be retried without user-visible impact Example: \u0026ldquo;Order placed — trigger inventory reservation, email confirmation, fraud check\u0026rdquo; — all can happen async Hybrid pattern (command + event): Accept a request synchronously (validate and persist), return a correlation ID, and process asynchronously. Client polls or receives a callback/webhook. Used in payment processing, video encoding, document generation.\nSchema Evolution in Event Payloads # Events accumulate technical debt. A schema you can\u0026rsquo;t change without breaking consumers is a serious problem. Strategies:\n1. Avro + Schema Registry (Confluent/Apicurio): Binary serialization with a central schema registry. Producers/consumers validate compatibility before publishing. Schema evolution rules enforced at write time: backward compatible (add optional fields), forward compatible (remove optional fields), fully compatible.\n2. Protobuf: Binary, backward/forward compatible by design if you follow the rules (don\u0026rsquo;t reuse field numbers, mark removed fields reserved). Good if you already use gRPC.\n3. JSON with versioning: Include a version or schemaVersion field. Consumers check and handle accordingly. Flexible but requires discipline — no enforcement at publish time.\n4. 
Event versioning patterns:\nSame topic, versioned field: { \u0026quot;version\u0026quot;: 2, ... }. Simple but consumers must handle multiple versions. Separate topics per version: orders-v1, orders-v2. Clean isolation but proliferates topics. Upcasting: Consumer converts v1 events to v2 format at read time. Good for replay scenarios. EM stance: Enforce schema compatibility programmatically from day one. An ad-hoc JSON schema without enforcement will break consumers within 6 months of the first \u0026ldquo;quick change.\u0026rdquo;\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/messaging-event-driven/","section":"Posts","summary":"The choice between a message queue and an event streaming platform shapes your architecture more than almost any other infrastructure decision. Getting it wrong means rebuilding — not reconfiguring. Here’s how to think through it.\n","title":"Messaging and Event-Driven Architecture: Kafka vs RabbitMQ vs SQS","type":"posts"},{"content":"Caching is the single highest-leverage performance tool available — and also one of the most common sources of production bugs. The decision isn\u0026rsquo;t just \u0026ldquo;should we cache?\u0026rdquo; — it\u0026rsquo;s where, how, and what the consistency implications are.\nCache Placement: Where Does the Cache Live? # Each layer has different latency, scope, and invalidation complexity.\nClient-Side Cache # Browser cache, mobile app cache. Controlled by HTTP Cache-Control headers. The cheapest possible cache — zero server load. Appropriate for truly static content (JS bundles, images, CSS). Not appropriate for user-specific or frequently changing data without careful ETag/Last-Modified handling.\nCDN Cache # Globally distributed edge nodes (Cloudflare, CloudFront, Fastly). Serves static assets and cacheable responses from a location close to the user. 
CDN caching can absorb enormous traffic spikes — a viral article getting 10M requests hits the CDN, not your origin.\nKey decision: What can you put on the CDN? Anything that\u0026rsquo;s the same for all users (or can be personalized at the edge via cookies/JWT) and doesn\u0026rsquo;t change too frequently. Product pages, landing pages, API responses with Cache-Control: public, max-age=300.\nAPI Gateway / Reverse Proxy Cache # NGINX or API Gateway caches responses. Useful when a large percentage of requests ask for the same thing (public API endpoints, rate-limited reads). Shared across all backend instances.\nApplication-Level Cache # Your service\u0026rsquo;s in-memory cache or a shared Redis instance. This is where most teams focus — it\u0026rsquo;s flexible and gives the most control.\nLocal (in-process) cache: Java ConcurrentHashMap, Caffeine, Guava Cache. Sub-microsecond reads, but not shared across service instances. If you have 10 pods, each has its own copy — inefficient for large datasets. Also invalidation is tricky — you need to handle cache coherence across instances.\nDistributed cache (Redis, Memcached): Shared across all service instances. A cache miss or invalidation from any instance affects all. Higher latency than local cache (~1ms vs nanoseconds) but consistent view across instances.\nMulti-level caching: Local L1 + Redis L2. Cache popular items in-process, fall back to Redis, fall back to DB. Complex to invalidate correctly — usually only worth it for extremely hot data.\nDatabase Query Cache # Postgres has never shipped a query result cache. MySQL had one, but it was removed in 8.0 — too many correctness and invalidation problems. 
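Stepping back to the multi-level setup described above: the L1 → L2 → DB read path can be sketched as follows (Python; plain dicts stand in for Caffeine and Redis, a function for the database — all names are illustrative):

```python
l1 = {}        # stands in for the in-process cache (e.g. Caffeine)
l2 = {}        # stands in for the shared cache (e.g. Redis)
db_reads = []  # tracks how often we fall through to the database

def db_get(key):       # hypothetical slow path
    db_reads.append(key)
    return "v:" + key

def get(key):
    # L1 -> L2 -> DB, populating the faster tiers on the way back up.
    if key in l1:
        return l1[key]
    if key in l2:
        l1[key] = l2[key]
        return l1[key]
    value = db_get(key)
    l2[key] = value
    l1[key] = value
    return value

get("user:42")  # miss everywhere -> one DB read, fills L2 and L1
get("user:42")  # L1 hit -> no DB read
l1.clear()      # e.g. this pod restarted
get("user:42")  # L2 hit -> still no DB read
```

The invalidation cost shows up here too: a write must clear the L1 copy on every pod plus the L2 entry, which is why the text reserves this pattern for extremely hot data.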
Most \u0026ldquo;DB caching\u0026rdquo; happens in the DB\u0026rsquo;s buffer pool — keep frequently accessed data in memory via proper sizing.\nRedis vs Memcached # This is mostly settled: use Redis unless you have a specific reason not to.\nMemcached is marginally faster at pure LRU string cache operations at extreme scale, and it\u0026rsquo;s truly multi-threaded (useful for multi-core cache machines). But:\nRedis supports strings, hashes, lists, sets, sorted sets, streams, HyperLogLog, geo-indexes Redis has persistence options (RDB + AOF) — cache survives restarts with warm data Redis Cluster for horizontal scaling Redis has Lua scripting for atomic multi-step operations Redis 6+ is multi-threaded for network I/O When Memcached still makes sense: You\u0026rsquo;re in a pure LRU string-cache scenario at extreme scale and have existing Memcached expertise and tooling. Almost no new systems should choose Memcached today.\nCache Patterns: Cache-Aside, Write-Through, Write-Behind # Cache-Aside (Lazy Loading) # The most common pattern. Application code manages the cache explicitly.\nREAD: 1. Check cache → hit? return. 2. Miss → query DB → store in cache → return. WRITE: 1. Write to DB. 2. Invalidate (or update) cache entry. Advantages: Only caches data that\u0026rsquo;s actually read. Resilient to cache failures (fall through to DB). Easy to implement.\nDisadvantages: Cache miss causes noticeable latency (cache fill under load). Initial cold start hits DB hard. Race condition on write: two reads can both miss, both query DB, one stores stale data.\nWhen to use: Read-heavy workloads where occasional cache misses are acceptable. Most caching scenarios.\nWrite-Through # Every write goes to both cache and DB simultaneously. Reads are always warm (if the data was ever written).\nWRITE: 1. Write to DB and cache atomically. READ: 1. Always hit cache (for recently written data). Advantages: Cache always has fresh data for recently written records. 
No cache miss on first read.\nDisadvantages: Write latency includes cache write. Caches data that may never be read (infrequently accessed writes still fill the cache). Cache storage must be large enough to hold write-through data.\nWhen to use: Systems where write latency is acceptable and read-after-write consistency matters (user profile updates, settings changes).\nWrite-Behind (Write-Back) # Writes go to cache first, DB is updated asynchronously.\nWRITE: 1. Write to cache → return success to caller. 2. Async: flush to DB (batched or periodic). READ: 1. Read from cache. Advantages: Write latency is minimized (cache write is fast). Can batch writes to DB for efficiency.\nDisadvantages: Risk of data loss if cache fails before flush. Complex failure handling. Reads might see data not yet in DB. Strong consistency guarantees are hard.\nWhen to use: High write throughput scenarios where some data loss is acceptable (analytics counters, activity tracking, view counts). Almost never for financial or critical transactional data.\nCache Invalidation: The Hard Problem # \u0026ldquo;There are only two hard things in computer science: cache invalidation and naming things.\u0026rdquo; The reason it\u0026rsquo;s hard: distributed systems don\u0026rsquo;t provide atomicity across a database write and a cache invalidation.\nPattern 1: TTL-based expiry Every cache entry has a time-to-live. After expiry, the next read misses and refills from DB. Simple, safe, but means serving stale data up to TTL seconds.\nRight call: Most data is OK to be stale by a few seconds or minutes. Use TTL as your default strategy and reserve event-based invalidation for data where staleness is genuinely harmful.\nPattern 2: Event-driven invalidation On write, publish an event (via Kafka, Redis pub/sub, database trigger) that invalidates the cache entry. Near-real-time freshness.\nRisk: Race condition — read → cache miss → DB read → publish event → cache write → invalidation arrives → entry deleted. 
The refilled entry is immediately invalidated. Under high concurrency this can cause cache thrashing.\nPattern 3: Cache-aside with versioned keys Instead of invalidating, change the cache key (include a version or timestamp). Old entries naturally expire via TTL. Eliminates invalidation races at the cost of more cache memory.\nPattern 4: Read-through with write invalidation Systematic invalidation tied to the write path. Works when writes are serialized through a single service that owns both the data and its cache.\nWhen Caching Makes Things Worse # Low hit rate: If your hit rate is \u0026lt; 80–90%, the overhead of cache lookups + misses may exceed the DB savings. Profile before assuming caching helps. Wrong granularity: Caching entire user objects when you only need the name field. Cache bloat → more evictions → lower hit rate. Cache stampede: All TTLs expire simultaneously at scale. Every request misses and floods the DB. Solution: randomize TTL (+/- 10–20% of base TTL), or use probabilistic early expiration (refresh when a small fraction of requests notice TTL is close to expiry). Memory pressure causes evictions: Cache is too small, eviction policy kicks in for hot data. Monitor eviction rate — it should be near zero for important data. Caching mutable data without invalidation: The bug where a user changes their email but the cache serves the old email for 24 hours. Caching at the wrong layer: Adding application cache when the DB query is just missing an index. Fix the root cause. The 95% Hit Rate Question # \u0026ldquo;Your cache hit rate is 95% but latency is still bad — what do you investigate?\u0026rdquo;\nA 95% hit rate sounds good, but at 1000 req/s that\u0026rsquo;s still 50 misses/second. If each miss takes 200ms (slow DB query), those 50 misses are dominating your p95/p99 latency even though your average looks fine. Look at:\nLatency distribution, not just averages. p99 tells the story, not p50. Are misses on specific keys? 
(Hot miss pattern — new content, cache eviction of specific keys) DB query performance on cache misses. Fix slow queries even if they\u0026rsquo;re infrequent. Thundering herd on misses. Multiple requests simultaneously miss the same key, all hit the DB. Network latency to Redis. If Redis is in a different AZ, add that to your analysis. ","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/caching-strategies/","section":"Posts","summary":"Caching is the single highest-leverage performance tool available — and also one of the most common sources of production bugs. The decision isn’t just “should we cache?” — it’s where, how, and what the consistency implications are.\n","title":"Caching Strategies: Placement, Patterns, and Pitfalls","type":"posts"},{"content":"NoSQL isn\u0026rsquo;t a single thing — it\u0026rsquo;s five different database families with fundamentally different data models, consistency guarantees, and use cases. Using the wrong family (or the wrong database within a family) is a common and costly mistake. Here\u0026rsquo;s how to think through each one.\nDocument Stores: MongoDB, DynamoDB, Firestore # Data model: Each record is a self-contained JSON-like document. 
Collections of documents, each with its own structure.\nStrengths:\nNatural fit for entities with variable structure (product catalog, CMS content, user profiles with optional fields) Efficient reads when you need the whole entity (no joins — everything is in one document) Flexible schema for rapid iteration MongoDB:\nRich query language — you can query on any field, including nested fields Aggregation pipeline for complex queries Atlas search for full-text ACID transactions across multiple documents (with overhead) Good fit: content management, product catalogs, user profiles, applications needing flexible schema and rich querying DynamoDB:\nFully managed, serverless, infinite scale with no ops Single-digit millisecond latency at any scale Massive limitation: you must design your access patterns upfront. You get a primary key + optional sort key, and Global Secondary Indexes (GSIs). Ad-hoc queries across arbitrary fields are painful or impossible. Good fit: high-scale applications with well-defined, limited access patterns — session storage, leaderboards, IoT event data, gaming The EM interview question: \u0026ldquo;Would you use DynamoDB for user account management?\u0026rdquo; — Depends on the queries. If it\u0026rsquo;s always \u0026ldquo;get user by ID,\u0026rdquo; fine. If you need \u0026ldquo;find all users who signed up in the last 30 days with email_verified = false,\u0026rdquo; you\u0026rsquo;re fighting DynamoDB. MongoDB vs DynamoDB:\nNeed rich querying on arbitrary fields → MongoDB Need infinite scale with no ops overhead + access patterns are known + AWS-native → DynamoDB Need multi-region active-active with minimal ops → DynamoDB (Global Tables) Key-Value Stores: Redis, DynamoDB (KV mode), Memcached # Data model: Pure lookup by key → value. 
The simplest possible model.\nRedis:\nIn-memory with persistence options (RDB snapshots, AOF log) Rich data structures: strings, hashes, lists, sets, sorted sets, streams, bitmaps, HyperLogLog Sorted sets are the power feature: leaderboards, time-series, range queries, rate limiting Pub/sub, Lua scripting, atomic operations (INCR, GETSET) Redis Streams for event sourcing / lightweight message queue Good fit: caching, session storage, rate limiting, leaderboards, real-time analytics, distributed locks, pub/sub Memcached:\nPure LRU cache. No persistence, no rich types, simpler. Slightly faster than Redis for pure cache workloads at extreme scale Multi-threaded by design (Redis was single-threaded until Redis 6) The honest truth: almost no new projects should choose Memcached over Redis. Redis does everything Memcached does and more. Wide-Column Stores: Cassandra, ScyllaDB, HBase # Data model: Tables with rows identified by a partition key. Within a partition, rows are sorted by a clustering key. Partitions distribute across nodes.\nKey properties:\nDesigned for extreme write throughput — writes are appended to commit log + memtable (sequential I/O, very fast) Linear horizontal scalability — add nodes, get proportional throughput Tunable consistency — write to any number of replicas (QUORUM, ALL, ONE) No joins, no transactions across partitions Schema must match your query patterns. You design tables for queries, not for normalization. Cassandra:\nWrite-heavy workloads at scale: time-series data, IoT telemetry, activity logs, audit trails Good for: \u0026ldquo;write 1M events/second, read the last 100 events for user X\u0026rdquo; Bad for: ad-hoc queries, aggregations, data with evolving access patterns ScyllaDB:\nDrop-in Cassandra replacement written in C++ (vs Java). ~10x higher throughput per node, lower latency, lower operational overhead. If you\u0026rsquo;re choosing Cassandra, seriously evaluate ScyllaDB first. 
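One concrete consequence of "design tables for queries": a time-series table partitioned only by device ID grows one unbounded partition per device, so a common fix is to fold a day bucket into the partition key, with the timestamp as the clustering key. A sketch of the key derivation (Python; the bucket scheme and names are illustrative, not a Cassandra API):

```python
from datetime import datetime, timezone

def partition_key(device_id: str, ts: datetime) -> str:
    # Partition key = (device, day bucket): bounds each partition to one
    # day of events, so "read recent events for device X" touches only a
    # handful of partitions instead of one ever-growing partition.
    return f"{device_id}#{ts.strftime('%Y-%m-%d')}"

def clustering_key(ts: datetime) -> int:
    # Within a partition, rows sort by timestamp; reading "the latest N
    # events" is a bounded range scan in clustering order.
    return int(ts.timestamp() * 1000)

t1 = datetime(2026, 4, 7, 9, 30, tzinfo=timezone.utc)
t2 = datetime(2026, 4, 8, 0, 1, tzinfo=timezone.utc)
k1 = partition_key("sensor-17", t1)  # "sensor-17#2026-04-07"
k2 = partition_key("sensor-17", t2)  # next day -> a different partition
```

Choosing the bucket width (hour, day, month) is the real design decision: it trades partition size against how many partitions a "recent events" read must touch.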
Cassandra vs DynamoDB for write-heavy time-series:\nDynamoDB: no ops, scales automatically, but you pay per WCU and RCU (can get expensive at high volume), less control over data model Cassandra/ScyllaDB: ops overhead but predictable cost at high volume, full control over partitioning strategy At very high write volumes on AWS, DynamoDB becomes expensive faster than running ScyllaDB on EC2 Graph Databases: Neo4j, Amazon Neptune # Data model: Nodes (entities) and edges (relationships), each with properties.\nThe key insight: Graph databases are for queries where the relationships themselves are the primary data — not just what things are, but how they connect, through how many hops, in what path.\nWhen they win:\nFraud detection: \u0026ldquo;Is this account connected to known fraudulent accounts within 3 hops?\u0026rdquo; Social networks: \u0026ldquo;What\u0026rsquo;s the shortest path between user A and user B? Who do they know in common?\u0026rdquo; Recommendation engines: \u0026ldquo;What products did people with similar purchase patterns buy?\u0026rdquo; Knowledge graphs, dependency mapping, org chart traversal Access control: \u0026ldquo;Does this user have permission to this resource through any role path?\u0026rdquo; When they lose:\nSimple entity storage with occasional relationship queries — a relational DB with proper indexes handles this fine High-write-throughput scenarios — graph DBs prioritize relationship traversal, not bulk ingestion Anything where your main query is \u0026ldquo;give me all nodes of type X\u0026rdquo; — that\u0026rsquo;s a table scan, not a graph query The EM test: If you can frame your key queries as \u0026ldquo;traverse these relationships\u0026rdquo; and the relationship depth matters, a graph DB is worth evaluating. If your \u0026ldquo;graph\u0026rdquo; queries are just simple joins, stay relational.\nSearch Engines: Elasticsearch, OpenSearch, Solr # Data model: Inverted index. 
Documents indexed with full-text analysis, scored by relevance.\nWhat they\u0026rsquo;re built for:\nFull-text search with relevance ranking (BM25 algorithm) Faceted search (filter by category AND price range AND brand simultaneously) Aggregations and analytics over large datasets Fuzzy matching, stemming, synonyms, autocomplete Log aggregation and analysis (the \u0026ldquo;ELK stack\u0026rdquo; / \u0026ldquo;EFK stack\u0026rdquo; for Kubernetes logs) Elasticsearch as primary store — when it works:\nProduct search where the read pattern is exclusively full-text + faceted search Log/event data where you\u0026rsquo;re querying recent time windows Elasticsearch as primary store — the risks:\nNo ACID. Documents are eventually visible after indexing. Not suitable for transactional writes or consistent reads Schema is set at index creation — reindexing is an expensive operation At-scale cluster management is non-trivial (shard sizing, replication, JVM tuning) The pattern: Use Elasticsearch/OpenSearch as a secondary index synced from your primary database (via CDC or dual-write). Your primary store is Postgres; you index the searchable fields into Elasticsearch for search queries. You lose a small amount of freshness but keep transactional integrity.\nOpenSearch vs Elasticsearch: OpenSearch is the AWS-maintained fork after Elastic changed its license. If you\u0026rsquo;re on AWS and using managed search, OpenSearch Service is the natural choice. 
If self-hosting or on GCP, Elasticsearch is fine.\nDecision Summary # If you need X, use Y: Variable-schema entities, rich queries → MongoDB. Infinite scale, known access patterns, AWS-native → DynamoDB. Caching, sessions, rate limiting, leaderboards → Redis. Extreme write throughput, time-series, append-heavy → Cassandra / ScyllaDB. Relationship traversal, fraud detection, social graph → Neo4j / Neptune. Full-text search, faceted navigation, log analysis → Elasticsearch / OpenSearch. Everything else (default) → PostgreSQL. The most important rule: don\u0026rsquo;t add a database you don\u0026rsquo;t need. Every additional store is operational overhead, another thing to monitor, another failure point, another set of runbooks. The default should always be \u0026ldquo;can Postgres handle this?\u0026rdquo; The answer is often yes.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/nosql-families/","section":"Posts","summary":"NoSQL isn’t a single thing — it’s five different database families with fundamentally different data models, consistency guarantees, and use cases. Using the wrong family (or the wrong database within a family) is a common and costly mistake. Here’s how to think through each one.\n","title":"NoSQL Families: Choosing the Right Tool","type":"posts"},{"content":"SQL is SQL until it isn\u0026rsquo;t. When you\u0026rsquo;re making a database selection for a new service, the choice between PostgreSQL, MySQL, and SQL Server comes down to features, ecosystem, operational model, and political reality. Here\u0026rsquo;s how to reason through it.\nPostgreSQL: The Default Choice for Most New Work # Postgres is the right default for most greenfield services at most companies. The reasons are concrete:\nFeature set that matters in practice:\nJSONB: Binary JSON stored with indexing support. You get SQL querying power over semi-structured data. 
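A minimal sketch of the JSONB pattern (table and field names invented for illustration):

```sql
-- Structured columns for what you always query; JSONB for variable attributes.
CREATE TABLE products (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    attrs      jsonb NOT NULL DEFAULT '{}'
);

-- A GIN index makes containment queries over the JSONB column fast.
CREATE INDEX products_attrs_idx ON products USING gin (attrs);

-- "All products with color = red" — served by the GIN index via @>:
SELECT id, name FROM products WHERE attrs @> '{"color": "red"}';
```

A t-shirt row can carry {"size": "M", "color": "red"} while a TV row carries {"resolution": "4k"} — no nullable-column sprawl, and both stay queryable from SQL.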
Hybrid approach: structured fields (user_id, created_at, status) as columns + flexible attributes as JSONB. This is genuinely useful — it\u0026rsquo;s not a NoSQL replacement, it\u0026rsquo;s an escape hatch for variable-schema data without leaving your transactional database. Window functions: ROW_NUMBER(), RANK(), LAG()/LEAD(), running totals — essential for analytics queries that would otherwise require multiple subqueries or application-side logic. CTEs (Common Table Expressions): Readable, composable, recursive queries. Postgres materialized CTEs by default until version 12; since then the planner inlines them unless you write MATERIALIZED — an important optimizer consideration either way. Partial indexes: Index only rows matching a condition (CREATE INDEX ON orders(created_at) WHERE status = 'PENDING'). Dramatically smaller index for the queries that need it. LISTEN/NOTIFY: Lightweight pub/sub within Postgres. Services can subscribe to database-level events. Often used for simple event-driven patterns without introducing Kafka. Full-text search: Built-in tsvector/tsquery — not Elasticsearch, but handles many search requirements without adding another system. Strong type system: Native UUID, array, hstore, range types. Not just VARCHAR and INT. Logical replication: Feeds CDC tools (Debezium), streaming to data warehouses, multi-region setups. Ecosystem and licensing: Open source (PostgreSQL License), no commercial licensing concerns, huge community, works on every cloud and on-prem.\nMySQL: Still Valid, Specific Trade-offs # MySQL (and its drop-in-compatible Aurora MySQL) is still a solid choice, especially if your team has deep MySQL expertise or if you\u0026rsquo;re in an environment where Aurora MySQL is the standard.\nWhere MySQL has historically lagged Postgres:\nLess complete SQL standard support (long lacked window functions, CTEs, etc.) 
— much of this was addressed in MySQL 8.0 Weaker full-text search No JSONB equivalent — has JSON type but indexing is more limited InnoDB\u0026rsquo;s behavior around locking and MVCC differs subtly from Postgres Where MySQL tends to be preferred:\nTeams with deep MySQL expertise and existing tooling around it High-read workloads where MySQL\u0026rsquo;s simpler replication model (binlog) is well-understood WordPress/PHP/LAMP ecosystem (effectively MySQL by default) When you\u0026rsquo;re on AWS and Aurora MySQL meets your needs — it\u0026rsquo;s extremely mature Honest take: For new services at a company not already standardized on MySQL, Postgres is usually the better long-term choice on features. But if your DBAs know MySQL deeply and your tooling is built around it, the migration overhead to Postgres rarely pays off.\nSQL Server: Enterprise, Windows, and Microsoft Shops # SQL Server is the right answer when:\nYou\u0026rsquo;re in a .NET / Azure-first environment where SQL Server integration is deep The business requires SQL Server for licensing/support contract reasons You\u0026rsquo;re working with enterprise software (SAP, Dynamics) that runs on SQL Server You need features like SQL Server Reporting Services, Integration Services, or Analysis Services SQL Server is expensive. Licensing for high-core-count servers is significant. For startups or cloud-native teams, this is usually a non-starter unless an enterprise customer or compliance requirement mandates it.\nManaged Cloud SQL: When to Use RDS / Aurora / Cloud SQL # When managed makes sense:\nYou don\u0026rsquo;t have DBAs. Managed services handle patching, backups, failover, and minor version upgrades. You want automated backups with point-in-time recovery (PITR) — essential for production. You need read replicas without operational complexity. You want automated failover (Multi-AZ RDS, Aurora). 
What you give up with managed:\nControl over OS-level tuning (huge pages, filesystem settings) Access to certain Postgres extensions not supported by RDS Cost — RDS is meaningfully more expensive than a self-managed EC2 instance for the same specs Some advanced configurations require forking to Aurora (e.g., Aurora-specific parameters) Self-hosted makes sense when:\nYou have DBA expertise on the team Cost at scale justifies the operational investment You need specific extensions (PostGIS, TimescaleDB, pgvector) not available on managed You\u0026rsquo;re on-prem or in a private cloud Aurora vs Standard RDS Postgres # Aurora Postgres is a modified Postgres engine running on a purpose-built distributed storage layer that keeps six copies of your data across three AZs. It\u0026rsquo;s not the same as RDS Postgres — storage, replication, and failover work differently, even though it speaks Postgres to your application.\nAurora advantages:\nStorage automatically grows (no need to provision disk upfront) Faster failover (~30s vs ~60–120s for Multi-AZ RDS) Aurora Global Database for cross-region replication with typically sub-second replication lag Aurora Serverless v2 for auto-scaling (minimum ACUs to maximum ACUs, scales down to near-zero) Up to 15 read replicas vs 5 for standard RDS Aurora trade-offs:\nHigher cost than standard RDS (storage cost model is different) Some Postgres extensions and features aren\u0026rsquo;t supported Aurora Serverless v2 cold start latency (scaling from minimum to active) can be a problem for latency-sensitive workloads with spiky traffic The decision: For services that need high availability, fast failover, and global replication — Aurora. For simpler needs, standard RDS Postgres is cheaper and more straightforward. Aurora\u0026rsquo;s cost model only makes sense when you\u0026rsquo;re utilizing the capabilities.\nHeavy Write Bottleneck: Decision Tree # Your DB is the write bottleneck. 
Walk through this before reaching for sharding:\nStep 1: Profile and diagnose - Identify the slow/hot queries (pg_stat_statements) - Check I/O wait vs CPU — are you I/O bound or CPU bound? - Check for lock contention (pg_locks, pg_stat_activity) Step 2: Query optimization - Missing indexes on write-heavy tables (insertions are only half the story — queries blocking writes via long transactions are often the real issue) - Batch writes instead of individual INSERTs (COPY or batch INSERT) - Reduce write amplification — are you writing the same data multiple times? Step 3: Schema optimization - UNLOGGED tables for data that can be reconstructed (reduces WAL overhead) - Partial indexes to reduce index write overhead - Partition tables by time range (pg_partman) — partition pruning speeds writes to current partition Step 4: Hardware/instance sizing - Upgrade to a larger instance with faster NVMe SSDs - Memory sizing — Postgres buffer pool hit rate should be \u0026gt;99% for hot data - Increase max_wal_size, checkpoint_completion_target for write-heavy workloads Step 5: Write offloading - Queue writes through a buffer (Kafka → consumer → batch insert) - Async write paths for non-critical data Step 6: Connection management - PgBouncer in transaction pooling mode — often eliminates connection overhead that masquerades as write bottleneck Step 7: Read replicas - Many \u0026#34;write bottleneck\u0026#34; problems are actually read-triggered lock contention - Moving heavy reads to replicas reduces lock pressure on primary Step 8 (last resort): Sharding / distributed DB - CockroachDB, CitusDB, or application-layer sharding - This is a significant architecture change — exhaust all above options first The number of teams that jump to step 8 while being at step 2 is remarkable.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/sql-flavors/","section":"Posts","summary":"SQL is SQL until it isn’t. 
When you’re making a database selection for a new service, the choice between PostgreSQL, MySQL, and SQL Server comes down to features, ecosystem, operational model, and political reality. Here’s how to reason through it.\n","title":"SQL Flavors: Postgres vs MySQL vs SQL Server","type":"posts"},{"content":"\u0026ldquo;Should we use SQL or NoSQL?\u0026rdquo; is one of the most common — and most misunderstood — architecture questions. Teams default to NoSQL because it sounds modern or scalable, or to SQL because it\u0026rsquo;s familiar. Neither is the right reason. The decision should come from your data\u0026rsquo;s shape, consistency requirements, and access patterns.\nWhen Relational Wins # Use a relational database when:\n1. Your data has relationships you\u0026rsquo;ll query across. If you regularly join orders to users to products to promotions, a relational model with foreign keys and proper indexes is cleaner and faster than assembling that from multiple document fetches or denormalized data.\n2. You need ACID transactions that span multiple entities. Transferring money between accounts, reserving inventory while recording an order, updating multiple tables atomically — these are relational databases\u0026rsquo; core strength. Multi-document transactions in MongoDB exist but carry overhead and aren\u0026rsquo;t always supported across all deployment topologies.\n3. Your schema is relatively stable and well-understood. The discipline of a schema is a feature, not a limitation. It catches bugs at write time instead of read time, enforces invariants, and makes the data self-documenting.\n4. You have complex reporting or ad-hoc queries. SQL is a powerful, flexible query language. Window functions, CTEs, aggregations, complex joins — doing this against a document store is painful.\n5. You value operational maturity. PostgreSQL, MySQL, SQL Server have decades of tooling, DBA expertise, migration tools, monitoring integrations, and community knowledge. 
You\u0026rsquo;ll find an answer to almost any production problem in a Stack Overflow thread.\nWhen Document Stores Win # Use a document database (MongoDB, DynamoDB, Firestore) when:\n1. Your data naturally maps to a document. A product catalog where each product has different attributes (a t-shirt has size/color, a TV has resolution/refresh-rate) is awkward in a relational schema (nullable columns or EAV tables). In a document store, each product is just a document with whatever fields it needs.\n2. You need schema flexibility during rapid iteration. In early product development, the schema changes every sprint. With a document store, adding a new field doesn\u0026rsquo;t require a migration — it just exists on new documents. (Warning: this also means you accumulate technical debt in the form of old documents missing new fields. Eventually you pay this debt in application code.)\n3. Your read access pattern is almost always \u0026ldquo;get everything for one entity.\u0026rdquo; If you\u0026rsquo;re almost always fetching \u0026ldquo;the entire user profile\u0026rdquo; or \u0026ldquo;the entire order with all line items,\u0026rdquo; denormalizing into a document is faster than joining five tables.\n4. You need horizontal write scale from day one. Document stores typically shard more naturally than relational databases. If your write volume is extreme and you know it from the start, a document store may be the right choice. (That said, Postgres can handle a lot more write throughput than most teams think before sharding becomes necessary.)\nThe \u0026ldquo;MongoDB is Faster\u0026rdquo; Response # When a team says \u0026ldquo;we want MongoDB because it\u0026rsquo;s faster,\u0026rdquo; the EM question is: faster for what?\nMongoDB can be faster for simple key-lookup reads of a single document — no join cost. But PostgreSQL with proper indexing on a single-row fetch is comparably fast. 
\u0026ldquo;MongoDB is faster\u0026rdquo; often reflects an experience where someone ran Postgres without indexes, or did a query that would benefit from denormalization, and then compared it to a MongoDB query on a pre-denormalized document. The comparison was unfair.\nWhat to probe:\nWhat queries are you optimizing? Have you profiled the SQL queries and confirmed they\u0026rsquo;re the bottleneck? Is the schema designed to support the access patterns (or is it a normalized academic schema never optimized for production)? Often the right answer is to optimize the relational queries first. NoSQL introduces significant operational complexity (eventual consistency, no joins, limited transactions) that shouldn\u0026rsquo;t be accepted without a real need.\nPolyglot Persistence: Multiple Stores in One System # Using different databases for different parts of your system is valid — but it\u0026rsquo;s a complexity budget decision.\nLegitimate use cases:\nCore transactional data in Postgres; full-text search in Elasticsearch; session/cache in Redis. These are genuinely different access patterns that different stores are optimized for. Product catalog in MongoDB; order management in Postgres. Product data is highly variable-schema; order data is structured and transactional. Warning signs:\nUsing polyglot persistence without clear ownership boundaries. If two services share a database, you\u0026rsquo;ve created coupling. If one service spans two databases with joins between them, you\u0026rsquo;ve created a nightmare. Choosing a NoSQL store for \u0026ldquo;future flexibility\u0026rdquo; without a concrete use case. The operational overhead of running and maintaining multiple database systems is real — you need separate backups, monitoring, expertise, and runbooks. Cross-Service Transactions # \u0026ldquo;How do you handle transactions when each service owns its own database?\u0026rdquo; is a standard EM-level question. 
The answer:\nYou don\u0026rsquo;t get distributed ACID — you use patterns that achieve eventual consistency:\nSaga pattern: Break a distributed transaction into a sequence of local transactions with compensating transactions for rollback. Choreography-based (events trigger next steps) or orchestration-based (a coordinator directs the sequence). Outbox pattern: Write the event to a local table in the same database transaction as the business data. A separate process reads and publishes it. Guarantees at-least-once event delivery without two-phase commit. Two-phase commit (2PC): Theoretically possible but rarely used in microservices — it requires a coordinator, is slow, and failure modes are complex. Avoid unless you have no alternative. The key insight: distributed transactions are usually the wrong framing. Better to ask \u0026ldquo;can I redesign the service boundaries so one service owns this entire operation?\u0026rdquo; Service boundary design should minimize cross-service coordination.\nWhen to Shard # Sharding adds massive operational complexity. Before reaching for it:\nOptimize queries first. Missing indexes, inefficient queries, full table scans — fix these first. Read replicas. Most applications are read-heavy. Adding read replicas handles 80% of scaling needs with minimal risk. Connection pooling. Postgres with PgBouncer handles 10,000+ connections on hardware that would crumble under naive direct connections. Caching. A well-placed cache eliminates a class of database reads entirely. Vertical scaling. Modern cloud instances offer 192 cores and 24TB of RAM. Vertical scaling is unfairly dismissed — it\u0026rsquo;s often the right answer for another 2–3 years. 
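The outbox pattern described above, reduced to its essence, is two inserts in one local transaction (schema invented for illustration; gen_random_uuid() assumes Postgres 13+ or pgcrypto):

```sql
BEGIN;
-- 1. The business write.
INSERT INTO orders (id, customer_id, total)
VALUES ('o-123', 'c-456', 99.00);

-- 2. The event, committed atomically with the business row.
INSERT INTO outbox (event_id, aggregate_id, type, payload)
VALUES (gen_random_uuid(), 'o-123', 'OrderCreated',
        '{"orderId": "o-123", "total": 99.00}');
COMMIT;
-- A separate relay (poller or CDC, e.g. Debezium) reads outbox rows and
-- publishes them to the broker: at-least-once delivery without 2PC.
```

Either both rows commit or neither does, so the event can never claim an order that doesn't exist — the whole point of the pattern.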
Signals that sharding might be necessary:\nSingle-node write throughput is saturated (not read — that\u0026rsquo;s replicas) The dataset is too large for a single node\u0026rsquo;s storage (though with SSDs and cloud volumes this is rarer than it sounds) You have strict data residency requirements (sharding by region) Even then, evaluate whether a managed distributed database (Aurora, CockroachDB, PlanetScale) abstracts the sharding complexity before building a custom sharding layer.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/system-design-basics/sql-vs-nosql/","section":"Posts","summary":"“Should we use SQL or NoSQL?” is one of the most common — and most misunderstood — architecture questions. Teams default to NoSQL because it sounds modern or scalable, or to SQL because it’s familiar. Neither is the right reason. The decision should come from your data’s shape, consistency requirements, and access patterns.\n","title":"SQL vs NoSQL: Making the Right Call","type":"posts"},{"content":"Spring Boot is the backbone of most Java microservice ecosystems. As an EM, you\u0026rsquo;re not expected to know every annotation — but you should be able to drive the architectural decisions: MVC vs WebFlux vs virtual threads, Boot 2 vs 3 migration, observability strategy, and testing approach. Here\u0026rsquo;s the full evolution with the trade-offs that matter.\nSpring Boot 1.x / Spring Framework 4.x — The Baseline # The auto-configuration model (@SpringBootApplication scanning the classpath) replaced XML config, starters eliminated manual dependency management, and embedded Tomcat killed the WAR file deployment model. If your org still builds WARs, that\u0026rsquo;s a conversation worth having.\nThe programming model was simple and synchronous: @RestController → dispatcher servlet → blocking thread per request. 
This works fine up to a few hundred concurrent requests per instance.\nSpring Framework 5 / Spring Boot 2.0 (2018) — Reactive Arrives # WebFlux # Spring Framework 5 introduced WebFlux — a fully non-blocking web stack built on Project Reactor (Mono\u0026lt;T\u0026gt; for one item, Flux\u0026lt;T\u0026gt; for a stream). Instead of blocking a thread while waiting for I/O, the thread is released and a callback fires when data is ready.\nThe promise: Handle more concurrent connections with fewer threads. A service doing thousands of concurrent outbound HTTP calls — e.g., a fan-out aggregator — can run on a handful of threads.\nThe cost: The reactive programming model is genuinely harder. Stack traces become nearly useless (they show reactor internals, not your code). Debugging requires understanding Reactor\u0026rsquo;s execution model. Onboarding new engineers takes longer. Libraries that aren\u0026rsquo;t reactive-native (legacy JDBC, certain clients) block carrier threads and undermine the model.\nThe honest EM take: Most teams adopted WebFlux for the wrong reasons — \u0026ldquo;it\u0026rsquo;s faster\u0026rdquo; is not sufficient. WebFlux shines when you have true backpressure requirements or when you\u0026rsquo;re doing high-concurrency I/O aggregation and can\u0026rsquo;t go to Java 21 virtual threads. For everything else, the complexity cost outweighs the throughput gain.\nSpring Boot 2.1–2.3 — Operational Maturity # Actuator overhaul: Health endpoints, metrics via Micrometer. Micrometer is a vendor-neutral metrics facade — your code emits metrics once, and you plug in Prometheus, Datadog, CloudWatch, or anything else via a dependency. This is the right abstraction; use it.\nLayered JARs and Buildpacks (2.3): Docker image optimization. Layered JARs separate dependencies from app classes, so rebuilds only push the changed layer. Cloud Native Buildpacks (spring-boot:build-image) produce OCI images without writing a Dockerfile. 
For teams struggling with Docker image maintenance, this reduces friction significantly.\nGraceful shutdown: Added in 2.3. When the app receives SIGTERM, it stops accepting new requests but finishes in-flight ones. Essential for Kubernetes zero-downtime deploys. Default is disabled — enable it: server.shutdown=graceful.\nSpring Boot 2.4–2.7 — Config and Cloud # Config import (spring.config.import): Replaced the bootstrap.yml / Spring Cloud Config bootstrap context with a cleaner import mechanism. If your team uses Spring Cloud Config Server, this changes how config is loaded and can break existing setups on upgrade. Test this carefully.\nProfile YAML documents: A single application.yml can contain multiple profile-specific sections using --- separators.\nVolume-mounted config trees: Reads Kubernetes ConfigMap and Secret key-value pairs from mounted filesystem paths — clean integration without custom bootstrap code.\nSpring Boot 3.0 / Spring Framework 6 (Late 2022) — The Breaking Jump # This is the release where \u0026ldquo;check it against your dependency list first\u0026rdquo; became mandatory advice.\nJava 17 Minimum # No more Java 8 or Java 11. If you\u0026rsquo;re on Boot 2.x with Java 11, the Boot 3 migration forces a Java upgrade. Usually fine, but plan for it.\nJakarta EE 9 — The Painful Part # Every javax.* import becomes jakarta.*. This sounds mechanical but it\u0026rsquo;s pervasive:\njavax.servlet.http.HttpServletRequest → jakarta.servlet.http.HttpServletRequest javax.persistence.* → jakarta.persistence.* javax.validation.* → jakarta.validation.* Any library that hasn\u0026rsquo;t published a Jakarta-compatible version is a blocker. This is the primary reason Boot 2 → 3 migrations stall. Run a dependency audit before planning the migration timeline.\nGraalVM Native Image # Ahead-of-time compilation to a native executable: no JVM startup, sub-100ms startup time, ~10x less memory than JVM. 
Sounds transformative.\nTrade-offs that matter:\nBuild times are long (minutes, not seconds). CI pipelines need adjustment. Reflection, dynamic proxies, and classpath scanning require configuration hints. Spring provides many automatically, but third-party libraries may not. Dynamic features (some Hibernate behaviors, certain Spring Data queries) may fail at runtime if not configured correctly in AOT mode. Best fit: Serverless functions, scale-to-zero workloads, CLI tools. For always-on services, startup time doesn\u0026rsquo;t matter — CDS (Class Data Sharing) is a better middle ground. Observability Overhaul # Spring Cloud Sleuth (distributed tracing) is dead — replaced by Micrometer Tracing which builds on the Micrometer Observation API. The unified model: one @Observed annotation or Observation API call instruments metrics, traces, and logs together. OpenTelemetry is supported natively.\nWhy this matters architecturally: Your observability stack in Boot 3 should be Micrometer + OpenTelemetry exporter → your backend (Tempo, Jaeger, Zipkin, or a commercial APM). Don\u0026rsquo;t fight the framework.\nHTTP Interfaces # Declarative HTTP clients, similar to Feign but built into the framework:\n@HttpExchange(\u0026#34;https://api.example.com\u0026#34;) interface UserClient { @GetExchange(\u0026#34;/users/{id}\u0026#34;) User getUser(@PathVariable String id); } Generated by a proxy, no implementation needed. Works with the new RestClient and WebClient. For internal service-to-service calls, this is cleaner than manual RestTemplate or Feign configuration.\nProblem Details (RFC 7807) # Standard error response format: type, title, status, detail, instance. Enabled via spring.mvc.problemdetails.enabled=true. Useful when your API consumers are external or need machine-readable errors.\nSpring Boot 3.1 — Developer Experience # Docker Compose support: Add spring-boot-docker-compose and Boot auto-starts your compose.yml on startup in development. 
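A minimal compose.yml this would pick up — Boot recognizes well-known images such as postgres and derives the connection properties from the container (service name and credentials here are illustrative):

```yaml
services:
  postgres:
    image: "postgres:15"
    environment:
      POSTGRES_USER: myapp        # Boot maps these onto spring.datasource.* at startup
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    ports:
      - "5432"                    # random host port; Boot discovers the mapping
```
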
No more \u0026ldquo;remember to start your local Postgres before running the app.\u0026rdquo;\nTestcontainers integration (@ServiceConnection): Define a Testcontainers container in test config and Spring Boot auto-wires the connection properties. Real database, real Redis, real Kafka — in tests, with zero manual URL configuration.\n@SpringBootTest class OrderServiceTest { @Container @ServiceConnection static PostgreSQLContainer\u0026lt;?\u0026gt; postgres = new PostgreSQLContainer\u0026lt;\u0026gt;(\u0026#34;postgres:15\u0026#34;); // Spring Boot reads connection details automatically — no @DynamicPropertySource needed } This is the single best improvement to Spring Boot testing in years. Use it.\nSpring Boot 3.2 — Virtual Threads # The headline feature: one configuration property to run Tomcat on virtual threads:\nspring: threads: virtual: enabled: true All request handling moves to virtual threads. Each blocking call — database query, external HTTP call — parks the virtual thread instead of blocking an OS thread. You get WebFlux-level concurrency with Spring MVC\u0026rsquo;s straightforward programming model.\nRestClient: New synchronous HTTP client, modern replacement for RestTemplate (which is in maintenance mode, not removed). Fluent API:\nRestClient client = RestClient.create(); User user = client.get() .uri(\u0026#34;https://api.example.com/users/{id}\u0026#34;, id) .retrieve() .body(User.class); JdbcClient: Fluent JDBC API that makes the JdbcTemplate API much less verbose.\nSpring Boot 3.3–3.4 — Refinement # Structured logging: JSON logs out of the box with logging.structured.format.console=ecs or logstash. In Kubernetes where logs go to ELK/Loki, JSON is far better than text — no log parsing regex needed.\nCDS (Class Data Sharing) polish: Improved tooling for creating class data archives, reducing JVM startup time by 20–40% without going full native image. 
Good middle ground for teams that want faster startup without GraalVM complexity.\nSpring Security: Lambda DSL is now the only way (WebSecurityConfigurerAdapter was removed in 6.x). If you have legacy security config, it must be rewritten:\n@Bean SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception { return http .authorizeHttpRequests(auth -\u0026gt; auth .requestMatchers(\u0026#34;/public/**\u0026#34;).permitAll() .anyRequest().authenticated() ) .oauth2ResourceServer(oauth2 -\u0026gt; oauth2.jwt(Customizer.withDefaults())) .build(); } The Architectural Decision Matrix # MVC vs WebFlux vs MVC + Virtual Threads # (Each row reads: Spring MVC | WebFlux | MVC + Virtual Threads, 3.2+.) Programming model: imperative, simple | reactive, complex | imperative, simple. Concurrency model: thread per request | event loop + callbacks | virtual thread per request. Debugging: normal stack traces | Reactor internals | normal stack traces. Throughput (I/O bound): good | excellent | excellent. Backpressure: no | yes | no. Hire for: easy | hard | easy. Best fit: most services | high-concurrency I/O, streaming | most services on Java 21. The recommendation today: If you\u0026rsquo;re on Spring Boot 3.2+ and Java 21, enable virtual threads and stay with Spring MVC. You get most of WebFlux\u0026rsquo;s throughput benefits without its complexity. Only choose WebFlux if you specifically need backpressure or are already invested in the reactive stack.\nBoot 2 → 3 Migration Playbook # Dependency audit first. Identify every javax.* import and every third-party library. Check if Jakarta EE 9-compatible versions exist. Java 17 upgrade as a separate step from Boot upgrade. Upgrade to Boot 2.7.x (last 2.x release) — it includes deprecation warnings for things removed in Boot 3. Fix deprecated usages — WebSecurityConfigurerAdapter, old config bootstrap, removed APIs. Upgrade to Boot 3.0 — expect javax.* → jakarta.* compile errors. Use IntelliJ\u0026rsquo;s \u0026ldquo;Migrate to Jakarta EE 9\u0026rdquo; refactoring. Run full test suite. 
Testcontainers integration tests will catch runtime issues native compilation might not. Enable virtual threads (3.2+) and validate no synchronized pinning issues. Spring Data Evolution # Spring Data JDBC matured as a lighter alternative to JPA. It\u0026rsquo;s explicit — no lazy loading, no transparent dirty checking, no session cache. What you call is what executes. For teams burned by Hibernate surprises (N+1 queries, LazyInitializationException), Spring Data JDBC is worth considering.\nR2DBC (Reactive Relational Database Connectivity) is the non-blocking database driver layer for WebFlux apps. If you\u0026rsquo;re committed to the reactive stack, it\u0026rsquo;s the right tool. Otherwise, JDBC + virtual threads is simpler.\nSpring Cloud — Know What It Does, Know When to Skip It # Spring Cloud components worth knowing:\nConfig Server: Centralized externalized config. Viable but many teams migrate to Kubernetes ConfigMaps/Secrets + Vault. Gateway: API gateway built on WebFlux. Solid. Resilience4j: Replaced Hystrix for circuit breakers. Framework-agnostic; can use standalone or with Spring Boot starters. Service discovery (Eureka/Consul): Many teams moved to service mesh (Istio) or rely on Kubernetes DNS instead. EM trade-off discussion: \u0026ldquo;Do you need Spring Cloud or does your infrastructure solve it?\u0026rdquo; Kubernetes + Istio handles service discovery, traffic management, mTLS, and circuit breaking at the infrastructure layer — no application library changes needed. Spring Cloud still makes sense when you need application-level awareness (e.g., client-side load balancing with routing logic) or when you\u0026rsquo;re not on Kubernetes.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/spring/spring-boot-evolution/","section":"Posts","summary":"Spring Boot is the backbone of most Java microservice ecosystems. 
As an EM, you’re not expected to know every annotation — but you should be able to drive the architectural decisions: MVC vs WebFlux vs virtual threads, Boot 2 vs 3 migration, observability strategy, and testing approach. Here’s the full evolution with the trade-offs that matter.\n","title":"Spring Boot Evolution: 1.x to 3.4 — What Every EM Needs to Know","type":"posts"},{"content":"Garbage collection is one of those topics where \u0026ldquo;I let the JVM handle it\u0026rdquo; is a perfectly valid answer until it isn\u0026rsquo;t — and for EMs, that inflection point usually shows up as unexplained latency spikes in production, OOM kills in containers, or a team paralyzed by which GC flag to tweak. Here\u0026rsquo;s the full picture from Java 8 through 21.\nThe Baseline: Java 8 Collectors # Parallel GC (the Java 8 default) # Stop-the-world collection on both minor (young gen) and major (old gen) GCs. All available CPU cores run the collection in parallel — hence the name. Good for batch and throughput-oriented workloads where pause time doesn\u0026rsquo;t matter, terrible for latency-sensitive services.\nIf you\u0026rsquo;ve ever seen a Spring Boot service pause for 500ms–2s randomly, and the app has been running since Java 8 days, Parallel GC with a large old gen is almost certainly the culprit.\nCMS (Concurrent Mark Sweep) # CMS was designed to solve Parallel GC\u0026rsquo;s pause problem by doing most of the marking concurrently with application threads. It worked — pauses dropped significantly — but at a cost:\nFragmentation: CMS didn\u0026rsquo;t compact the heap (no relocation). Over time, the old gen becomes fragmented, triggering a \u0026ldquo;concurrent mode failure\u0026rdquo; which falls back to a full stop-the-world compact — often worse than if you\u0026rsquo;d never used CMS. Complexity: Tuning CMS required understanding initiating occupancy thresholds, incremental mode, and other knobs most teams didn\u0026rsquo;t have time to learn. 
CPU overhead: Concurrent phases consume significant CPU alongside the application. CMS was deprecated in Java 9 and removed in Java 14. If you\u0026rsquo;re still on it, that\u0026rsquo;s your migration trigger.\nPermGen → Metaspace (Java 8) # PermGen was a fixed-size memory region (outside the heap) storing class metadata. The classic OutOfMemoryError: PermGen space showed up in large applications deploying many classloaders (app servers, OSGi, Groovy-heavy systems). Java 8 replaced it with Metaspace — native memory, grows dynamically. The OOM still happens, just with OutOfMemoryError: Metaspace instead, and is much rarer.\nJava 9: G1 Becomes the Default # G1 (Garbage First) had been around since Java 7 but became the default in Java 9. It represents a fundamentally different approach: instead of a contiguous young/old gen layout, G1 divides the heap into equal-sized regions (~1–32MB each). Young and old generations are still logical concepts, but physically they\u0026rsquo;re sets of regions.\nWhy this matters:\nG1 can predict and meet pause time targets (-XX:MaxGCPauseMillis=200). It achieves this by only collecting enough regions to stay within the pause budget. Handles large heaps (10GB+) better than Parallel/CMS because it can work incrementally. Compacts the heap during collection (no fragmentation like CMS). G1\u0026rsquo;s weak spot: raw throughput. Concurrent marking and per-region bookkeeping cost CPU, so throughput takes a hit compared to Parallel GC. For batch jobs or anything throughput-oriented, Parallel GC still wins on raw numbers.\nFor most Spring Boot services: G1 is the right default. It\u0026rsquo;s well-understood, has great tooling, and the 10–200ms pause targets are acceptable for typical microservice workloads.\nJava 11: ZGC and Epsilon Enter # ZGC (Experimental in Java 11) # ZGC is designed around one constraint: pause times under 10ms regardless of heap size. 
It achieves this by doing almost all work concurrently with the application, including relocation (moving objects). The pause phases (mark start, mark end, relocate start) are bounded and short.\nHow ZGC achieves this: Load barriers + colored pointers. Every reference read goes through a barrier that checks whether an object has been relocated. This has CPU overhead (~15% throughput cost in early versions), but pause times stay flat even on multi-terabyte heaps.\nJava 11–13: Linux x86-64 only, experimental, no generational collection.\nEpsilon GC # A no-op collector. It allocates memory but never frees it. The JVM will OOM once the heap is exhausted.\nSounds useless. It\u0026rsquo;s actually perfect for:\nPerformance benchmarking: Measure raw allocation rate and throughput without GC noise. Compare two algorithms? Run with Epsilon to eliminate GC variability. Ultra-short-lived JVMs: Serverless functions, CLI tools that run for \u0026lt;1 second. If the JVM exits before the heap fills, you paid zero GC overhead. Diagnosing GC impact: Run with Epsilon to see what your actual GC overhead is. Java 14: CMS Removed, ZGC Goes Multi-Platform # CMS is gone entirely. Any codebase using -XX:+UseConcMarkSweepGC needs to migrate (G1 is the safe default).\nZGC becomes available on macOS and Windows (still experimental).\nJava 15: ZGC and Shenandoah Go Production-Ready # Shenandoah GC # Developed by Red Hat, Shenandoah has similar goals to ZGC: sub-millisecond pauses, concurrent relocation. The implementation differs — Shenandoah uses forwarding pointers rather than colored pointers.\nZGC vs Shenandoah: Both aim for ultra-low pauses. Shenandoah tends to perform better on smaller heaps; ZGC on very large heaps. 
In practice, both are production-viable — your choice often comes down to which JVM distribution you\u0026rsquo;re running (Shenandoah ships in Red Hat and most OpenJDK distributions, but not in Oracle\u0026rsquo;s JDK builds).\nBoth become non-experimental in Java 15.\nJava 17: G1 and ZGC Improvements # ZGC gains dynamic scaling of GC threads (previously fixed count) G1 improvements for better throughput and reduced native memory overhead ZUncommit — ZGC can now return unused heap memory to the OS (important in containerized environments where memory limits are strict) Java 21: Generational ZGC — The Big Deal # Before Java 21, ZGC collected the entire heap on every cycle. This was intentional (simpler, easier to get right), but had a cost: high throughput overhead because most objects die young and are being collected alongside long-lived objects.\nGenerational ZGC adds the standard generational hypothesis optimization — separate young/old generations — to ZGC\u0026rsquo;s concurrent, low-pause foundation. Result:\nYoung gen collections are fast and frequent (most objects die young) Old gen is collected less often Throughput overhead drops from ~15% to single digits Pause times remain sub-millisecond This removes the primary reason teams stayed on G1 instead of ZGC. You now get both low pauses and competitive throughput.\nEnable it: -XX:+UseZGC -XX:+ZGenerational (Java 21; generational mode becomes the ZGC default in Java 23).\nThe Decision Tree: Which GC to Pick # Is this a batch job / ETL / throughput-only workload? YES → Parallel GC NO ↓ Is this a standard Spring Boot / microservice? YES, Java \u0026lt; 21 → G1 (default, well-understood, good tooling) YES, Java 21+ → G1 or Generational ZGC (worth benchmarking) NO ↓ Do you have hard p99 latency requirements (\u0026lt; 10ms GC pauses)? YES → ZGC (Java 21: Generational ZGC) Consider Shenandoah if on Red Hat / OpenJDK distro Large heap (10GB+) with latency requirements? 
→ ZGC is the clear winner; G1 pauses grow with heap size Performance benchmarking / short-lived JVM? → Epsilon EM-Level Interview Questions and How to Answer Them # \u0026ldquo;Your service has p99 latency spikes every few minutes. How do you diagnose?\u0026rdquo;\nEnable GC logging: -Xlog:gc*:file=gc.log:time,uptime:filecount=5,filesize=20m. Look for long stop-the-world pauses correlating with the latency spikes. Check allocation rate and promotion rate — if the old gen fills too fast, minor GCs promote too aggressively, leading to major GC pressure.\n\u0026ldquo;When would increasing heap size make things worse?\u0026rdquo;\nWith non-concurrent collectors (Parallel GC), a larger heap means less frequent but longer GCs. If you\u0026rsquo;re already on a 16GB heap with G1, doubling to 32GB might push major GC pauses from 200ms to 400ms. With ZGC this is less of a concern — pause times don\u0026rsquo;t scale with heap size.\n\u0026ldquo;Container memory limits and JVM heap — what\u0026rsquo;s the gotcha?\u0026rdquo;\nBefore container awareness (added in Java 10, backported to 8u191), the JVM sized the heap from total host memory: inside a container with a 2GB limit, it saw the host\u0026rsquo;s 64GB and sized the heap to 16GB+ — instantly getting OOM-killed. Modern JVMs respect cgroup limits, but the default heap (25% of the limit) is often too conservative. Fix: -XX:MaxRAMPercentage=75 (Java 10+) or explicit -Xmx. Also, Metaspace, DirectByteBuffer, thread stacks, and JIT code cache all consume memory outside the heap — your container limit needs headroom for all of them.\n\u0026ldquo;Virtual threads and GC — what\u0026rsquo;s the relationship?\u0026rdquo;\nVirtual threads are cheap to create, which means applications can create millions of them. Each virtual thread has its own stack, which is heap-allocated in small chunks. This increases object allocation rate significantly. Generational collectors handle this well (short-lived stacks in young gen die quickly). 
This is partly why generational ZGC in Java 21 is so timely — the Loom era increases GC pressure, and generational collection is the right answer.\nQuick Reference: GC Flags # # Enable G1 (default Java 9+) -XX:+UseG1GC -XX:MaxGCPauseMillis=200 # Enable ZGC (Java 15+ production-ready) -XX:+UseZGC # Enable Generational ZGC (Java 21) -XX:+UseZGC -XX:+ZGenerational # Enable Shenandoah -XX:+UseShenandoahGC # Container-aware heap sizing -XX:MaxRAMPercentage=75 # GC logging (essential in prod) -Xlog:gc*:file=gc.log:time,uptime:filecount=5,filesize=20m # Diagnose virtual thread pinning -Djdk.tracePinnedThreads=full ","date":"7 April 2026","externalUrl":null,"permalink":"/posts/java/jvm-gc-evolution/","section":"Posts","summary":"Garbage collection is one of those topics where “I let the JVM handle it” is a perfectly valid answer until it isn’t — and for EMs, that inflection point usually shows up as unexplained latency spikes in production, OOM kills in containers, or a team paralyzed by which GC flag to tweak. Here’s the full picture from Java 8 through 21.\n","title":"JVM Garbage Collection: From Java 8 to 21","type":"posts"},{"content":"Java has changed dramatically since Java 8. As an engineering manager, you don\u0026rsquo;t need to recite the JLS — but you do need to understand why these features exist, the trade-offs they carry, and how they affect the decisions your team makes every day. Here\u0026rsquo;s a curated tour.\nJava 8 — The Paradigm Shift # Java 8 is the most impactful release since generics. Almost everything that followed builds on it.\nLambdas and Functional Interfaces # Lambdas are syntactic sugar over single-method interfaces (@FunctionalInterface). 
The four core ones you\u0026rsquo;ll see constantly:\nFunction\u0026lt;String, Integer\u0026gt; f = String::length; // T -\u0026gt; R Predicate\u0026lt;String\u0026gt; p = s -\u0026gt; s.isEmpty(); // T -\u0026gt; boolean Consumer\u0026lt;String\u0026gt; c = System.out::println; // T -\u0026gt; void Supplier\u0026lt;String\u0026gt; s = () -\u0026gt; \u0026#34;hello\u0026#34;; // () -\u0026gt; T Why it matters: Enables passing behavior as data, which unlocks the Streams API and CompletableFuture composition. It also pushed teams to think in terms of pipelines rather than loops — a meaningful shift in how code reads.\nStreams API # Streams are lazy, composable, single-use sequences. The canonical pattern:\nlist.stream() .filter(s -\u0026gt; s.startsWith(\u0026#34;A\u0026#34;)) .map(String::toUpperCase) .collect(Collectors.toList()); Parallel streams are the footgun. parallelStream() uses the common ForkJoinPool — shared across the entire JVM. If one slow operation blocks threads, everything using that pool degrades. For most services doing I/O-bound work, parallel streams add overhead rather than saving it. Use them only for CPU-bound, large-dataset operations where the overhead of thread coordination is worth it.\nThe EM question: \u0026ldquo;Your team added .parallelStream() everywhere to speed things up. Now performance is worse under load. Why?\u0026rdquo; — The answer is ForkJoinPool saturation and false assumption of CPU-bound workloads.\nOptional # Optional\u0026lt;T\u0026gt; exists to force callers to acknowledge the possibility of absence. It\u0026rsquo;s not a null replacement everywhere — it\u0026rsquo;s a return type signal.\n// Good: return type communicates nullable Optional\u0026lt;User\u0026gt; findById(String id) { ... } // Bad: method parameter void process(Optional\u0026lt;User\u0026gt; user) { ... 
} // just use @Nullable or overloads Anti-pattern: optional.get() without isPresent() — you\u0026rsquo;ve just traded a NullPointerException for a NoSuchElementException. Use orElse(), orElseGet(), or ifPresent().\nCompletableFuture # This is Java\u0026rsquo;s model for composing async operations without callback hell:\nCompletableFuture.supplyAsync(() -\u0026gt; fetchUser(id)) .thenApply(user -\u0026gt; enrichWithProfile(user)) // sync transform .thenCompose(user -\u0026gt; fetchOrders(user.id())) // async chaining (flatMap) .exceptionally(ex -\u0026gt; fallbackUser()); thenApply vs thenCompose: thenApply wraps the result (T → U), thenCompose unwraps a returned future (T → CompletableFuture\u0026lt;U\u0026gt;). Getting this wrong gives you CompletableFuture\u0026lt;CompletableFuture\u0026lt;T\u0026gt;\u0026gt;.\nProduction pitfall: exceptionally only handles one stage. If you need consistent error handling across a chain, use handle(). Also, default execution uses ForkJoinPool — pass an explicit executor for I/O operations.\nDefault Methods in Interfaces # Allowed retrofitting new behavior into existing interfaces without breaking all implementations. The Comparator.comparing() static factory and stream-friendly Collection methods (.forEach, .removeIf, .stream()) rely on this.\nDiamond problem: If two interfaces provide the same default method, the implementing class must override it. Design-time decision: default methods are for backwards-compatible evolution, not primary behavior.\njava.time (JSR-310) # java.util.Date was broken by design: mutable, epoch-based, poor timezone support. The new API:\nLocalDate, LocalTime, LocalDateTime — no timezone ZonedDateTime, OffsetDateTime — with timezone Instant — machine time (epoch nanos) Duration, Period — elapsed time Always store and transmit as Instant or OffsetDateTime in UTC. 
Convert to ZonedDateTime only for display.\nJava 9–11 # Project Jigsaw (Modules) # The module system (module-info.java) solves two problems: strong encapsulation (hiding internal APIs) and reliable configuration (explicit dependency graph). In practice, most teams skip it unless building frameworks or reducing attack surface. The classpath still works fine. Know it exists, know why it exists, don\u0026rsquo;t mandate it without a reason.\nvar — Local Variable Type Inference # var users = new ArrayList\u0026lt;User\u0026gt;(); // clear var x = process(); // bad — what is x? var is a compile-time feature — the type is inferred and fixed. It doesn\u0026rsquo;t make Java dynamically typed. When the right-hand side is obvious, it improves readability. When it hides type information, it hurts. Code review guideline: if a reviewer can\u0026rsquo;t tell the type at a glance, spell it out.\nHttpClient # Replaced HttpURLConnection with a modern API supporting HTTP/1.1, HTTP/2, and WebSocket, with both sync and async modes:\nHttpClient client = HttpClient.newHttpClient(); HttpResponse\u0026lt;String\u0026gt; resp = client.send( HttpRequest.newBuilder(URI.create(\u0026#34;https://api.example.com\u0026#34;)).build(), HttpResponse.BodyHandlers.ofString() ); Collection Factory Methods # List\u0026lt;String\u0026gt; names = List.of(\u0026#34;Alice\u0026#34;, \u0026#34;Bob\u0026#34;); // immutable Map\u0026lt;String, Integer\u0026gt; map = Map.of(\u0026#34;a\u0026#34;, 1, \u0026#34;b\u0026#34;, 2); // immutable, up to 10 entries Key implication: these are truly immutable — UnsupportedOperationException on mutation. Don\u0026rsquo;t pass them to code that tries to add/remove. 
Also, Map.of does not guarantee insertion order.\nJava 12–17 # Records # record Point(int x, int y) {} Records are transparent data carriers: immutable, with auto-generated constructor, accessors, equals, hashCode, toString.\nWhen to use: DTOs, value objects, data transfer in APIs, method return types grouping related values.\nWhen not: when you need mutable state or inheritance hierarchies. (Validation is not a blocker: a compact constructor can reject bad values.)\nvs Lombok @Value: Records are language-level, no annotation processor needed, slightly less flexible. For greenfield Java 16+, prefer records.\nSealed Classes # sealed interface Shape permits Circle, Rectangle, Triangle {} record Circle(double radius) implements Shape {} record Rectangle(double w, double h) implements Shape {} The compiler knows all permitted subtypes, which means switch expressions can be exhaustively checked. This is the foundation for type-safe domain modeling:\ndouble area = switch (shape) { case Circle c -\u0026gt; Math.PI * c.radius() * c.radius(); case Rectangle r -\u0026gt; r.w() * r.h(); case Triangle t -\u0026gt; /* ... */; // No default needed — compiler verifies exhaustiveness }; EM framing: Sealed classes + records replace the \u0026ldquo;sum type\u0026rdquo; pattern you\u0026rsquo;d use in Kotlin or Scala. They make illegal states unrepresentable.\nPattern Matching for instanceof # // Before if (obj instanceof String) { String s = (String) obj; System.out.println(s.length()); } // After if (obj instanceof String s) { System.out.println(s.length()); } Eliminates the redundant cast. Seemingly minor, but it pairs powerfully with switch patterns.\nText Blocks # String json = \u0026#34;\u0026#34;\u0026#34; { \u0026#34;name\u0026#34;: \u0026#34;Alice\u0026#34;, \u0026#34;role\u0026#34;: \u0026#34;admin\u0026#34; } \u0026#34;\u0026#34;\u0026#34;; Indentation is stripped to the level of the closing \u0026quot;\u0026quot;\u0026quot;. Useful for SQL, JSON templates, HTML in tests. 
Watch out for trailing whitespace and the escape sequences (\\s to preserve trailing space, \\ for line continuation).\nSwitch Expressions # int numLetters = switch (day) { case MONDAY, FRIDAY, SUNDAY -\u0026gt; 6; case TUESDAY -\u0026gt; 7; default -\u0026gt; { System.out.println(\u0026#34;Other: \u0026#34; + day); yield day.toString().length(); } }; yield returns a value from a block. Arrow cases don\u0026rsquo;t fall through. The compiler enforces exhaustiveness for enums.\nJava 17–21 — The Big Convergence # Virtual Threads (Project Loom) — Java 21 GA # This is the most architecturally significant Java feature since Java 5 concurrency utilities.\nPlatform threads are 1:1 with OS threads. They\u0026rsquo;re expensive (~1MB stack) and blocking them wastes resources. Traditional solutions: async/reactive programming (Reactor, RxJava) — powerful but complex to write, debug, and hire for.\nVirtual threads are JVM-managed, extremely lightweight (KBs). The JVM parks a virtual thread when it blocks on I/O and reassigns the carrier (OS) thread to another virtual thread. Result: you can have millions of virtual threads without exhausting OS resources.\n// Before: thread pool with 200 threads handling 200 concurrent requests ExecutorService pool = Executors.newFixedThreadPool(200); // After: one virtual thread per request, JVM handles the rest ExecutorService vThreadPool = Executors.newVirtualThreadPerTaskExecutor(); When virtual threads win: I/O-bound workloads — HTTP calls, database queries, file I/O. Thread-per-request model becomes viable even at high concurrency.\nWhen they don\u0026rsquo;t help: CPU-bound work. If your code is burning cycles, virtual threads don\u0026rsquo;t add parallelism — you\u0026rsquo;re still bound by CPU cores. 
Also, blocking inside a synchronized block or inside native code pins the virtual thread to its carrier (some JDBC drivers do both), which can negate the benefit.\nSpring Boot 3.2: spring.threads.virtual.enabled=true — Tomcat runs on virtual threads. Most teams can adopt this and get most of WebFlux\u0026rsquo;s throughput benefits with none of the reactive complexity.\nvs Reactive (WebFlux): Virtual threads win on simplicity and debuggability (normal stack traces). Reactive wins when you need backpressure, streaming, or are already invested in the reactive ecosystem.\nStructured Concurrency (StructuredTaskScope) — Preview # Treats concurrent tasks as a unit — if one fails, others are cancelled. Much cleaner error handling than CompletableFuture chains:\ntry (var scope = new StructuredTaskScope.ShutdownOnFailure()) { var user = scope.fork(() -\u0026gt; fetchUser(id)); var orders = scope.fork(() -\u0026gt; fetchOrders(id)); scope.join().throwIfFailed(); return new UserWithOrders(user.get(), orders.get()); } Note: in the Java 21 preview, fork returns a Subtask, not a Future. Scoped Values — Preview # Replacement for ThreadLocal in the virtual thread world. ThreadLocal can be problematic with virtual threads (inheritance semantics, memory leaks if not cleaned up). 
ScopedValue is immutable and bound to a scope:\nScopedValue\u0026lt;User\u0026gt; CURRENT_USER = ScopedValue.newInstance(); ScopedValue.where(CURRENT_USER, user).run(() -\u0026gt; handleRequest()); Pattern Matching for Switch (Java 21 GA) # Combines sealed classes + records + switch:\nString describe(Object obj) { return switch (obj) { case Integer i when i \u0026gt; 0 -\u0026gt; \u0026#34;positive int: \u0026#34; + i; case String s -\u0026gt; \u0026#34;string of length \u0026#34; + s.length(); case null -\u0026gt; \u0026#34;null\u0026#34;; default -\u0026gt; \u0026#34;other\u0026#34;; }; } SequencedCollection # Finally a common interface for ordered collections:\ninterface SequencedCollection\u0026lt;E\u0026gt; extends Collection\u0026lt;E\u0026gt; { E getFirst(); E getLast(); void addFirst(E e); void addLast(E e); E removeFirst(); E removeLast(); SequencedCollection\u0026lt;E\u0026gt; reversed(); } List, Deque, LinkedHashSet, LinkedHashMap now all share this interface.\nEM-Level Migration Discussion # If asked \u0026ldquo;how would you move a Java 8 codebase to 21?\u0026rdquo; the answer is incremental, bounded, tested:\nJava 11 first — LTS, low-risk. Fix deprecations (sun.* APIs), add var where it helps, adopt HttpClient. Java 17 next — LTS, sealed classes + records, switch expressions. Add module-info only if needed. Java 21 — Virtual threads is the prize. Enable in Spring Boot 3.2+. Test for pinning issues (-Djdk.tracePinnedThreads=full). At each step: automated test coverage is your safety net. No test coverage = no migration confidence.\n","date":"7 April 2026","externalUrl":null,"permalink":"/posts/java/java-8-to-21-language-features/","section":"Posts","summary":"Java has changed dramatically since Java 8. As an engineering manager, you don’t need to recite the JLS — but you do need to understand why these features exist, the trade-offs they carry, and how they affect the decisions your team makes every day. 
Here’s a curated tour.\n","title":"Java 8 to 21: Language Features Every EM Should Know","type":"posts"},{"content":"In Java programming, generics provide a way to create reusable classes, methods, and interfaces with type parameters. They allow us to design components that can work with any data type, providing type safety and flexibility. In this blog post, we will explore the use of generics in creating a data structure from scratch, emphasizing object-oriented programming principles and step-by-step explanations.\nUnderstanding Generics # Generics in Java enable us to define classes, interfaces, and methods with placeholder types. These types are specified when the component is used, allowing for flexibility and type safety at compile time. By using generics, we can create data structures that can store and manipulate various types of objects without sacrificing type safety.\nCreating a Generic Data Structure: LinkedList # Let\u0026rsquo;s consider the creation of a generic linked list data structure. LinkedList is a fundamental data structure consisting of nodes where each node contains data and a reference to the next node in the sequence. We will implement a simplified version of LinkedList using generics.\nStep 1: Designing the Node Class # The first step is to design the node class. Each node will hold a piece of data of type T and a reference to the next node.\npublic class Node\u0026lt;T\u0026gt; { private T data; private Node\u0026lt;T\u0026gt; next; public Node(T data) { this.data = data; this.next = null; } // Getters and setters for data and next } In the Node class, T represents the type of data the node will hold. 
We use \u0026lt;T\u0026gt; to indicate that it is a generic type.\nStep 2: Implementing the LinkedList Class # Next, we implement the LinkedList class, which will manage the nodes and provide operations to manipulate the list.\npublic class LinkedList\u0026lt;T\u0026gt; { private Node\u0026lt;T\u0026gt; head; public LinkedList() { this.head = null; } // Methods to add, remove, search, and traverse the list } In the LinkedList class, we use Node\u0026lt;T\u0026gt; to specify that the list will contain nodes holding data of type T.\nStep 3: Adding Functionality # We can now add functionality to our LinkedList class, including methods to add elements, remove elements, search for elements, and traverse the list.\npublic void add(T data) { Node\u0026lt;T\u0026gt; newNode = new Node\u0026lt;\u0026gt;(data); if (head == null) { head = newNode; } else { Node\u0026lt;T\u0026gt; current = head; while (current.getNext() != null) { current = current.getNext(); } current.setNext(newNode); } } // Other methods like remove, search, traverse Step 4: Using the Generic LinkedList # Finally, we can use our generic LinkedList to store and manipulate various types of data.\npublic static void main(String[] args) { LinkedList\u0026lt;Integer\u0026gt; integerList = new LinkedList\u0026lt;\u0026gt;(); integerList.add(5); integerList.add(10); LinkedList\u0026lt;String\u0026gt; stringList = new LinkedList\u0026lt;\u0026gt;(); stringList.add(\u0026#34;Hello\u0026#34;); stringList.add(\u0026#34;World\u0026#34;); } In the main method, we create instances of LinkedList with different data types (Integer and String) and add elements to them. Thanks to generics, the LinkedList class remains flexible and type-safe.\nConclusion # Generics in Java are a powerful feature that enables us to create reusable and type-safe components. By using generics, we can design data structures and algorithms that work with any data type, providing flexibility and type safety at compile time. 
In this blog post, we explored the use of generics in creating a generic LinkedList data structure from scratch, emphasizing object-oriented programming principles and step-by-step explanations. With generics, Java developers can write more robust and flexible code, enhancing code reusability and maintainability.\n","date":"27 October 2024","externalUrl":null,"permalink":"/posts/java/using-generics-for-datastructures/","section":"Posts","summary":"In Java programming, generics provide a way to create reusable classes, methods, and interfaces with type parameters. They allow us to design components that can work with any data type, providing type safety and flexibility. In this blog post, we will explore the use of generics in creating a data structure from scratch, emphasizing object-oriented programming principles and step-by-step explanations.\n","title":"Using Generics for Datastructures","type":"posts"},{"content":"System Design · Posts\nAbout nSkillHub # A passion-driven space for learning — system design, Java, Spring, and software engineering best practices. As a software engineer with years of experience, this blog shares insights, deep-dives, and interview prep material. Future topics will expand into movies, photography, travel, and more. Stay tuned!\nContact Lakshay on LinkedIn\n","externalUrl":null,"permalink":"/","section":"","summary":"System Design · Posts\nAbout nSkillHub # A passion-driven space for learning — system design, Java, Spring, and software engineering best practices. As a software engineer with years of experience, this blog shares insights, deep-dives, and interview prep material. Future topics will expand into movies, photography, travel, and more. Stay tuned!\n","title":"","type":"page"},{"content":"","externalUrl":null,"permalink":"/all-posts/","section":"","summary":"","title":"All Posts","type":"page"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}]