Scaling is not a synonym for “add more servers.” Each scaling lever has different costs, trade-offs, and appropriate circumstances. Reaching for the wrong one wastes money, adds complexity, or misses the actual bottleneck.
Vertical vs Horizontal: When Each Makes Sense #
Vertical Scaling (Scale Up) #
Add more CPU, RAM, or faster storage to the existing instance.
Vertical wins when:
- You’re early stage and operational simplicity matters — one big instance is dramatically easier to operate than a distributed cluster
- The workload is hard to parallelize (stateful, requires shared memory, complex coordination)
- You have a single-node database that can’t shard easily — scaling vertical is often faster and safer than sharding
- The cost per unit of performance is better vertical than horizontal at your current scale
- You have a resource bottleneck (CPU-bound → more cores; memory-bound → more RAM) that’s clearly addressable vertically
Modern cloud instances are powerful. An r7g.16xlarge on AWS has 64 vCPUs and 512GB RAM. Many “distributed systems” problems are actually premature — a single well-specced Postgres instance handles more than teams think.
Vertical ceiling: Every instance family has a maximum size. When you hit it, horizontal is the only option. Also, vertical scaling usually requires a restart (brief downtime) to resize the instance.
Horizontal Scaling (Scale Out) #
Add more instances behind a load balancer. The application must be stateless (or state must be externalized — Redis for sessions, S3 for uploads, DB for everything else).
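Externalizing state is the prerequisite for scale-out. A minimal sketch of the idea, with an in-memory `SessionStore` standing in for Redis (the class and handler names are illustrative, not from any framework):

```python
# Sketch: externalizing session state so any instance can serve any request.
# SessionStore stands in for Redis; the handler itself keeps no state.

class SessionStore:
    """In-memory stand-in for an external store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session


def handle_request(store, session_id):
    # Any instance behind the load balancer runs this same code: the
    # session lives in the external store, not in process memory.
    session = store.get(session_id) or {"views": 0}
    session["views"] += 1
    store.put(session_id, session)
    return session["views"]
```

Because `handle_request` reads and writes only through the store, two requests from the same user can land on different instances and still see consistent session state.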
Horizontal wins when:
- The workload is parallelizable and stateless
- You need high availability (if one instance dies, others serve traffic)
- You’ve exhausted or are close to the vertical ceiling
- You have autoscaling requirements (scale in/out dynamically with traffic)
- Different components need to scale independently (API tier vs worker tier)
The Scaling Order: What to Reach for First #
Given a scaling bottleneck, apply in this order. Each step costs less in complexity than the next.
1. Optimize first — profiling often reveals the real bottleneck.
Missing index? N+1 query? Over-fetching? Fix it.
2. Vertical scaling — upgrade the instance. No code changes.
3. Caching — eliminate the bottleneck entirely for reads.
A cache hit costs ~1ms vs 50ms DB query.
4. Read replicas — distribute read traffic.
Works for read-heavy workloads (most are).
5. Connection pooling — PgBouncer/Hikari tuning.
Often the bottleneck before the DB itself.
6. Asynchronous processing — offload work.
Non-critical writes → queue → worker → DB.
7. Horizontal scaling of the application tier.
Stateless services scale easily. Add pods.
8. Database sharding or distributed DB.
Last resort. High complexity, high operational cost.

Don’t skip to step 8 because you’ve heard “at scale we’ll need sharding.” Most systems never reach that scale. Over-engineering for 10x-100x future load is the most common scaling mistake.
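Step 1 is worth a concrete illustration. The sketch below uses an in-memory SQLite database and a hypothetical users/orders schema to show the classic N+1 pattern next to its single-query fix:

```python
import sqlite3

# Hypothetical schema: users and their orders.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

def totals_n_plus_one(db):
    # N+1: one query for the users, then one query per user.
    totals = {}
    for (uid,) in db.execute("SELECT id FROM users"):
        (t,) = db.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()
        totals[uid] = t
    return totals

def totals_single_query(db):
    # Fixed: one aggregate query with a join. One round trip.
    rows = db.execute(
        "SELECT u.id, COALESCE(SUM(o.total), 0) "
        "FROM users u LEFT JOIN orders o ON o.user_id = u.id "
        "GROUP BY u.id"
    )
    return dict(rows)
```

Both return the same totals, but the first issues N+1 round trips and the second issues one. On a real network, per-round-trip latency dominates, so the fix often beats any amount of added hardware.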
Read Replicas vs Caching vs Sharding #
Read Replicas:
- Copies of the database that serve reads. Primary handles writes.
- Eventually consistent — replicas lag behind the primary (usually milliseconds, can be more under heavy load)
- Works well when: most queries are reads, you don’t need read-after-write consistency on all reads
- Cost: you pay for the replica instance. With Aurora you pay per read replica.
- Limitation: writes still bottleneck at the primary
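The read-after-write caveat usually lives in a routing layer. A minimal sketch, with strings standing in for real connection pools (the class and method names are illustrative):

```python
import itertools

# Sketch: route writes to the primary, round-robin reads across replicas.
# The backends here are strings; in practice they are connection pools.

class ReadWriteRouter:
    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)

    def backend_for(self, sql, needs_read_after_write=False):
        # Writes, and reads that need read-after-write consistency, go to
        # the primary; everything else goes to a replica, which may lag.
        if needs_read_after_write or not sql.lstrip().upper().startswith("SELECT"):
            return self._primary
        return next(self._replicas)
```

The `needs_read_after_write` flag is the important part: “show me the profile I just edited” must hit the primary, while “browse the catalog” can tolerate replica lag.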
Caching:
- Eliminates DB reads entirely for frequently accessed, cacheable data
- Hit rate is the key metric — aim for > 90% for it to be worth it
- Works well for: lookup data, computed results, session data, anything where the same query is repeated
- Cost: Redis instance + cache invalidation complexity
- Caching before read replicas often makes more sense — a cache hit is faster than a replica query, and the operational complexity is similar
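The standard pattern here is cache-aside with a TTL, instrumented for the hit rate the text calls the key metric. A sketch, with a dict standing in for Redis and `load_fn` standing in for the slow DB query:

```python
import time

# Sketch of cache-aside with a TTL and a hit-rate counter.
# The dict stands in for Redis; load_fn is the (slow) DB query.

class CacheAside:
    def __init__(self, load_fn, ttl_seconds=60.0):
        self._load = load_fn
        self._ttl = ttl_seconds
        self._store = {}          # key -> (value, expires_at)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self._load(key)   # miss: fall through to the DB
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

TTL expiry is the simplest invalidation strategy; explicit invalidation on write is more precise but is where most of the “cache invalidation complexity” lives.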
Sharding:
- Horizontal partitioning of the database. Data for user IDs 0-999999 goes to shard 1, 1000000-1999999 to shard 2.
- Enables write scale-out — each shard handles a fraction of the write load
- Massive operational complexity: cross-shard queries don’t exist (or require scatter-gather), resharding is painful, hot shards require rebalancing
- Alternatives to hand-rolled sharding: Citus (Postgres extension), CockroachDB, PlanetScale (MySQL), Vitess
- You probably don’t need this unless you have hundreds of thousands of writes per second
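The range-based routing described above fits in a few lines, which is deceptive — the routing is trivial, and everything around it (resharding, scatter-gather, rebalancing) is the hard part:

```python
# Sketch of the range-based shard routing from the text: user IDs
# 0-999999 on shard 1, 1000000-1999999 on shard 2, and so on.

USERS_PER_SHARD = 1_000_000

def shard_for(user_id, num_shards):
    # Shards are numbered from 1 to match the example above.
    shard = user_id // USERS_PER_SHARD + 1
    if shard > num_shards:
        raise ValueError(f"user_id {user_id} is outside the configured ranges")
    return shard
```

A query scoped to one user hits exactly one shard; a query across users (say, “top spenders this month”) must fan out to every shard and merge — the scatter-gather cost mentioned above.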
The Hot Partition Problem #
In Kafka, DynamoDB, Cassandra, or any partitioned system: a “hot” partition receives disproportionate traffic while others are idle. This creates a bottleneck on a single node regardless of how many nodes you have.
Causes:
- Partitioning by a low-cardinality key (partitioning an events table by `event_type` when 95% of events are `PAGE_VIEW`)
- Celebrity / power user effect: one user’s data getting 1000x more traffic than average
- Temporal patterns: partitioning by date and every write goes to today’s partition
Solutions:
- Salting: Add a random suffix to the partition key (`user_id_0`, `user_id_1`, …, `user_id_N`). Distributes writes across N partitions. Reads must query all N and merge.
- Write sharding with read-time aggregation: Write counters to multiple shards, sum at read time.
- Application-level rate limiting: Limit writes to a hot user/entity at the application layer before they hit the data store.
- Adaptive partitioning: Some systems (DynamoDB, Cosmos DB) auto-split hot partitions. Know whether your system supports this.
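Salting is simple enough to sketch end to end. Here a dict stands in for the partitioned store, writes pick a random suffix, and reads fan out across all N salted keys and merge:

```python
import random
from collections import defaultdict

# Sketch of key salting: writes spread across N sub-partitions of a hot
# key; reads fan out to all N and merge. The dict stands in for the store.

N_SALTS = 4
store = defaultdict(list)  # salted key -> list of events

def write(user_id, event):
    salt = random.randrange(N_SALTS)      # random suffix spreads writes
    store[f"{user_id}_{salt}"].append(event)

def read_all(user_id):
    # Reads must query every salted key and merge the results.
    merged = []
    for salt in range(N_SALTS):
        merged.extend(store[f"{user_id}_{salt}"])
    return merged
```

This is the trade baked into salting: write throughput improves by roughly N, read cost grows by roughly N. It only pays off when the hot key is write-heavy.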
CPU-Bound vs I/O-Bound Scaling #
I/O-bound services (waiting for DB, HTTP calls, disk):
- Threads spend most time waiting, not executing
- Horizontal scaling (more instances) helps — each instance handles more requests
- Virtual threads (Java 21) or async I/O reduces the thread count needed
- Read replicas and caching reduce the wait time per request
CPU-bound services (image processing, ML inference, cryptography, complex computation):
- Threads are executing, not waiting
- More cores = more throughput (vertical scale or more instances)
- Virtual threads don’t help — CPU is the constraint, not thread scheduling
- Consider: offload CPU-intensive work to dedicated workers, GPU instances for ML workloads, precomputation and caching of results
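The “precomputation and caching of results” option is the cheapest of the three when the work is a pure function of its inputs. A minimal sketch using `functools.lru_cache`, with the summation standing in for real CPU-bound work:

```python
from functools import lru_cache

# Sketch of "precompute and cache" for CPU-bound work: an expensive pure
# function is computed once per input and served from memory afterwards.

@lru_cache(maxsize=1024)
def expensive(n):
    # Stand-in for real CPU-bound work (image transform, inference, ...).
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total
```

This only works when results are deterministic and inputs repeat; for unique inputs (e.g. every uploaded image is different), offloading to dedicated workers or GPU instances is the better lever.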
Autoscaling: What Metric to Scale On #
The choice of scaling metric determines how well autoscaling responds to load.
CPU utilization (most common):
- Works for CPU-bound services
- Lags for I/O-bound services — threads are waiting, CPU is low, but latency is high
- Scale trigger: CPU > 70% → add instances
Request queue depth / pending messages:
- Better for queue consumer workers
- “When the queue has > 1000 messages, add consumers”
- Direct signal that work is backing up
Custom business metrics:
- Scale on “requests in flight” or “P95 latency > 200ms”
- Requires custom metrics export (Prometheus → KEDA, CloudWatch → ASG)
- Most accurate but requires instrumentation
Memory utilization:
- Rarely the right primary scaling metric (memory doesn’t correlate with load the same way)
- Useful as a ceiling alarm (OOM prevention), not a scale trigger
Best practice: For API services, scale on CPU + request rate. For async workers, scale on queue depth. Set minimum instances high enough to handle baseline load without cold-start latency on scale-out. Test autoscaling behavior with load tests — not just at steady state but at scale-up and scale-down transitions.
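Most autoscalers, including the Kubernetes HorizontalPodAutoscaler, reduce the metric choice above to one proportional rule: desired = ceil(current_replicas × current_metric / target_metric), clamped to configured bounds. A sketch of that rule, usable with CPU, queue depth, or a custom metric (the bound defaults here are illustrative):

```python
import math

# The proportional scaling rule used by the Kubernetes HPA:
#   desired = ceil(current_replicas * current_metric / target_metric)
# It works with any metric: CPU %, queue depth, requests in flight.

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to configured bounds; min_replicas covers baseline load
    # so scale-out never starts from zero warm capacity.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 instances at 90% CPU against a 70% target scale out to ceil(4 × 90/70) = 6; 4 queue consumers with 250 pending messages against a 1000-per-consumer target scale in toward the minimum.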