Scaling Strategies: A Decision Framework
Scaling is not a synonym for “add more servers.” Each scaling lever has different costs, trade-offs, and appropriate circumstances. Reaching for the wrong one wastes money, adds complexity, or misses the actual bottleneck.
Vertical scaling: add more CPU, RAM, or faster storage to the existing instance.
Vertical wins when:
- You’re early stage and operational simplicity matters — one big instance is dramatically easier to operate than a distributed cluster
- The workload is hard to parallelize (stateful, requires shared memory, complex coordination)
- You have a single-node database that can’t shard easily — scaling vertically is often faster and safer than sharding
- The cost per unit of performance is better vertically than horizontally at your current scale
- You have a resource bottleneck (CPU-bound → more cores; memory-bound → more RAM) that’s clearly addressable vertically
Modern cloud instances are powerful. An r7g.16xlarge on AWS has 64 vCPUs and 512 GiB of RAM. Many moves to distributed systems are premature — a single well-specced Postgres instance handles more load than most teams expect.
Vertical ceiling: Every instance has a maximum size. When you hit it, horizontal is the only option. Also, vertical scaling usually requires downtime (resize the instance).
Horizontal scaling: add more instances behind a load balancer. The application must be stateless (or state must be externalized — Redis for sessions, S3 for uploads, the DB for everything else).
Horizontal wins when:
- The workload is parallelizable and stateless
- You need high availability (if one instance dies, others serve traffic)
- You’ve exhausted or are close to the vertical ceiling
- You have autoscaling requirements (scale in/out dynamically with traffic)
- Different components need to scale independently (API tier vs worker tier)
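The statelessness requirement can be made concrete with a minimal sketch: session state lives in an external store (a plain dict stands in for Redis here), so any instance behind the load balancer can serve any request. The names (`session_store`, `handle_request`) are illustrative, not from any particular framework.

```python
# Minimal sketch of externalized session state. A plain dict stands in
# for Redis; in production this would be a Redis client with a TTL.
session_store = {}  # shared by all instances (Redis in practice)

def handle_request(instance_id, session_id):
    """Any instance can serve any session because state is external."""
    session = session_store.setdefault(session_id, {"views": 0})
    session["views"] += 1
    return f"instance {instance_id} served view {session['views']}"

# Two different instances handle the same session interchangeably.
print(handle_request("api-1", "sess-42"))  # instance api-1 served view 1
print(handle_request("api-2", "sess-42"))  # instance api-2 served view 2
```

If the session lived in instance memory instead, the load balancer would need sticky sessions, which undermines both autoscaling and failover.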
Given a scaling bottleneck, apply these steps in order. Each costs less in complexity than the next.
1. Optimize first — profiling often reveals the real bottleneck.
Missing index? N+1 query? Over-fetching? Fix it.
2. Vertical scaling — upgrade the instance. No code changes.
3. Caching — eliminate the bottleneck entirely for reads.
A cache hit costs ~1ms vs 50ms DB query.
4. Read replicas — distribute read traffic.
Works for read-heavy workloads (most are).
5. Connection pooling — PgBouncer/Hikari tuning.
Often the bottleneck before the DB itself.
6. Asynchronous processing — offload work.
Non-critical writes → queue → worker → DB.
7. Horizontal scaling of the application tier.
Stateless services scale easily. Add pods.
8. Database sharding or distributed DB.
Last resort. High complexity, high operational cost.
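The latency claim in step 3 can be quantified: with the rough figures above (~1 ms cache hit, ~50 ms DB query — both taken from the text, real numbers vary), effective read latency is just a weighted average over the hit rate.

```python
def effective_latency_ms(hit_rate, hit_ms=1.0, miss_ms=50.0):
    """Expected read latency given a cache hit rate.

    hit_ms/miss_ms default to the rough figures from the text;
    your cache and database will differ.
    """
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# A 90% hit rate cuts average latency from 50 ms to under 6 ms;
# at 50% the cache barely helps, which is why hit rate is the key metric.
print(effective_latency_ms(0.90))  # 5.9
print(effective_latency_ms(0.50))  # 25.5
```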
Don’t skip to step 8 because you’ve heard “at scale we’ll need sharding.” Most systems never reach that scale. Over-engineering for 10x-100x future load is the most common scaling mistake.
Read Replicas:
- Copies of the database that serve reads. Primary handles writes.
- Eventually consistent — replicas lag behind the primary (usually milliseconds, can be more under heavy load)
- Works well when: most queries are reads, you don’t need read-after-write consistency on all reads
- Cost: you pay for the replica instance. With Aurora you pay per read replica.
- Limitation: writes still bottleneck at the primary
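One common way to handle the read-after-write caveat is to pin reads of recently written keys to the primary for a short window. A minimal routing sketch, with class and parameter names of my own choosing (a real router would hold connection pools rather than return strings):

```python
import time

class ReadWriteRouter:
    """Sketch of read/write splitting with read-after-write pinning."""

    def __init__(self, pin_seconds=1.0):
        self.pin_seconds = pin_seconds  # pin reads to primary this long after a write
        self.last_write = {}            # key -> monotonic timestamp of last write

    def route_write(self, key):
        """All writes go to the primary; remember when, per key."""
        self.last_write[key] = time.monotonic()
        return "primary"

    def route_read(self, key):
        """Recently written keys read from the primary so the caller
        never observes replica lag; everything else goes to a replica."""
        ts = self.last_write.get(key)
        if ts is not None and time.monotonic() - ts < self.pin_seconds:
            return "primary"
        return "replica"

router = ReadWriteRouter()
router.route_write("user:7")
print(router.route_read("user:7"))  # primary (pinned after the write)
print(router.route_read("user:8"))  # replica (no recent write)
```

The pin window should be longer than your worst observed replica lag; too long and the primary reabsorbs read traffic.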
Caching:
- Eliminates DB reads entirely for frequently accessed, cacheable data
- Hit rate is the key metric — aim for > 90% for it to be worth it
- Works well for: lookup data, computed results, session data, anything where the same query is repeated
- Cost: Redis instance + cache invalidation complexity
- Caching before read replicas often makes more sense — a cache hit is faster than a replica query, and the operational complexity is similar
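A minimal cache-aside sketch (a dict stands in for Redis, and `load_from_db` is a made-up stand-in for the slow path): check the cache, fall back to the DB on a miss, populate the cache for subsequent reads. Counting hits and misses gives the hit-rate metric directly.

```python
cache = {}                      # stands in for Redis
stats = {"hits": 0, "misses": 0}

def load_from_db(key):
    # Hypothetical slow path: a real implementation would query the DB.
    return f"row-for-{key}"

def get(key):
    """Cache-aside read: cache first, DB on miss, then populate."""
    if key in cache:
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1
    value = load_from_db(key)
    cache[key] = value          # in Redis: SET with a TTL as crude invalidation
    return value

def hit_rate():
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0

get("user:1"); get("user:1"); get("user:2")
print(hit_rate())  # 1 hit out of 3 lookups ≈ 0.33
```

The hard part this sketch omits is invalidation on writes; TTLs bound staleness but do not eliminate it.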
Sharding:
- Horizontal partitioning of the database. Data for user IDs 0-999999 goes to shard 1, 1000000-1999999 to shard 2.
- Enables write scale-out — each shard handles a fraction of the write load
- Massive operational complexity: cross-shard queries don’t exist (or require scatter-gather), resharding is painful, hot shards require rebalancing
- Alternatives to hand-rolled sharding: Citus (Postgres extension), CockroachDB, PlanetScale (MySQL), Vitess
- You probably don’t need this unless you have hundreds of thousands of writes per second
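The range scheme above maps directly to code; a hash-based alternative spreads keys more evenly but gives up range scans. Both are sketched below (shard counts are illustrative, and shards are numbered from 0 rather than the 1-based numbering in the example):

```python
import hashlib

def range_shard(user_id, users_per_shard=1_000_000):
    """Range sharding as in the example: IDs 0-999999 -> shard 0,
    1000000-1999999 -> shard 1. Simple and range-scannable, but with
    sequential IDs the newest shard takes all the write traffic."""
    return user_id // users_per_shard

def hash_shard(user_id, num_shards=8):
    """Hash sharding: even key distribution, no cross-shard range scans.
    md5 (rather than Python's hash()) keeps placement stable across
    processes and restarts."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

print(range_shard(999_999))    # 0
print(range_shard(1_000_000))  # 1
```

Note that with either scheme, changing the shard count moves existing data; that is the resharding pain mentioned above, and the reason consistent hashing and systems like Vitess exist.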
In Kafka, DynamoDB, Cassandra, or any partitioned system: a “hot” partition receives disproportionate traffic while others are idle. This creates a bottleneck on a single node regardless of how many nodes you have.
Causes:
- Partitioning by a low-cardinality key (partitioning an events table by `event_type` when 95% of events are `PAGE_VIEW`)
- Celebrity / power user effect: one user’s data getting 1000x more traffic than average
- Temporal patterns: partitioning by date and every write goes to today’s partition
Solutions:
- Salting: Add a random suffix to the partition key (`user_id_0`, `user_id_1`, …, `user_id_N`). Distributes writes across N partitions. Reads must query all N and merge.
- Write sharding with read-time aggregation: Write counters to multiple shards, sum at read time.
- Application-level rate limiting: Limit writes to a hot user/entity at the application layer before they hit the data store.
- Adaptive partitioning: Some systems (DynamoDB, Cosmos DB) auto-split hot partitions. Know whether your system supports this.
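The salting trade-off can be sketched in a few lines: writes pick a random salted key, reads fan out over all N salts and merge. The N-way write spread bought at the cost of N-way read amplification is visible directly in the code (`partitions` is a dict standing in for the partitioned store).

```python
import random
from collections import defaultdict

N_SALTS = 4
partitions = defaultdict(list)   # stands in for a partitioned store

def write_event(user_id, event):
    # Spread one hot user's writes across N salted partition keys.
    salt = random.randrange(N_SALTS)
    partitions[f"{user_id}_{salt}"].append(event)

def read_events(user_id):
    # Reads must scatter to all N salted keys and merge the results.
    merged = []
    for salt in range(N_SALTS):
        merged.extend(partitions.get(f"{user_id}_{salt}", []))
    return merged

for i in range(100):
    write_event("hot-user", {"seq": i})
print(len(read_events("hot-user")))  # 100 — all events recovered
```

If ordering matters, the merge step must also sort by a timestamp or sequence number, since interleaving across salts is lost.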
I/O-bound services (waiting for DB, HTTP calls, disk):
- Threads spend most time waiting, not executing
- Horizontal scaling (more instances) helps — each instance handles more requests
- Virtual threads (Java 21) or async I/O reduce the number of threads needed
- Read replicas and caching reduce the wait time per request
CPU-bound services (image processing, ML inference, cryptography, complex computation):
- Threads are executing, not waiting
- More cores = more throughput (vertical scale or more instances)
- Virtual threads don’t help — CPU is the constraint, not thread scheduling
- Consider: offload CPU-intensive work to dedicated workers, GPU instances for ML workloads, precomputation and caching of results
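For the precomputation point, a minimal sketch: memoize an expensive pure computation so repeated requests pay the CPU cost once. `functools.lru_cache` is standard library; `expensive_transform` is a made-up stand-in for real CPU-bound work such as an image resize or inference call.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_transform(n):
    """Stand-in for CPU-bound work. With lru_cache, each distinct
    input is computed only once per process."""
    total = 0
    for i in range(n):
        total += i * i
    return total

expensive_transform(100_000)                  # pays the CPU cost
expensive_transform(100_000)                  # served from cache
print(expensive_transform.cache_info().hits)  # 1
```

This only helps when inputs repeat and the function is pure; for non-repeating inputs, offloading to dedicated workers or GPUs is the lever.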
The choice of scaling metric determines how well autoscaling responds to load.
CPU utilization (most common):
- Works for CPU-bound services
- Lags for I/O-bound services — threads are waiting, CPU is low, but latency is high
- Scale trigger: CPU > 70% → add instances
Request queue depth / pending messages:
- Better for queue consumer workers
- “When the queue has > 1000 messages, add consumers”
- Direct signal that work is backing up
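The queue-depth trigger usually reduces to a proportional formula (KEDA’s queue scalers behave roughly this way): desired consumers equals queue length divided by the target backlog per consumer, clamped to configured bounds. A sketch, with parameter names of my own choosing:

```python
import math

def desired_consumers(queue_depth, target_per_consumer=1000,
                      min_replicas=1, max_replicas=20):
    """Proportional scaling on backlog: one consumer per
    target_per_consumer pending messages, clamped to [min, max]."""
    want = math.ceil(queue_depth / target_per_consumer)
    return max(min_replicas, min(max_replicas, want))

print(desired_consumers(0))       # 1  (never below the floor)
print(desired_consumers(5500))    # 6  (ceil(5500 / 1000))
print(desired_consumers(100000))  # 20 (clamped to the ceiling)
```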
Custom business metrics:
- Scale on “requests in flight” or “P95 latency > 200ms”
- Requires custom metrics export (Prometheus → KEDA, CloudWatch → ASG)
- Most accurate but requires instrumentation
Memory utilization:
- Rarely the right primary scaling metric (memory usage doesn’t track request load the way CPU does)
- Useful as a ceiling alarm (OOM prevention), not a scale trigger
Best practice: For API services, scale on CPU + request rate. For async workers, scale on queue depth. Set minimum instances high enough to handle baseline load without cold-start latency on scale-out. Test autoscaling behavior with load tests — not just at steady state but at scale-up and scale-down transitions.