Scaling is not a synonym for “add more servers.” Each scaling lever has different costs, trade-offs, and appropriate circumstances. Reaching for the wrong one wastes money, adds complexity, or misses the actual bottleneck.
Vertical vs Horizontal: When Each Makes Sense #
Vertical Scaling (Scale Up) #
Add more CPU, RAM, or faster storage to the existing instance.
Vertical wins when:
- You’re early stage and operational simplicity matters — one big instance is dramatically easier to operate than a distributed cluster
- The workload is hard to parallelize (stateful, requires shared memory, complex coordination)
- You have a single-node database that can’t shard easily — scaling vertical is often faster and safer than sharding
- The cost per unit of performance is better vertical than horizontal at your current scale
- You have a resource bottleneck (CPU-bound → more cores; memory-bound → more RAM) that’s clearly addressable vertically
Modern cloud instances are powerful. An r7g.16xlarge on AWS has 64 vCPUs and 512GB RAM. Many “distributed systems” problems are actually premature — a single well-specced Postgres instance handles more than teams think.
Vertical ceiling: Every instance family has a maximum size. When you hit it, horizontal is the only option. Also, vertical scaling usually requires a restart (brief downtime) to resize the instance.
Horizontal Scaling (Scale Out) #
Add more instances behind a load balancer. The application must be stateless (or state must be externalized — Redis for sessions, S3 for uploads, DB for everything else).
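Externalizing state is the prerequisite for scale-out. A minimal sketch of the idea, with an in-memory `SessionStore` standing in for Redis (the class and handler names are illustrative, not from any framework):

```python
# Sketch: externalizing session state so any instance can serve any request.
# SessionStore stands in for Redis; the handler itself keeps no state.

class SessionStore:
    """In-memory stand-in for an external store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session


def handle_request(store, session_id):
    # Any instance behind the load balancer runs this same code: the
    # session lives in the external store, not in process memory.
    session = store.get(session_id) or {"views": 0}
    session["views"] += 1
    store.put(session_id, session)
    return session["views"]
```

Because `handle_request` reads and writes only through the store, two requests from the same user can land on different instances and still see consistent session state.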
Horizontal wins when:
- The workload is parallelizable and stateless
- You need high availability (if one instance dies, others serve traffic)
- You’ve exhausted or are close to the vertical ceiling
- You have autoscaling requirements (scale in/out dynamically with traffic)
- Different components need to scale independently (API tier vs worker tier)
The Scaling Order: What to Reach for First #
Given a scaling bottleneck, apply in this order. Each step costs less in complexity than the next.
1. Optimize first — profiling often reveals the real bottleneck.
Missing index? N+1 query? Over-fetching? Fix it.
2. Vertical scaling — upgrade the instance. No code changes.
3. Caching — eliminate the bottleneck entirely for reads.
A cache hit costs ~1ms vs 50ms DB query.
4. Read replicas — distribute read traffic.
Works for read-heavy workloads (most are).
5. Connection pooling — PgBouncer/Hikari tuning.
Often the bottleneck before the DB itself.
6. Asynchronous processing — offload work.
Non-critical writes → queue → worker → DB.
7. Horizontal scaling of the application tier.
Stateless services scale easily. Add pods.
8. Database sharding or distributed DB.
Last resort. High complexity, high operational cost.

Don’t skip to step 8 because you’ve heard “at scale we’ll need sharding.” Most systems never reach that scale. Over-engineering for 10x-100x future load is the most common scaling mistake.
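Step 1 is worth a concrete illustration. The sketch below uses an in-memory SQLite database and a hypothetical users/orders schema to show the classic N+1 pattern next to its single-query fix:

```python
import sqlite3

# Hypothetical schema: users and their orders.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

def totals_n_plus_one(db):
    # N+1: one query for the users, then one query per user.
    totals = {}
    for (uid,) in db.execute("SELECT id FROM users"):
        (t,) = db.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()
        totals[uid] = t
    return totals

def totals_single_query(db):
    # Fixed: one aggregate query with a join. One round trip.
    rows = db.execute(
        "SELECT u.id, COALESCE(SUM(o.total), 0) "
        "FROM users u LEFT JOIN orders o ON o.user_id = u.id "
        "GROUP BY u.id"
    )
    return dict(rows)
```

Both return the same totals, but the first issues N+1 round trips and the second issues one. On a real network, per-round-trip latency dominates, so the fix often beats any amount of added hardware.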
Read Replicas vs Caching vs Sharding #
Read Replicas:
- Copies of the database that serve reads. Primary handles writes.
- Eventually consistent — replicas lag behind the primary (usually milliseconds, can be more under heavy load)
- Works well when: most queries are reads, you don’t need read-after-write consistency on all reads
- Cost: you pay for the replica instance. With Aurora you pay per read replica.
- Limitation: writes still bottleneck at the primary
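The read-after-write caveat usually lives in a routing layer. A minimal sketch, with strings standing in for real connection pools (the class and method names are illustrative):

```python
import itertools

# Sketch: route writes to the primary, round-robin reads across replicas.
# The backends here are strings; in practice they are connection pools.

class ReadWriteRouter:
    def __init__(self, primary, replicas):
        self._primary = primary
        self._replicas = itertools.cycle(replicas)

    def backend_for(self, sql, needs_read_after_write=False):
        # Writes, and reads that need read-after-write consistency, go to
        # the primary; everything else goes to a replica, which may lag.
        if needs_read_after_write or not sql.lstrip().upper().startswith("SELECT"):
            return self._primary
        return next(self._replicas)
```

The `needs_read_after_write` flag is the important part: “show me the profile I just edited” must hit the primary, while “browse the catalog” can tolerate replica lag.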
Caching:
- Eliminates DB reads entirely for frequently accessed, cacheable data
- Hit rate is the key metric — aim for > 90% for it to be worth it
- Works well for: lookup data, computed results, session data, anything where the same query is repeated
- Cost: Redis instance + cache invalidation complexity
- Caching before read replicas often makes more sense — a cache hit is faster than a replica query, and the operational complexity is similar
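The standard pattern here is cache-aside with a TTL, instrumented for the hit rate the text calls the key metric. A sketch, with a dict standing in for Redis and `load_fn` standing in for the slow DB query:

```python
import time

# Sketch of cache-aside with a TTL and a hit-rate counter.
# The dict stands in for Redis; load_fn is the (slow) DB query.

class CacheAside:
    def __init__(self, load_fn, ttl_seconds=60.0):
        self._load = load_fn
        self._ttl = ttl_seconds
        self._store = {}          # key -> (value, expires_at)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self._load(key)   # miss: fall through to the DB
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

TTL expiry is the simplest invalidation strategy; explicit invalidation on write is more precise but is where most of the “cache invalidation complexity” lives.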
Sharding:
- Horizontal partitioning of the database. Data for user IDs 0-999999 goes to shard 1, 1000000-1999999 to shard 2.
- Enables write scale-out — each shard handles a fraction of the write load
- Massive operational complexity: cross-shard queries don’t exist (or require scatter-gather), resharding is painful, hot shards require rebalancing
- Alternatives to hand-rolled sharding: Citus (Postgres extension), CockroachDB, PlanetScale (MySQL), Vitess
- You probably don’t need this unless you have hundreds of thousands of writes per second
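The range-based routing described above fits in a few lines, which is deceptive — the routing is trivial, and everything around it (resharding, scatter-gather, rebalancing) is the hard part:

```python
# Sketch of the range-based shard routing from the text: user IDs
# 0-999999 on shard 1, 1000000-1999999 on shard 2, and so on.

USERS_PER_SHARD = 1_000_000

def shard_for(user_id, num_shards):
    # Shards are numbered from 1 to match the example above.
    shard = user_id // USERS_PER_SHARD + 1
    if shard > num_shards:
        raise ValueError(f"user_id {user_id} is outside the configured ranges")
    return shard
```

A query scoped to one user hits exactly one shard; a query across users (say, “top spenders this month”) must fan out to every shard and merge — the scatter-gather cost mentioned above.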
The Hot Partition Problem #
In Kafka, DynamoDB, Cassandra, or any partitioned system: a “hot” partition receives disproportionate traffic while others are idle. This creates a bottleneck on a single node regardless of how many nodes you have.
Causes:
- Partitioning by a low-cardinality key (partitioning an events table by `event_type` when 95% of events are `PAGE_VIEW`)
- Celebrity / power user effect: one user’s data getting 1000x more traffic than average
- Temporal patterns: partitioning by date and every write goes to today’s partition
Solutions:
- Salting: Add a random suffix to the partition key (`user_id_0`, `user_id_1`, …, `user_id_N`). Distributes writes across N partitions. Reads must query all N and merge.
- Write sharding with read-time aggregation: Write counters to multiple shards, sum at read time.
- Application-level rate limiting: Limit writes to a hot user/entity at the application layer before they hit the data store.
- Adaptive partitioning: Some systems (DynamoDB, Cosmos DB) auto-split hot partitions. Know whether your system supports this.
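Salting is simple enough to sketch end to end. Here a dict stands in for the partitioned store, writes pick a random suffix, and reads fan out across all N salted keys and merge:

```python
import random
from collections import defaultdict

# Sketch of key salting: writes spread across N sub-partitions of a hot
# key; reads fan out to all N and merge. The dict stands in for the store.

N_SALTS = 4
store = defaultdict(list)  # salted key -> list of events

def write(user_id, event):
    salt = random.randrange(N_SALTS)      # random suffix spreads writes
    store[f"{user_id}_{salt}"].append(event)

def read_all(user_id):
    # Reads must query every salted key and merge the results.
    merged = []
    for salt in range(N_SALTS):
        merged.extend(store[f"{user_id}_{salt}"])
    return merged
```

This is the trade baked into salting: write throughput improves by roughly N, read cost grows by roughly N. It only pays off when the hot key is write-heavy.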
CPU-Bound vs I/O-Bound Scaling #
I/O-bound services (waiting for DB, HTTP calls, disk):
- Threads spend most time waiting, not executing
- Horizontal scaling (more instances) helps — each instance handles more requests
- Virtual threads (Java 21) or async I/O reduces the thread count needed
- Read replicas and caching reduce the wait time per request
CPU-bound services (image processing, ML inference, cryptography, complex computation):
- Threads are executing, not waiting
- More cores = more throughput (vertical scale or more instances)
- Virtual threads don’t help — CPU is the constraint, not thread scheduling
- Consider: offload CPU-intensive work to dedicated workers, GPU instances for ML workloads, precomputation and caching of results
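The “precomputation and caching of results” option is the cheapest of the three when the work is a pure function of its inputs. A minimal sketch using `functools.lru_cache`, with the summation standing in for real CPU-bound work:

```python
from functools import lru_cache

# Sketch of "precompute and cache" for CPU-bound work: an expensive pure
# function is computed once per input and served from memory afterwards.

@lru_cache(maxsize=1024)
def expensive(n):
    # Stand-in for real CPU-bound work (image transform, inference, ...).
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total
```

This only works when results are deterministic and inputs repeat; for unique inputs (e.g. every uploaded image is different), offloading to dedicated workers or GPU instances is the better lever.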
Autoscaling: What Metric to Scale On #
The choice of scaling metric determines how well autoscaling responds to load.
CPU utilization (most common):
- Works for CPU-bound services
- Lags for I/O-bound services — threads are waiting, CPU is low, but latency is high
- Scale trigger: CPU > 70% → add instances
Request queue depth / pending messages:
- Better for queue consumer workers
- “When the queue has > 1000 messages, add consumers”
- Direct signal that work is backing up
Custom business metrics:
- Scale on “requests in flight” or “P95 latency > 200ms”
- Requires custom metrics export (Prometheus → KEDA, CloudWatch → ASG)
- Most accurate but requires instrumentation
Memory utilization:
- Rarely the right primary scaling metric (memory doesn’t correlate with load the same way)
- Useful as a ceiling alarm (OOM prevention), not a scale trigger
Best practice: For API services, scale on CPU + request rate. For async workers, scale on queue depth. Set minimum instances high enough to handle baseline load without cold-start latency on scale-out. Test autoscaling behavior with load tests — not just at steady state but at scale-up and scale-down transitions.
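Most autoscalers, including the Kubernetes HorizontalPodAutoscaler, reduce the metric choice above to one proportional rule: desired = ceil(current_replicas × current_metric / target_metric), clamped to configured bounds. A sketch of that rule, usable with CPU, queue depth, or a custom metric (the bound defaults here are illustrative):

```python
import math

# The proportional scaling rule used by the Kubernetes HPA:
#   desired = ceil(current_replicas * current_metric / target_metric)
# It works with any metric: CPU %, queue depth, requests in flight.

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20):
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to configured bounds; min_replicas covers baseline load
    # so scale-out never starts from zero warm capacity.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 instances at 90% CPU against a 70% target scale out to ceil(4 × 90/70) = 6; 4 queue consumers with 250 pending messages against a 1000-per-consumer target scale in toward the minimum.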