
Scaling Strategies: A Decision Framework

Scaling is not a synonym for “add more servers.” Each scaling lever has different costs, trade-offs, and appropriate circumstances. Reaching for the wrong one wastes money, adds complexity, or misses the actual bottleneck.


Vertical vs Horizontal: When Each Makes Sense

Vertical Scaling (Scale Up)

Add more CPU, RAM, or faster storage to the existing instance.

Vertical wins when:

  • You’re early stage and operational simplicity matters — one big instance is dramatically easier to operate than a distributed cluster
  • The workload is hard to parallelize (stateful, requires shared memory, complex coordination)
  • You have a single-node database that can’t shard easily — scaling vertically is often faster and safer than sharding
  • The cost per unit of performance is better vertically than horizontally at your current scale
  • You have a resource bottleneck (CPU-bound → more cores; memory-bound → more RAM) that’s clearly addressable vertically

Modern cloud instances are powerful. An r7g.16xlarge on AWS has 64 vCPUs and 512 GiB of RAM. Many “distributed systems” problems are actually premature — a single well-specced Postgres instance handles far more load than teams expect.

Vertical ceiling: Every instance has a maximum size. When you hit it, horizontal is the only option. Also, vertical scaling usually requires downtime (resize the instance).

Horizontal Scaling (Scale Out)

Add more instances behind a load balancer. The application must be stateless (or state must be externalized — Redis for sessions, S3 for uploads, DB for everything else).
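Externalizing state is what makes instances interchangeable. A minimal sketch, with a plain dict standing in for a shared Redis (the class and function names are illustrative):

```python
# Sketch: externalized session state so any instance can serve any request.
# The dict stands in for a shared Redis instance that every app instance
# can reach; the request handler keeps no local state.

class SessionStore:
    """Shared session store (Redis stand-in)."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id)

    def put(self, session_id, session):
        self._data[session_id] = session

def handle_request(store, session_id):
    """Any instance behind the load balancer runs this identically:
    no local session state, so instances are interchangeable."""
    session = store.get(session_id) or {"views": 0}
    session["views"] += 1
    store.put(session_id, session)
    return session["views"]

store = SessionStore()
# Two consecutive requests may hit different instances; because the
# session lives in the shared store, both see the same state.
assert handle_request(store, "u1") == 1
assert handle_request(store, "u1") == 2
```

Anything an instance writes to its own disk or memory (sessions, uploads, counters) pins users to that instance and breaks this model.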

Horizontal wins when:

  • The workload is parallelizable and stateless
  • You need high availability (if one instance dies, others serve traffic)
  • You’ve exhausted or are close to the vertical ceiling
  • You have autoscaling requirements (scale in/out dynamically with traffic)
  • Different components need to scale independently (API tier vs worker tier)

The Scaling Order: What to Reach for First

Given a scaling bottleneck, apply in this order. Each step costs less in complexity than the next.

1. Optimize first — profiling often reveals the real bottleneck.
   Missing index? N+1 query? Over-fetching? Fix it.

2. Vertical scaling — upgrade the instance. No code changes.

3. Caching — eliminate the bottleneck entirely for reads.
   A cache hit costs ~1ms vs 50ms DB query.

4. Read replicas — distribute read traffic.
   Works for read-heavy workloads (most are).

5. Connection pooling — PgBouncer/Hikari tuning.
   Often the bottleneck before the DB itself.

6. Asynchronous processing — offload work.
   Non-critical writes → queue → worker → DB.

7. Horizontal scaling of the application tier.
   Stateless services scale easily. Add pods.

8. Database sharding or distributed DB.
   Last resort. High complexity, high operational cost.
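Step 6 is often the least familiar of these. A minimal sketch of the pattern, using `queue.Queue` as a stand-in for a real broker (SQS, RabbitMQ, Kafka):

```python
# Sketch: offload non-critical writes to a queue so the request path
# returns immediately; a background worker drains the queue into the DB.
import queue
import threading

jobs = queue.Queue()
processed = []          # stand-in for the database

def worker():
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut the worker down
            break
        processed.append(job)    # e.g. the actual DB write
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()

def handle_request(event):
    jobs.put(event)              # enqueue and return immediately
    return "accepted"

assert handle_request({"type": "audit_log"}) == "accepted"
jobs.join()                      # wait until the worker has drained the queue
jobs.put(None)
t.join()
assert processed == [{"type": "audit_log"}]
```

The request path no longer pays the cost of the write; the trade is that the write becomes eventually consistent and you now own a queue and a worker fleet.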

Don’t skip to step 8 because you’ve heard “at scale we’ll need sharding.” Most systems never reach that scale. Over-engineering for 10x-100x future load is the most common scaling mistake.


Read Replicas vs Caching vs Sharding

Read Replicas:

  • Copies of the database that serve reads. Primary handles writes.
  • Eventually consistent — replicas lag behind the primary (usually milliseconds, can be more under heavy load)
  • Works well when: most queries are reads, you don’t need read-after-write consistency on all reads
  • Cost: you pay for the replica instance. With Aurora you pay per read replica.
  • Limitation: writes still bottleneck at the primary
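The routing decision usually lives in a thin layer in the application. A sketch with connection objects stubbed as labels (a real router would hold actual DB connections, and many ORMs provide this hook):

```python
# Sketch: send writes to the primary, round-robin reads across replicas.
import itertools

class ReplicatedDB:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin reads

    def connection_for(self, query):
        # Writes (and any read that needs read-after-write consistency)
        # must hit the primary; other reads can tolerate replica lag.
        is_write = query.lstrip().upper().startswith(
            ("INSERT", "UPDATE", "DELETE"))
        return self.primary if is_write else next(self._replicas)

db = ReplicatedDB("primary", ["replica-1", "replica-2"])
assert db.connection_for("INSERT INTO users ...") == "primary"
assert db.connection_for("SELECT * FROM users") == "replica-1"
assert db.connection_for("SELECT * FROM orders") == "replica-2"
```

The subtle part is not the routing but the exceptions to it: a read immediately after the user's own write often has to go to the primary, or the user sees stale data.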

Caching:

  • Eliminates DB reads entirely for frequently accessed, cacheable data
  • Hit rate is the key metric — aim for > 90% for it to be worth it
  • Works well for: lookup data, computed results, session data, anything where the same query is repeated
  • Cost: Redis instance + cache invalidation complexity
  • Caching before read replicas often makes more sense — a cache hit is faster than a replica query, and the operational complexity is similar
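The cache-aside pattern, with the hit-rate metric the bullets above call key, can be sketched as follows (TTL and invalidation omitted for brevity; names are illustrative):

```python
# Sketch: cache-aside with a hit-rate counter. On a miss, fall through
# to the loader (the DB), store the result, and serve from cache next time.

class CacheAside:
    def __init__(self, loader):
        self._cache = {}
        self._loader = loader     # e.g. a function running the DB query
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._loader(key)   # e.g. SELECT ... WHERE id = key
        self._cache[key] = value
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = CacheAside(loader=lambda k: f"row-{k}")
cache.get(1); cache.get(1); cache.get(1); cache.get(2)
assert cache.hit_rate == 0.5   # 2 hits, 2 misses so far
```

If the hit rate stays low, the data isn't cacheable enough and you've bought invalidation complexity for nothing — measure before and after.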

Sharding:

  • Horizontal partitioning of the database. Data for user IDs 0-999999 goes to shard 1, 1000000-1999999 to shard 2.
  • Enables write scale-out — each shard handles a fraction of the write load
  • Massive operational complexity: cross-shard queries don’t exist (or require scatter-gather), resharding is painful, hot shards require rebalancing
  • Alternatives to hand-rolled sharding: Citus (Postgres extension), CockroachDB, PlanetScale (MySQL), Vitess
  • You probably don’t need this unless you have hundreds of thousands of writes per second
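The routing itself is the easy part. A sketch of the range-based scheme from the first bullet (1,000,000 IDs per shard; a real system needs a mutable shard map to survive resharding):

```python
# Sketch: range-based shard routing. Each shard owns a contiguous block
# of 1,000,000 user IDs, matching the example above.

SHARD_SIZE = 1_000_000

def shard_for(user_id, num_shards):
    shard = user_id // SHARD_SIZE
    if shard >= num_shards or user_id < 0:
        raise ValueError("user_id outside the shard map")
    return shard + 1   # shards numbered from 1, as in the text

assert shard_for(42, num_shards=2) == 1
assert shard_for(1_500_000, num_shards=2) == 2
```

Hash-based routing (`hash(user_id) % num_shards`) spreads load more evenly but turns every range scan into a scatter-gather, and changing `num_shards` moves almost every row — which is exactly the operational pain the bullets above warn about.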

The Hot Partition Problem

In Kafka, DynamoDB, Cassandra, or any partitioned system: a “hot” partition receives disproportionate traffic while others are idle. This creates a bottleneck on a single node regardless of how many nodes you have.

Causes:

  • Partitioning by a low-cardinality key (partitioning an events table by event_type when 95% of events are PAGE_VIEW)
  • Celebrity / power user effect: one user’s data getting 1000x more traffic than average
  • Temporal patterns: partitioning by date and every write goes to today’s partition

Solutions:

  • Salting: Add a random suffix to the partition key (user_id_0, user_id_1, …, user_id_N). Distributes writes across N partitions. Reads must query all N and merge.
  • Write sharding with read-time aggregation: Write counters to multiple shards, sum at read time.
  • Application-level rate limiting: Limit writes to a hot user/entity at the application layer before they hit the data store.
  • Adaptive partitioning: Some systems (DynamoDB, Cosmos DB) auto-split hot partitions. Know whether your system supports this.
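Salting is the most mechanical of these fixes. A minimal sketch, with a dict of lists standing in for the partitioned store:

```python
# Sketch: key salting. Writes to a hot key get a random suffix so they
# spread over N partitions; reads must fan out to all N salted keys
# and merge the results.
import random
from collections import defaultdict

N = 4
partitions = defaultdict(list)   # stand-in for a partitioned store

def salted_write(key, value):
    salt = random.randrange(N)
    partitions[f"{key}_{salt}"].append(value)

def salted_read(key):
    # Scatter-gather: query every salted key and merge.
    merged = []
    for salt in range(N):
        merged.extend(partitions.get(f"{key}_{salt}", []))
    return merged

for i in range(100):
    salted_write("hot_user", i)

assert sorted(salted_read("hot_user")) == list(range(100))
assert len(partitions) <= N      # writes spread over at most N partitions
```

The trade is explicit: write throughput scales by N, read cost grows by N. Pick N just large enough to cool the partition.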

CPU-Bound vs I/O-Bound Scaling

I/O-bound services (waiting for DB, HTTP calls, disk):

  • Threads spend most time waiting, not executing
  • Horizontal scaling (more instances) helps — each instance handles more requests
  • Virtual threads (Java 21) or async I/O reduces the thread count needed
  • Read replicas and caching reduce the wait time per request

CPU-bound services (image processing, ML inference, cryptography, complex computation):

  • Threads are executing, not waiting
  • More cores = more throughput (vertical scale or more instances)
  • Virtual threads don’t help — CPU is the constraint, not thread scheduling
  • Consider: offload CPU-intensive work to dedicated workers, GPU instances for ML workloads, precomputation and caching of results
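A common sizing heuristic (often attributed to *Java Concurrency in Practice*) makes the contrast concrete: threads ≈ cores × (1 + wait time / compute time). The function name here is illustrative:

```python
# Sketch of a thread-pool sizing heuristic: for I/O-bound work the wait
# term dominates and many threads per core pay off; for CPU-bound work
# the wait term is ~0 and extra threads buy nothing.

def suggested_threads(cores, wait_ms, compute_ms):
    return round(cores * (1 + wait_ms / compute_ms))

# I/O-bound: 45ms waiting on the DB per 5ms of CPU -> many threads help.
assert suggested_threads(cores=8, wait_ms=45, compute_ms=5) == 80
# CPU-bound: almost no waiting -> thread count ~ core count.
assert suggested_threads(cores=8, wait_ms=0, compute_ms=5) == 8
```

This is a starting point, not a law — measure under real load, since wait/compute ratios shift as downstream systems slow down.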

Autoscaling: What Metric to Scale On

The choice of scaling metric determines how well autoscaling responds to load.

CPU utilization (most common):

  • Works for CPU-bound services
  • Lags for I/O-bound services — threads are waiting, CPU is low, but latency is high
  • Scale trigger: CPU > 70% → add instances

Request queue depth / pending messages:

  • Better for queue consumer workers
  • “When the queue has > 1000 messages, add consumers”
  • Direct signal that work is backing up

Custom business metrics:

  • Scale on “requests in flight” or “P95 latency > 200ms”
  • Requires custom metrics export (Prometheus → KEDA, CloudWatch → ASG)
  • Most accurate but requires instrumentation

Memory utilization:

  • Rarely the right primary scaling metric (memory doesn’t correlate with load the same way)
  • Useful as a ceiling alarm (OOM prevention), not a scale trigger

Best practice: For API services, scale on CPU + request rate. For async workers, scale on queue depth. Set minimum instances high enough to handle baseline load without cold-start latency on scale-out. Test autoscaling behavior with load tests — not just at steady state but at scale-up and scale-down transitions.
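Most autoscalers, including the Kubernetes HPA, apply a proportional rule to whichever metric you pick. A sketch of that rule, clamped to the min/max bounds the paragraph above recommends setting:

```python
# Sketch of the HPA-style proportional scaling rule:
#   desired = ceil(current * currentMetric / targetMetric)
# clamped to [min_replicas, max_replicas]. A min_replicas sized for
# baseline load avoids cold-start latency on the first burst.
import math

def desired_replicas(current, metric, target, min_replicas, max_replicas):
    desired = math.ceil(current * metric / target)
    return max(min_replicas, min(desired, max_replicas))

# CPU at 90% against a 70% target: scale 4 -> 6 instances.
assert desired_replicas(4, metric=90, target=70,
                        min_replicas=2, max_replicas=20) == 6
# Traffic drops away, but the floor holds baseline capacity.
assert desired_replicas(4, metric=10, target=70,
                        min_replicas=2, max_replicas=20) == 2
```

Note how the formula explains the lag warning above: for an I/O-bound service, CPU stays low while latency climbs, so `metric / target` never rises and the autoscaler never fires — which is why the scaling metric must track the actual constraint.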