System Design: Rate Limiter
A rate limiter controls how many requests a client can make in a time window. It protects services from abuse, ensures fair usage, and prevents resource exhaustion. The design question tests your understanding of algorithms, distributed systems consistency, and where infrastructure concerns belong.
Functional:
- Limit requests per user/API key/IP to N requests per M seconds
- Different limits for different endpoints or tiers (free: 100/hour, paid: 10,000/hour)
- Return `429 Too Many Requests` with a `Retry-After` header when the limit is exceeded
- Soft limits: allow bursting slightly above the limit, then throttle
- Limit by: user ID, API key, IP address, or combination
Non-functional:
- Rate limit decision < 5ms overhead (must not noticeably add latency)
- Accurate across multiple service instances (distributed)
- Graceful degradation if the rate limit store is unavailable (fail open)
- High availability — rate limiter must not be a single point of failure
Fixed window counter:
Divide time into fixed windows (e.g., each minute). Count requests in the current window.
```
Window: 12:00:00 - 12:00:59
Request count: 47/100 → allow
Window: 12:01:00 - 12:01:59 → reset to 0
```
Problem: Boundary spike. A user sends 100 requests at 12:00:58 (within limit) and 100 more at 12:01:00 (new window). In 2 seconds, 200 requests go through, double the intended limit.
When to use: Simple to implement, acceptable when boundary spikes aren’t a concern.
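As an in-process sketch (illustrative Python with made-up names; a real limiter would keep these counters in a shared store and expire old buckets):

```python
import time

class FixedWindowLimiter:
    """Fixed window counter: one count per (key, window start)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (key, window_start) -> request count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        # All requests in the same window map to the same bucket
        window_start = int(now // self.window) * self.window
        bucket = (key, window_start)
        count = self.counts.get(bucket, 0)
        if count >= self.limit:
            return False
        self.counts[bucket] = count + 1
        return True
```

Note that nothing carries over between windows, which is exactly why the boundary spike above is possible.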
Sliding window log:
Store a timestamp for every request in the last window. Count timestamps in the window [now - window_size, now].
```
Timestamps: [12:00:10, 12:00:30, 12:00:45, ...]

On new request at 12:01:15:
  1. Remove all timestamps < 12:00:15 (older than 1 minute)
  2. Count remaining: if < limit → allow and add timestamp
```
Pros: Accurate, no boundary spikes. Cons: Storage is O(requests per window). For 1000 req/minute per user with 10M users = 10B entries. Not scalable for high-rate scenarios.
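A minimal in-process sketch of the log approach (illustrative Python, not a production API; the per-request timestamp storage is visible in the deque):

```python
from collections import deque

class SlidingWindowLog:
    """Stores one timestamp per request: O(requests per window) memory."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = {}  # key -> deque of request timestamps

    def allow(self, key, now):
        log = self.logs.setdefault(key, deque())
        # Evict timestamps that have fallen out of the window
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

Unlike the fixed window, a burst straddling a window boundary is counted correctly, since eviction is relative to `now` rather than to window edges.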
Sliding window counter:
Hybrid approach. Maintain a count for the current window and a weighted count from the previous window.
```
Previous window count: 80 requests
Current window count: 25 requests (30% through the window)
Weighted estimate: 80 * (1 - 0.3) + 25 = 56 + 25 = 81 requests in the "sliding" window
```
This approximates the sliding window with O(1) storage per user. It is widely used in practice (e.g., Cloudflare's rate limiter and many Redis-based implementations).
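A sketch of the weighted-count bookkeeping (illustrative Python; a real implementation would keep the per-key tuples in Redis):

```python
class SlidingWindowCounter:
    """Tracks only (window_start, previous count, current count) per key."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.windows = {}  # key -> (window_start, prev_count, curr_count)

    def allow(self, key, now):
        window_start = int(now // self.window) * self.window
        start, prev, curr = self.windows.get(key, (window_start, 0, 0))
        if window_start != start:
            # Roll over: current count becomes "previous" only if the
            # windows are adjacent; after a longer gap both are stale.
            prev = curr if window_start - start == self.window else 0
            curr = 0
            start = window_start
        fraction = (now - start) / self.window   # how far into the window we are
        estimate = prev * (1 - fraction) + curr  # the weighted estimate above
        if estimate >= self.limit:
            self.windows[key] = (start, prev, curr)
            return False
        self.windows[key] = (start, prev, curr + 1)
        return True
```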
Token bucket:
A “bucket” holds up to N tokens. Tokens are added at a constant rate (refill rate). Each request consumes one token. If the bucket is empty, the request is rejected.
```
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Current tokens: 50

Request arrives:
  tokens > 0 → consume 1 → allow (now 49 tokens)
  tokens = 0 → reject (429)
```
Key property: Allows bursting up to bucket capacity, then smooths to the refill rate. This is usually the desired behavior — a user can make 100 rapid requests as a burst, then is limited to 10/sec ongoing.
Storage per user: Last refill time + current token count. O(1). Efficient.
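The same bookkeeping as an in-process sketch (illustrative Python; this works on a single node only — the Redis implementation that follows is what you would use across instances):

```python
import time

class TokenBucket:
    """Lazy-refill token bucket: state is just (tokens, last_refill)."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()

    def allow(self, now=None, requested=1):
        now = time.time() if now is None else now
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= requested:
            self.tokens -= requested
            return True
        return False
```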
Implementation in Redis:
```lua
-- Lua script (atomic execution in Redis)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])          -- current timestamp (seconds)
local requested = tonumber(ARGV[4])    -- tokens to consume (usually 1)

local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)

if new_tokens >= requested then
  -- Allow: consume tokens
  redis.call("HMSET", key, "tokens", new_tokens - requested, "last_refill", now)
  redis.call("EXPIRE", key, math.ceil(capacity / refill_rate) + 1)
  return 1  -- allowed
else
  -- Reject
  redis.call("HMSET", key, "tokens", new_tokens, "last_refill", now)
  return 0  -- rejected
end
```
Why Lua scripts? Redis executes Lua scripts atomically — no other command can run between the read and write. This prevents race conditions (two concurrent requests both reading 1 token remaining, both allowing, resulting in -1 tokens).
Leaky bucket:
Requests are added to a queue (the “bucket”) and processed at a constant rate. If the bucket overflows, requests are rejected.
Analogy: A bucket with a hole — water drains at a constant rate regardless of how fast it fills.
Difference from token bucket: Token bucket allows bursting (many requests rapidly consume the accumulated tokens). Leaky bucket smooths output to a constant rate — there’s no burst. Good for rate-limiting outgoing requests from your service to external APIs.
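A sketch of the queue variant (illustrative Python; in a real system the drained items would be handed to a worker rather than discarded):

```python
from collections import deque

class LeakyBucket:
    """Bounded queue that drains at a fixed rate; overflow is rejected."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests drained per second
        self.queue = deque()
        self.last_leak = 0.0

    def _leak(self, now):
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # would be processed, not dropped
            self.last_leak = now

    def offer(self, request, now):
        self._leak(now)
        if len(self.queue) >= self.capacity:
            return False  # bucket overflow → reject
        self.queue.append(request)
        return True
```

Note the contrast with the token bucket: output never exceeds `leak_rate`, no matter how bursty the input is.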
Distributed rate limiting:
The hard part: with 10 service instances, all instances need to share rate limit state. If each instance has its own counter, a user can get N * 10 requests through.
Centralized Redis approach:
All instances read/write the same Redis counter. The Lua script above ensures atomicity.
```
Instance 1 ──→
Instance 2 ──→  Redis (shared state) ← Lua script for atomic CAS
Instance 3 ──→
```
Pros: Accurate, relatively simple. Cons: Redis is a single point of failure (mitigate with Redis Cluster/Sentinel). Every request adds ~1ms Redis latency.
Optimistic local approximation:
Each instance maintains a local counter. Periodically (every 100ms) syncs with Redis (adds its local increment, reads the global total). Slightly inaccurate (may allow a few extra requests between syncs) but much lower latency and Redis load.
Pros: Near-zero overhead per request (local counter), reduces Redis traffic by 100x. Cons: Can allow more requests than the limit during sync intervals.
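A toy sketch of this sync pattern (illustrative Python; a plain dict stands in for Redis, and the over-allowance between syncs is deliberately visible):

```python
import time

class LocalSyncLimiter:
    """Local counter, periodically reconciled with a shared store."""

    def __init__(self, limit, shared_store, key, sync_interval=0.1):
        self.limit = limit
        self.store = shared_store          # shared across instances (Redis in reality)
        self.key = key
        self.sync_interval = sync_interval
        self.local_count = 0               # requests allowed since last sync
        self.global_count = shared_store.get(key, 0)
        self.last_sync = time.time()

    def _sync(self, now):
        # Push local increments to the shared store, pull the global total
        self.store[self.key] = self.store.get(self.key, 0) + self.local_count
        self.global_count = self.store[self.key]
        self.local_count = 0
        self.last_sync = now

    def allow(self, now=None):
        now = time.time() if now is None else now
        if now - self.last_sync >= self.sync_interval:
            self._sync(now)
        if self.global_count + self.local_count >= self.limit:
            return False
        self.local_count += 1
        return True
```

Two instances can each allow requests between syncs, so the effective limit is only approximate — exactly the trade-off described above.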
API Gateway / Edge:
- Best for protecting the entire platform
- No code changes in services
- Rate limit by API key, IP, or user header
- Tools: Nginx (limit_req), Kong, AWS API Gateway, Cloudflare
Application layer (per-service):
- Rate limit at more granular level (specific endpoints, specific operations)
- Access to more context (user tier, specific action)
- Tools: Resilience4j `@RateLimiter` or Spring annotations
Edge + application combined: Edge limits handle gross abuse; application limits enforce business rules (e.g., “max 5 password reset emails per hour per user”).
```
Client → CDN (Cloudflare rate limiting: block DDoS)
       → API Gateway (coarse rate limiting: 1000 req/min per API key)
       → Service (fine-grained: 10 password reset req/hour per user)
              │
         Rate Limiter Middleware
              │
         Redis Cluster (shared state, Lua scripts)
```
Redis unavailable:
- Option 1: Fail open — allow all requests. Temporarily remove rate limiting protection.
- Option 2: Fail closed — reject all requests. No requests get through.
- Option 3: Local fallback — use local in-process counter without Redis sync. Approximate rate limiting still active.
Recommendation: Fail open for most rate limiters (availability > strict limiting). For security-sensitive limits (login attempts, password reset), fail closed.
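A sketch of the fail-open/fail-closed wrapper (illustrative Python; `backend` stands in for the Redis-backed check, and the exception type would match your Redis client's connection error):

```python
class FailoverLimiter:
    """Wraps a rate-limit backend with a policy for when the store is down."""

    def __init__(self, backend, fail_open=True):
        self.backend = backend
        self.fail_open = fail_open

    def allow(self, key):
        try:
            return self.backend.allow(key)
        except ConnectionError:
            # Store unreachable: fail open (allow) or fail closed (reject)
            return self.fail_open
```

Per the recommendation above, you would construct this with `fail_open=True` for general limits and `fail_open=False` for login or password-reset limits.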
Race condition without Lua:
```
Read: tokens = 1
  [concurrent request reads: tokens = 1]
Write: tokens = 0
  [concurrent request writes: tokens = 0]
-- Both requests allowed, but only 1 token was available
```
Always use atomic operations (Lua script, Redis INCR with check, or a compare-and-swap loop).
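The same read-check-write hazard and its fix can be shown in-process (illustrative Python): holding one lock across the read and the write plays the role the Lua script plays in Redis.

```python
import threading

class AtomicTokenCounter:
    """Check-and-decrement under a single lock — no interleaving possible."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.lock = threading.Lock()

    def try_consume(self):
        with self.lock:
            if self.tokens > 0:   # read ...
                self.tokens -= 1  # ... and write, as one atomic step
                return True
            return False
```

With one token and two concurrent callers, exactly one succeeds — never both.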
- Token bucket vs leaky bucket: Token bucket = burst + smooth limit (for incoming requests to your API). Leaky bucket = constant outflow rate (for your requests to external APIs). Most user-facing rate limiters use token bucket.
- Header convention: return `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`, and `Retry-After` so clients can implement back-off correctly.
- Sliding window counter vs token bucket: Token bucket is slightly more accurate for burst scenarios. Sliding window counter is conceptually simpler. Both are production-viable.
- Partitioned Redis keys: Rate limit key = `rate_limit:{user_id}:{endpoint}:{window}`. Include the endpoint if limits differ per endpoint. Include the user tier in the limit value lookup.
- At what scale does centralized Redis break? Redis can handle 100k+ operations/second. At 1M req/sec, Redis itself becomes the bottleneck — you need Redis Cluster or the local approximation approach.