System Design: LRU and LFU Cache

LRU and LFU cache design is a classic interview question that combines data structures with systems thinking. Most candidates implement the data structure correctly; the EM-level discussion extends into eviction trade-offs, distributed cache behavior, and when to use each. Here’s the complete picture.

LRU Cache (Least Recently Used)

Eviction policy: When the cache is full, evict the item that was accessed least recently.

The insight: Temporal locality. Recently accessed items are more likely to be accessed again soon.
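The LRU eviction policy described above can be sketched in a few lines. This is a minimal, single-threaded illustration that assumes Python's `collections.OrderedDict` as the backing structure; the class name `LRUCache` and its `get`/`put` methods are hypothetical names chosen for the example, not taken from the post.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical LRU cache: the least recently used key is evicted first."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Insertion order doubles as recency order: oldest first, newest last.
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A read counts as an access, so the key becomes the most recent.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict from the front: the entry accessed least recently.
            self._items.popitem(last=False)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now more recent than "b"
cache.put("c", 3)  # cache is full, so "b" is evicted
```

Both `get` and `put` run in O(1) here because `OrderedDict` keeps a hash map and a doubly linked list internally, the same pairing usually drawn on the whiteboard.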

System Design: Rate Limiter

A rate limiter controls how many requests a client can make in a time window. It protects services from abuse, ensures fair usage, and prevents resource exhaustion. The design question tests your understanding of algorithms, distributed systems consistency, and where infrastructure concerns belong.

Requirements

Functional:
- Limit requests per user/API key/IP to N requests per M seconds
- Different limits for different endpoints or tiers (free: 100/hour, paid: 10,000/hour)
- Return 429 Too Many Requests with Retry-After header when limit exceeded
- Soft limits: allow bursting slightly above limit, then throttle
- Limit by: user ID, API key, IP address, or combination

Non-functional:
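As a rough illustration of the "N requests per M seconds" requirement and the 429 plus Retry-After response, here is a minimal fixed-window counter. Fixed window is only one of several possible algorithms and is chosen here for brevity, not because the post prescribes it. The class name `FixedWindowLimiter`, the `check` method, and the injectable `clock` parameter are all assumptions made for this sketch, and the state lives in a single process rather than a shared store.

```python
import time


class FixedWindowLimiter:
    """Hypothetical in-memory limiter: at most `limit` requests per window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock    # injectable so tests can control time
        self._windows = {}    # client_id -> (window_start, request_count)

    def check(self, client_id):
        """Return (status_code, retry_after_seconds) for one request."""
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count < self.limit:
            self._windows[client_id] = (start, count + 1)
            return 200, 0.0
        # Over the limit: tell the client how long until the window resets.
        return 429, start + self.window_seconds - now


limiter = FixedWindowLimiter(limit=100, window_seconds=3600)  # free tier: 100/hour
status, retry_after = limiter.check("api-key-123")
```

The `client_id` can be a user ID, an API key, an IP address, or a tuple combining them, which covers the "limit by" requirement without changing the code.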

System Design: Digital Wallet

A digital wallet (think PayPal, Apple Pay balance, in-app credits) is a system where correctness is non-negotiable. The design challenge: every debit must be atomic, concurrent operations must not cause double-spending, and the balance you show must always be consistent with the transaction history.

Requirements

Functional:
- User has a wallet with a balance
- Deposit funds (from card, bank, another wallet)
- Withdraw funds (to bank, payment for goods/services)
- Transfer between wallets (peer-to-peer)
- View balance and transaction history
- Multi-currency (USD, EUR, etc.
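The three correctness properties named above (atomic debits, no double-spending under concurrency, balance consistent with history) can be shown with a small in-memory sketch. It assumes a single process and uses a `threading.Lock` where a real system would more likely rely on database transactions; the `Wallet` class, its method names, and the `InsufficientFunds` exception are hypothetical. Amounts are integers in minor units (cents) to avoid floating-point rounding.

```python
import threading


class InsufficientFunds(Exception):
    """Raised when a debit would take the balance below zero."""


class Wallet:
    """Hypothetical wallet: every balance change is recorded in the ledger."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ledger = []   # signed amounts in cents, in order of application
        self._balance = 0   # always equals sum(self._ledger)

    def deposit(self, amount_cents):
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        with self._lock:
            self._ledger.append(amount_cents)
            self._balance += amount_cents

    def withdraw(self, amount_cents):
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        # Check and debit happen under one lock, so two concurrent
        # withdrawals cannot both pass the balance check.
        with self._lock:
            if self._balance < amount_cents:
                raise InsufficientFunds(amount_cents)
            self._ledger.append(-amount_cents)
            self._balance -= amount_cents

    def balance(self):
        with self._lock:
            return self._balance

    def history(self):
        with self._lock:
            return list(self._ledger)
```

The key design choice is that the balance check and the ledger write form one critical section; splitting them is what opens the door to double-spending.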

System Design: Fraud Detection System

Fraud detection sits at the intersection of real-time systems, machine learning serving, and operational decision-making. The design challenge is: you need fast decisions (before the transaction completes) with high accuracy (false positives cost customers, false negatives cost money). Here’s how to design for both.

Requirements

Functional:
- Score every transaction for fraud risk before authorization (synchronous, < 200ms)
- Async deeper analysis for flagged transactions (minutes to hours)
- Rule-based engine: “block if transaction > $5000 AND new device AND new country”
- ML scoring: multi-feature risk probability score
- Case management: analysts review flagged cases, mark fraud/not fraud
- Feedback loop: analyst decisions feed back into model training
- Account takeover detection (ATO): suspicious login, device fingerprint, velocity

Non-functional:
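The example rule quoted above ("block if transaction > $5000 AND new device AND new country") maps naturally onto a list of named predicates. The sketch below is one possible shape for such a rule engine, not the post's design: the `Transaction` fields, the `Rule` type, and the `evaluate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Transaction:
    amount_usd: float
    new_device: bool
    new_country: bool


@dataclass(frozen=True)
class Rule:
    name: str
    action: str  # e.g. "block" or "review"
    predicate: Callable[[Transaction], bool]


# The rule from the requirements list, expressed as data plus a predicate.
RULES = [
    Rule(
        name="high_value_new_device_new_country",
        action="block",
        predicate=lambda t: t.amount_usd > 5000 and t.new_device and t.new_country,
    ),
]


def evaluate(txn: Transaction, rules=RULES) -> Tuple[str, Optional[str]]:
    """Return (action, matched_rule_name); the first matching rule wins."""
    for rule in rules:
        if rule.predicate(txn):
            return rule.action, rule.name
    return "allow", None


decision = evaluate(Transaction(amount_usd=7500, new_device=True, new_country=True))
```

Returning the matched rule name alongside the action gives analysts something concrete to review in case management.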

System Design: Payment Gateway

A payment gateway design is one of the highest-stakes system design questions. Every architectural decision has a financial consequence: double charges, lost transactions, or fraud exposure. The interviewer is testing your understanding of consistency, idempotency, and the realities of financial systems.

Requirements

Functional:
- Accept payment instrument (card, bank account, wallet) and charge it
- Support 3DS (3D Secure) authentication flow for card payments
- Async settlement: funds eventually transferred to merchant
- Refunds and partial refunds
- Transaction status API: client can poll for async results
- Webhook notifications on payment state changes
- Idempotent payment requests: retrying doesn’t double-charge

Non-functional:
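The idempotency requirement above ("retrying doesn't double-charge") is commonly met with a client-supplied idempotency key. The following is a minimal in-memory sketch under that assumption; `PaymentService`, `pay`, and `charge_fn` are hypothetical names, and a production system would persist the keys in a durable store rather than a dict.

```python
import threading


class PaymentService:
    """Hypothetical service that charges at most once per idempotency key."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn   # stand-in for the call to the processor
        self._results = {}            # idempotency_key -> (request, result)
        self._lock = threading.Lock()

    def pay(self, idempotency_key, amount_cents, currency):
        request = (amount_cents, currency)
        # Holding the lock during the charge keeps the sketch simple;
        # it serializes all payments, which a real system would avoid.
        with self._lock:
            if idempotency_key in self._results:
                saved_request, result = self._results[idempotency_key]
                if saved_request != request:
                    raise ValueError("idempotency key reused with different parameters")
                # A retry: return the stored result without charging again.
                return result
            result = self._charge_fn(amount_cents, currency)
            self._results[idempotency_key] = (request, result)
            return result


charges = []


def fake_charge(amount_cents, currency):
    charges.append((amount_cents, currency))
    return {"status": "succeeded", "charge_number": len(charges)}


service = PaymentService(fake_charge)
first = service.pay("key-1", 2500, "USD")
retry = service.pay("key-1", 2500, "USD")  # same key, no second charge
```

Rejecting a reused key with different parameters is a deliberate choice in this sketch: silently returning the old result would hide a client bug.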

System Design: URL Shortener

A URL shortener is a classic system design question. It seems simple, but the interviewer is using it to probe your decisions on hashing, database design, caching, and scaling reads. Here’s the complete design.

Requirements

Functional:
- Given a long URL, generate a short code (e.g., bit.ly/abc123)
- Given a short code, redirect to the original URL
- Custom slugs (user-defined: bit.ly/my-company)
- Analytics: click counts, unique visitors, referrer, geo
- Link expiration

Non-functional:
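One common way to produce short codes like the `abc123` example above is to base62-encode a unique numeric ID. The excerpt does not say which scheme the post settles on, so treat this as an assumed approach; the `encode` and `decode` function names and the alphabet ordering are choices made for this sketch.

```python
import string

# 62 symbols: digits, then lowercase, then uppercase (an arbitrary ordering).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode(number: int) -> str:
    """Convert a non-negative integer ID into a base62 short code."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    # Digits come out least significant first, so reverse them.
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Convert a base62 short code back into its integer ID."""
    number = 0
    for ch in code:
        number = number * BASE + ALPHABET.index(ch)
    return number


code = encode(123456789)
original_id = decode(code)
```

With 62 symbols, six characters cover 62^6 = 56,800,235,584 distinct IDs, which is why short codes of that length are usually enough.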

Microservices Patterns: Saga, CQRS, Event Sourcing, BFF, and More

Microservices patterns are the vocabulary of distributed systems design. Knowing when to apply each one, and when not to, separates an architect who reads pattern books from one who’s shipped production systems.

Saga Pattern

Problem: A business transaction spans multiple services, each with its own database. You can’t use a distributed ACID transaction.

Solution: A saga is a sequence of local transactions. Each step publishes an event or triggers the next step.
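A saga as "a sequence of local transactions" can be sketched with a small orchestrator. The excerpt stops before discussing failure handling, so the compensating actions below go beyond the quoted text: they reflect the usual form of the pattern, in which completed steps are undone in reverse order when a later step fails. `run_saga` and the order-flow step names are hypothetical.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, newest first, and report failure.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log = []


def fail_shipping():
    # Simulates the third service rejecting its local transaction.
    raise RuntimeError("no carrier available")


ok = run_saga([
    (lambda: log.append("reserve_inventory"), lambda: log.append("release_inventory")),
    (lambda: log.append("charge_payment"), lambda: log.append("refund_payment")),
    (fail_shipping, lambda: log.append("cancel_shipping")),
])
```

In this run the shipping step fails, so the payment is refunded and the inventory released; the failed step's own compensation never runs because that step never completed.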

Engineering Leadership Trade-offs: Build vs Buy, Tech Debt, and Rewrite vs Refactor

EM interviews often end with “the harder framing”: questions about judgment, decision-making under pressure, and how you navigate disagreement. These don’t have right answers; they have reasoned answers that demonstrate how you think. Here’s a framework for the most common ones.

Build vs Buy

The question sounds simple; the answer has layers. The framework:

Build when:
- This is a core differentiator: it’s what your product does, and doing it better than a vendor is a competitive advantage
- The off-the-shelf solution is a poor fit (you’d spend more customizing than building)
- Data or security requirements make a third-party solution unacceptable (regulated industries, data residency)
- The vendor is a single point of failure for your core business

Buy when:

Data Pipeline and Analytics: OLTP vs OLAP, Batch vs Streaming, CDC

As systems grow, the gap between operational data (what your application uses to run) and analytical data (what your business uses to make decisions) becomes significant. Understanding how to design data pipelines that bridge this gap is an EM-level concern.

OLTP vs OLAP: Fundamentally Different Read Patterns

OLTP (Online Transaction Processing):
- Handles operational workload: your application’s reads and writes
- Optimized for: fast, low-latency reads and writes on individual rows or small sets
- Schema design: normalized (3NF) to minimize write anomalies
- Example queries: “Get user #12345”, “Insert new order”, “Update inventory for SKU ABC”
- Database: PostgreSQL, MySQL, DynamoDB

OLAP (Online Analytical Processing):

Testing Strategy: Test Pyramid, Contract Testing, and Coverage Pragmatics

Testing strategy is an EM-level concern because it directly affects delivery velocity, production reliability, and onboarding speed. Too little testing = production incidents. Too much ceremony = slow CI and frustrated engineers. The goal is the right tests in the right places.

The Test Pyramid for Microservices

The classic test pyramid has unit tests at the base, integration tests in the middle, and end-to-end tests at the top. In microservices, the pyramid shifts slightly because the “integration” layer is where most of the real risk lives.
