How you deploy code is as important as how you write it. The gap between writing a feature and running it reliably in production is where most engineering organizations lose velocity. This post covers the decisions that shape that gap.
Trunk-Based Development vs GitFlow

GitFlow

Long-lived branches: main, develop, feature branches, release branches, hotfix branches. Features are developed on branches, merged to develop, periodically merged into release branches, then to main.
Cloud infrastructure decisions are often more political than technical. The right answer depends on where your team’s expertise is, what your customers require, and what you’re willing to operate. Here’s how to frame these decisions at the EM level.
AWS vs GCP vs Azure: Does It Actually Matter?

For most workloads, the difference between the big three is smaller than the cloud marketing suggests. Compute (VMs, containers, managed Kubernetes) is broadly equivalent.
Security architecture decisions have higher stakes than most — the cost of getting them wrong is a data breach, not a performance degradation. This post covers the trade-offs that come up in EM-level interviews: authentication approaches, identity protocols, and secrets management.
Session-Based vs JWT: The Real Trade-offs

Both are valid. The choice depends on your consistency requirements and architecture.
Session-Based Authentication

The server stores session state. On login, the server creates a session record (in a database or Redis) and sends the client a session cookie.
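A minimal sketch of that flow, using an in-memory dict where production code would use Redis or a database table; all names here are illustrative, not from any specific framework:

```python
import secrets
import time

# In-memory stand-in for the session store (Redis or a DB table in practice).
SESSIONS = {}
SESSION_TTL_SECONDS = 3600

def login(user_id):
    """Create a server-side session record; return the ID to set as the cookie."""
    session_id = secrets.token_urlsafe(32)  # opaque, unguessable token
    SESSIONS[session_id] = {"user_id": user_id, "created_at": time.time()}
    return session_id

def authenticate(session_id):
    """Look up the session on every request; None means not authenticated."""
    session = SESSIONS.get(session_id)
    if session is None:
        return None
    if time.time() - session["created_at"] > SESSION_TTL_SECONDS:
        del SESSIONS[session_id]  # expired: force a fresh login
        return None
    return session["user_id"]

def logout(session_id):
    """Revocation is trivial with server-side state: delete the record."""
    SESSIONS.pop(session_id, None)
```

The property to notice: because the server owns the state, revocation is a simple delete, at the cost of a store lookup on every request.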
Observability is the ability to understand what’s happening inside your system from the outside — from its outputs. The three pillars (logs, metrics, traces) are complementary tools, each answering different questions. Getting the combination right is what separates systems that you can reason about from systems that require tribal knowledge to debug.
Logs vs Metrics vs Traces: What Each Gives You

Logs

Logs are the raw record of events — timestamped, structured or unstructured, per-request or system-level.
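Structured logs are what make the "raw record" queryable. A sketch of a JSON log formatter using only Python's standard library; the field names (`request_id`, `latency_ms`) are hypothetical examples of per-request context:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attach extra fields passed via logger.info(..., extra={...}).
        for key in ("request_id", "user_id", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("checkout completed", extra={"request_id": "req-123", "latency_ms": 87})
```

Unstructured text answers "what happened here"; structured fields let a log pipeline answer "show me every slow checkout for this user" without regexes.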
Reliability isn’t about preventing failures — it’s about building systems that fail gracefully, recover quickly, and maintain user trust even when things go wrong. This post covers the patterns that keep systems running under degraded conditions.
The Resilience Toolkit

Timeout

Set a maximum time to wait for any external call. Without timeouts, a slow dependency causes your threads to pile up waiting, eventually exhausting your thread pool.
- Connection timeout: how long to wait to establish a connection
- Read timeout: how long to wait for data once connected
- Overall timeout: maximum end-to-end time (often the most important)

Common mistake: setting timeouts too tight causes spurious failures under normal latency variance; setting them too loose defeats the purpose and lets slow calls tie up resources.
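Most HTTP clients expose the first two layers directly (requests, for example, accepts a `(connect, read)` timeout tuple), but an overall deadline often has to be layered on top. A sketch of that layering, assuming Python 3.9+ for `cancel_futures`; the function names are illustrative:

```python
import concurrent.futures
import time

def call_with_overall_timeout(fn, timeout_s, *args):
    """Enforce a maximum end-to-end budget around any blocking call."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {timeout_s}s overall budget") from None
    finally:
        # Don't block the caller on a still-running worker thread.
        pool.shutdown(wait=False, cancel_futures=True)

def slow_dependency():
    time.sleep(1)  # stands in for a slow downstream service
    return "ok"
```

Note the limitation of thread-based deadlines: the worker may keep running after the budget is blown, so the downstream call should still carry its own connect and read timeouts.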
Scaling is not a synonym for “add more servers.” Each scaling lever has different costs, trade-offs, and appropriate circumstances. Reaching for the wrong one wastes money, adds complexity, or misses the actual bottleneck.
Vertical vs Horizontal: When Each Makes Sense

Vertical Scaling (Scale Up)

Add more CPU, RAM, or faster storage to the existing instance.
Vertical wins when:
- You’re early stage and operational simplicity matters — one big instance is dramatically easier to operate than a distributed cluster
- The workload is hard to parallelize (stateful, requires shared memory, complex coordination)
- You have a single-node database that can’t shard easily — scaling vertically is often faster and safer than sharding
- The cost per unit of performance is better vertically than horizontally at your current scale
- You have a resource bottleneck (CPU-bound → more cores; memory-bound → more RAM) that’s clearly addressable vertically

Modern cloud instances are powerful.
Consistency and availability trade-offs show up in nearly every system design discussion. The theory (CAP, PACELC) is well-known; the practical application — knowing which choice to make for a specific use case — is what separates a design-literate engineer from one who just quotes theorems.
CAP Theorem: The Actual Claim

CAP states that in the presence of a network partition, a distributed system must choose between Consistency (all nodes see the same data at the same time) and Availability (every request receives a non-error response, though the data it returns may be stale).
The microservices vs monolith debate is one of the most over-indexed topics in software architecture — teams decompose too early, pay operational costs they’re not ready for, and spend months untangling the mess. The decision framework is simpler than the discourse suggests.
Start With the Questions, Not the Conclusion

When a team says “we want to break our monolith into microservices,” the right response isn’t to approve or reject — it’s to ask:
API design decisions have long tails — once you publish an API and clients integrate with it, changing it is expensive. The choice of protocol, versioning strategy, and backwards compatibility approach should be deliberate, not defaults.
REST: The Default Choice and Why It’s Usually Right

REST is HTTP-native — it uses standard verbs (GET, POST, PUT, PATCH, DELETE), status codes, headers, and content negotiation. It’s stateless, cacheable, and every HTTP client in existence can call it.
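To make the verb-and-status-code mapping concrete, here is a minimal sketch using Python's standard-library HTTP server; the `WIDGETS` store and the `/widgets/<id>` route are hypothetical:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy in-memory resource store for the sketch.
WIDGETS = {"1": {"id": "1", "name": "gear"}}

class WidgetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /widgets/<id> -> 200 with the resource, 404 if absent
        widget = WIDGETS.get(self.path.rsplit("/", 1)[-1])
        if widget is None:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(widget).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_DELETE(self):
        # DELETE is idempotent: deleting twice still ends in "absent" (204)
        WIDGETS.pop(self.path.rsplit("/", 1)[-1], None)
        self.send_response(204)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging in the sketch
```

The design point: clients and caches can reason about the request from the verb and the status code alone, with no knowledge of the application.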
The choice between a message queue and an event streaming platform shapes your architecture more than almost any other infrastructure decision. Getting it wrong means rebuilding — not reconfiguring. Here’s how to think through it.
Message Queue vs Event Streaming: The Fundamental Distinction

Get this distinction clear before you pick a product.
Message queue (RabbitMQ, SQS, ActiveMQ):
- A message is a task or command for a consumer
- Typically consumed once — it’s deleted after successful processing
- Consumer drives the pace — pull or push, but once processed, it’s gone
- Good for: work distribution, background job processing, decoupled command execution

Event streaming (Kafka, Kinesis, Google Pub/Sub):
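The destructive-read vs replayable-log difference can be sketched in a few lines. These toy classes are illustrative models of the two delivery styles, not any real broker's API:

```python
from collections import deque

class MessageQueue:
    """Queue semantics: a message is consumed once, then it's gone."""

    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)

    def consume(self):
        # Destructive read: successful processing removes the message.
        return self._messages.popleft() if self._messages else None

class EventLog:
    """Streaming semantics: an append-only log; each consumer tracks its own offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer_id -> next index to read

    def append(self, event):
        self._log.append(event)

    def read(self, consumer_id):
        offset = self._offsets.get(consumer_id, 0)
        if offset >= len(self._log):
            return None
        self._offsets[consumer_id] = offset + 1
        return self._log[offset]  # the event stays in the log for everyone else
```

Because consumption in the log model only advances a per-consumer offset, adding a new consumer (or rewinding an existing one) replays history — the capability that makes migrating off a queue-based design a rebuild rather than a reconfiguration.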