Cloud and Infrastructure: AWS vs GCP vs Azure, Kubernetes vs Serverless
Cloud infrastructure decisions are often more political than technical. The right answer depends on where your team’s expertise is, what your customers require, and what you’re willing to operate. Here’s how to frame these decisions at the EM level.
For most workloads, the difference between the big three is smaller than the cloud marketing suggests. Compute (VMs, containers, managed Kubernetes) is broadly equivalent. Managed databases, object storage, networking — table stakes at all three.
Where the differences are real:
AWS:
- Largest ecosystem of managed services — if it exists as a managed service, AWS probably has it
- Largest community, most third-party tooling, most engineers with AWS experience
- Mature managed Kubernetes (EKS) with deep integration into AWS IAM, VPC networking, and load balancing
- Best track record for exotic instance types (GPU, FPGA, high-memory, ARM)
- The default choice when there’s no other constraint
GCP:
- BigQuery is genuinely differentiated — serverless data warehouse at massive scale with simple pricing
- Kubernetes originated at Google — GKE is polished and often ahead of EKS/AKS on new features
- Strong ML/AI infrastructure (TPUs, Vertex AI) if you’re building AI workloads
- Often less expensive than AWS at scale (especially for networking and egress)
- Smaller enterprise market share, which means fewer engineers with GCP experience to hire
Azure:
- The enterprise default — if your customers are Microsoft shops, Azure Active Directory integration alone drives this choice
- Best for .NET / Windows workloads, SQL Server, Active Directory integration
- Deep GitHub, DevOps, Visual Studio integrations
- Often the winner in regulated industries and government (FedRAMP, compliance certifications)
The EM answer: “Which cloud depends on your team’s existing expertise, your customers’ requirements, and any compliance constraints. For a greenfield startup with no constraints, I’d lean AWS for ecosystem breadth. For an enterprise software company, Azure integrates best with customer environments. For data-heavy or ML-heavy workloads, GCP’s tooling is strong.”
Kubernetes: The industry-standard container orchestration platform. Self-healing, auto-scaling, declarative config.
Kubernetes wins when:
- You have multiple services that benefit from unified orchestration (deployment, scaling, service discovery, configuration)
- Your team has or can build Kubernetes operational expertise
- You need advanced deployment strategies (canary, blue-green via Argo Rollouts)
- You want workload portability (run locally, on-prem, or any cloud)
- You want to add a service mesh, advanced networking, or custom admission controllers
The cost: Kubernetes is complex. The control plane (managed on EKS/GKE/AKS), worker nodes, networking (CNI), storage (CSI), secrets management, ingress, monitoring — each layer requires understanding and maintenance. Managed Kubernetes reduces but doesn’t eliminate this.
The honest guideline: If you have fewer than 5–10 services or a small team, Kubernetes is likely overkill. It pays off at scale or when you have multiple teams deploying independently.
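The declarative model is easiest to see in a manifest. A minimal sketch (the name `web`, the image, and the port are placeholders): you declare the desired state, and the control plane continuously converges reality toward it.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # placeholder name
spec:
  replicas: 3                # self-healing: crashed pods are replaced to keep 3 running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:     # pods failing this check are restarted automatically
            httpGet:
              path: /healthz
              port: 8080
```

Scaling is a one-line change to `replicas` (or an autoscaler adjusting it for you); rollouts, rollbacks, and service discovery follow the same declarative pattern.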
ECS + Fargate: Simpler container orchestration, AWS-proprietary. Run containers on EC2 (ECS on EC2) or fully serverless (AWS Fargate).
ECS + Fargate wins when:
- You’re AWS-native and want the simplest container hosting
- You don’t need Kubernetes features (advanced scheduling, custom CRDs, service mesh)
- You want truly serverless container hosting (Fargate handles infrastructure)
- Your team doesn’t want to manage Kubernetes
Limitation: AWS-only, no portability, and a smaller ecosystem than Kubernetes (no Helm charts, Argo, Tekton, etc.).
AWS Lambda: Code runs on demand. No servers to manage; pay per invocation.
Lambda wins when:
- Event-driven processing — process S3 events, SQS messages, DynamoDB streams, API calls
- Infrequent or highly variable workloads — scales to zero (pay nothing when idle), scales to thousands of concurrent executions instantly
- CLI tools, scheduled jobs — no need for an always-on process
- Startup time is acceptable — typical cold starts are 100–500ms and can be avoided with provisioned concurrency
- Stateless operations — functions are ephemeral; no local state between invocations
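The event-driven pattern above can be sketched with a hypothetical handler for S3 "object created" notifications. Lambda invokes `handler(event, context)`; the event shape follows AWS's documented S3 notification format, but the processing logic here is a placeholder:

```python
# Hypothetical Lambda handler for S3 "object created" events.
# Lambda calls handler(event, context); the event carries one or
# more Records describing the bucket and object that changed.

def handler(event, context):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would fetch and transform the object here (e.g. via boto3).
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}
```

Note the handler is stateless and per-event, matching the list above: nothing survives between invocations, and scaling is just more concurrent invocations.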
Lambda’s limitations:
- Max execution time: 15 minutes per invocation — not for long-running jobs
- Cold start latency: The first invocation (or after a period of inactivity) takes longer. Provisioned concurrency eliminates this but adds cost.
- VPC networking: a Lambda attached to a VPC (e.g., for database access) needs a NAT Gateway for internet-bound traffic, which adds cost and latency
- Observability is harder — function logs are per-invocation; distributed tracing requires explicit instrumentation
- Not for always-on services — if your service has constant traffic, an always-on container is cheaper
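The "always-on is cheaper at constant traffic" point is easy to make concrete with a back-of-envelope comparison. The prices below are illustrative assumptions (roughly in line with published list prices, but check current pricing), not authoritative:

```python
# Back-of-envelope: monthly cost of N requests on Lambda vs. a small
# always-on container. All prices are illustrative assumptions.

LAMBDA_PER_REQUEST = 0.20 / 1_000_000   # assumed $ per request
LAMBDA_PER_GB_SECOND = 0.0000166667     # assumed $ per GB-second of compute
CONTAINER_MONTHLY = 30.0                # assumed $ per month, small always-on container

def lambda_monthly_cost(requests_per_month, avg_duration_s=0.1, memory_gb=0.5):
    """Request + compute cost for a month of Lambda invocations."""
    compute = requests_per_month * avg_duration_s * memory_gb * LAMBDA_PER_GB_SECOND
    requests = requests_per_month * LAMBDA_PER_REQUEST
    return compute + requests

# Low or spiky traffic: Lambda wins because it scales to zero.
# Sustained heavy traffic: the container's flat cost wins.
for rpm in (100_000, 1_000_000, 50_000_000):
    print(f"{rpm:>11,} req/mo: Lambda ${lambda_monthly_cost(rpm):,.2f} vs container ${CONTAINER_MONTHLY:,.2f}")
```

Under these assumptions the crossover sits in the tens of millions of requests per month; the exact point shifts with memory size and duration, which is why the workload shape, not the technology, should drive the choice.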
Cloud Run (GCP): HTTP-based container hosting that scales to zero. A middle ground — you bring your container, Cloud Run handles scaling, including scale-to-zero. Cold starts for containerized workloads are typically shorter than Lambda's.
Virtual machines: Still valid for stateful workloads, self-hosted databases, workloads requiring specific kernel configuration, or when you need maximum control.
When VMs make sense:
- Running databases self-hosted (you need I/O tuning, kernel parameters)
- Workloads requiring low-level performance tuning (huge pages, NUMA awareness, specific kernel versions)
- Legacy applications that can’t be containerized
- When containerization overhead matters (extreme performance workloads)
A service mesh (Istio, Linkerd, Consul Connect) moves cross-cutting concerns out of application code and into the infrastructure layer.
What a service mesh gives you:
- mTLS automatically — every service-to-service call is encrypted and authenticated
- Traffic management — canary deployments, traffic splitting, retries, timeouts at the mesh level (no code changes)
- Observability — automatic metrics and traces for every service-to-service call without application instrumentation
- Circuit breaking and load balancing — at the sidecar level, not in your code
- Authorization policies — “service A is allowed to call service B; service C is not”
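The authorization-policy idea can be sketched with an Istio `AuthorizationPolicy` (a hedged illustration — service names, namespace, and service accounts are placeholders): only the workload identity of service A may call service B, enforced in the sidecar rather than in application code.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: service-b-allow      # placeholder name
  namespace: prod            # placeholder namespace
spec:
  selector:
    matchLabels:
      app: service-b         # policy applies to service B's pods
  action: ALLOW
  rules:
    - from:
        - source:
            # Only service A's mTLS identity (its service account) is allowed;
            # calls from any other identity, including service C, are denied.
            principals: ["cluster.local/ns/prod/sa/service-a"]
```

Because the sidecar authenticates callers via mTLS identities, this policy holds even if service C spoofs headers or IPs.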
The cost:
- Operational complexity — Istio in particular is known for being hard to operate; for teams without the expertise, a misconfigured mesh can cause more production incidents than it prevents.
- Sidecar overhead — each pod gets a sidecar container (Envoy). Small CPU/memory overhead per pod (~50MB, ~5ms per request).
- Debugging complexity — when traffic doesn’t flow correctly, diagnosing mesh config vs app config vs network is non-trivial.
When it’s worth it:
- You have 10+ services with serious cross-cutting concerns (mTLS, traffic management, observability)
- You have a dedicated platform engineering team to operate the mesh
- Compliance requires service-level identity and encryption
- You want canary/blue-green deployments without application code changes
When it’s overkill:
- Small team, few services
- You don’t need all the features — if you just want mTLS, Linkerd is much simpler than Istio
- Your team will spend more time debugging the mesh than building features
The lightweight alternative: Linkerd (much simpler to operate than Istio), or just network policies + mutual TLS at the application level for critical paths.
The case for multi-cloud:
- Avoid vendor lock-in
- Regulatory requirements to use multiple clouds
- Different clouds have genuinely better services for different workloads (GCP for ML + AWS for primary)
- Negotiating leverage with cloud providers
The reality:
- Running workloads across multiple clouds requires abstraction layers (Terraform, Kubernetes) that add complexity
- Managed services (S3, RDS, BigQuery) are cloud-specific — true portability means avoiding them, which means missing significant managed service value
- Most teams who commit to multi-cloud spend significant engineering time on the portability layer, not the product
- Cloud vendor lock-in is real but overestimated — the cost of migration is high but so is the cost of operating two cloud environments
The honest EM answer: “Multi-cloud sounds strategic but is operationally expensive. I’d choose the right cloud for our workload, invest in infrastructure-as-code (Terraform/Pulumi) so we could migrate if forced, and avoid proprietary managed services only where the lock-in risk outweighs the operational simplicity. Using two clouds for genuinely different purposes (e.g., AWS for the product + GCP BigQuery for analytics) is reasonable and different from ‘everything runs on both clouds.’”