Cloud and Infrastructure: AWS vs GCP vs Azure, Kubernetes vs Serverless
Cloud infrastructure decisions are often more political than technical. The right answer depends on where your team’s expertise is, what your customers require, and what you’re willing to operate. Here’s how to frame these decisions at the EM level.
For most workloads, the difference between the big three is smaller than the cloud marketing suggests. Compute (VMs, containers, managed Kubernetes) is broadly equivalent. Managed databases, object storage, networking — table stakes at all three.
Where the differences are real:
AWS:
- Largest ecosystem of managed services — if it exists as a managed service, AWS probably has it
- Largest community, most third-party tooling, most engineers with AWS experience
- Mature managed Kubernetes (EKS) with deep integration into AWS IAM, VPC networking, and load balancing
- Best track record for exotic instance types (GPU, FPGA, high-memory, ARM)
- The default choice when there’s no other constraint
GCP:
- BigQuery is genuinely differentiated — serverless data warehouse at massive scale with simple pricing
- Kubernetes originated at Google — GKE is polished and often ahead of EKS/AKS on new features
- Strong ML/AI infrastructure (TPUs, Vertex AI) if you’re building AI workloads
- Often less expensive than AWS at scale (especially for networking and egress)
- Smaller enterprise market share, which means fewer engineers with GCP experience to hire
Azure:
- The enterprise default — if your customers are Microsoft shops, Azure Active Directory integration alone drives this choice
- Best for .NET / Windows workloads, SQL Server, Active Directory integration
- Deep GitHub, DevOps, Visual Studio integrations
- Often the winner in regulated industries and government (FedRAMP, compliance certifications)
The EM answer: “Which cloud depends on your team’s existing expertise, your customers’ requirements, and any compliance constraints. For a greenfield startup with no constraints, I’d lean AWS for ecosystem breadth. For an enterprise software company, Azure integrates best with customer environments. For data-heavy or ML-heavy workloads, GCP’s tooling is strong.”
Kubernetes: The industry-standard container orchestration platform. Self-healing, auto-scaling, declarative config.
Kubernetes wins when:
- You have multiple services that benefit from unified orchestration (deployment, scaling, service discovery, configuration)
- Your team has or can build Kubernetes operational expertise
- You need advanced deployment strategies (canary, blue-green via Argo Rollouts)
- You want workload portability (run locally, on-prem, or any cloud)
- You want to add a service mesh, advanced networking, or custom admission controllers
The cost: Kubernetes is complex. The control plane (managed on EKS/GKE/AKS), worker nodes, networking (CNI), storage (CSI), secrets management, ingress, monitoring — each layer requires understanding and maintenance. Managed Kubernetes reduces but doesn’t eliminate this.
The honest guideline: If you have fewer than 5–10 services or a small team, Kubernetes is likely overkill. It pays off at scale or when you have multiple teams deploying independently.
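The declarative model is easiest to see in a manifest. A minimal sketch (the name `web`, the image, and the port are placeholders): you declare the desired state, and the control plane continuously converges reality toward it.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # placeholder name
spec:
  replicas: 3                # self-healing: crashed pods are replaced to keep 3 running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          livenessProbe:     # pods failing this check are restarted automatically
            httpGet:
              path: /healthz
              port: 8080
```

Scaling is a one-line change to `replicas` (or an autoscaler adjusting it for you); rollouts, rollbacks, and service discovery follow the same declarative pattern.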
ECS + Fargate: Simpler container orchestration, AWS-proprietary. Run containers on EC2 (ECS on EC2) or fully serverless (AWS Fargate).
ECS + Fargate wins when:
- You’re AWS-native and want the simplest container hosting
- You don’t need Kubernetes features (advanced scheduling, custom CRDs, service mesh)
- You want truly serverless container hosting (Fargate handles infrastructure)
- Your team doesn’t want to manage Kubernetes
Limitation: AWS-only, no portability, and a smaller ecosystem than Kubernetes (no Helm charts, Argo, Tekton, etc.).
AWS Lambda: Code runs on demand. No servers to manage; pay per invocation.
Lambda wins when:
- Event-driven processing — process S3 events, SQS messages, DynamoDB streams, API calls
- Infrequent or highly variable workloads — scales to zero (pay nothing when idle), scales to thousands of concurrent executions instantly
- CLI tools, scheduled jobs — no need for an always-on process
- Startup time is acceptable — typical cold starts are 100–500ms and can be avoided with provisioned concurrency
- Stateless operations — functions are ephemeral; no local state between invocations
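The event-driven pattern above can be sketched with a hypothetical handler for S3 "object created" notifications. Lambda invokes `handler(event, context)`; the event shape follows AWS's documented S3 notification format, but the processing logic here is a placeholder:

```python
# Hypothetical Lambda handler for S3 "object created" events.
# Lambda calls handler(event, context); the event carries one or
# more Records describing the bucket and object that changed.

def handler(event, context):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would fetch and transform the object here (e.g. via boto3).
        processed.append(f"s3://{bucket}/{key}")
    return {"processed": processed}
```

Note the handler is stateless and per-event, matching the list above: nothing survives between invocations, and scaling is just more concurrent invocations.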
Lambda’s limitations:
- Max execution time: 15 minutes per invocation — not for long-running jobs
- Cold start latency: The first invocation (or after a period of inactivity) takes longer. Provisioned concurrency eliminates this but adds cost.
- VPC networking: a Lambda attached to a VPC (e.g., for database access) needs a NAT Gateway for internet-bound traffic, which adds cost and latency
- Observability is harder — function logs are per-invocation; distributed tracing requires explicit instrumentation
- Not for always-on services — if your service has constant traffic, an always-on container is cheaper
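The "always-on is cheaper at constant traffic" point is easy to make concrete with a back-of-envelope comparison. The prices below are illustrative assumptions (roughly in line with published list prices, but check current pricing), not authoritative:

```python
# Back-of-envelope: monthly cost of N requests on Lambda vs. a small
# always-on container. All prices are illustrative assumptions.

LAMBDA_PER_REQUEST = 0.20 / 1_000_000   # assumed $ per request
LAMBDA_PER_GB_SECOND = 0.0000166667     # assumed $ per GB-second of compute
CONTAINER_MONTHLY = 30.0                # assumed $ per month, small always-on container

def lambda_monthly_cost(requests_per_month, avg_duration_s=0.1, memory_gb=0.5):
    """Request + compute cost for a month of Lambda invocations."""
    compute = requests_per_month * avg_duration_s * memory_gb * LAMBDA_PER_GB_SECOND
    requests = requests_per_month * LAMBDA_PER_REQUEST
    return compute + requests

# Low or spiky traffic: Lambda wins because it scales to zero.
# Sustained heavy traffic: the container's flat cost wins.
for rpm in (100_000, 1_000_000, 50_000_000):
    print(f"{rpm:>11,} req/mo: Lambda ${lambda_monthly_cost(rpm):,.2f} vs container ${CONTAINER_MONTHLY:,.2f}")
```

Under these assumptions the crossover sits in the tens of millions of requests per month; the exact point shifts with memory size and duration, which is why the workload shape, not the technology, should drive the choice.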
Cloud Run (GCP): HTTP-based container hosting that scales to zero. A middle ground — you bring your container, Cloud Run handles scaling, including scale-to-zero. Cold starts for containerized workloads are typically shorter than Lambda's.
Virtual machines: Still valid for stateful workloads, self-hosted databases, workloads requiring specific kernel configuration, or when you need maximum control.
When VMs make sense:
- Running databases self-hosted (you need I/O tuning, kernel parameters)
- Workloads requiring low-level performance tuning (huge pages, NUMA awareness, specific kernel versions)
- Legacy applications that can’t be containerized
- When containerization overhead matters (extreme performance workloads)
A service mesh (Istio, Linkerd, Consul Connect) moves cross-cutting concerns out of application code and into the infrastructure layer.
What a service mesh gives you:
- mTLS automatically — every service-to-service call is encrypted and authenticated
- Traffic management — canary deployments, traffic splitting, retries, timeouts at the mesh level (no code changes)
- Observability — automatic metrics and traces for every service-to-service call without application instrumentation
- Circuit breaking and load balancing — at the sidecar level, not in your code
- Authorization policies — “service A is allowed to call service B; service C is not”
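The authorization-policy idea can be sketched with an Istio `AuthorizationPolicy` (a hedged illustration — service names, namespace, and service accounts are placeholders): only the workload identity of service A may call service B, enforced in the sidecar rather than in application code.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: service-b-allow      # placeholder name
  namespace: prod            # placeholder namespace
spec:
  selector:
    matchLabels:
      app: service-b         # policy applies to service B's pods
  action: ALLOW
  rules:
    - from:
        - source:
            # Only service A's mTLS identity (its service account) is allowed;
            # calls from any other identity, including service C, are denied.
            principals: ["cluster.local/ns/prod/sa/service-a"]
```

Because the sidecar authenticates callers via mTLS identities, this policy holds even if service C spoofs headers or IPs.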
The cost:
- Operational complexity — Istio in particular is known for being hard to operate; for teams without the expertise, a misconfigured mesh can cause more production incidents than it prevents.
- Sidecar overhead — each pod gets a sidecar container (Envoy). Small CPU/memory overhead per pod (~50MB, ~5ms per request).
- Debugging complexity — when traffic doesn’t flow correctly, diagnosing mesh config vs app config vs network is non-trivial.
When it’s worth it:
- You have 10+ services with serious cross-cutting concerns (mTLS, traffic management, observability)
- You have a dedicated platform engineering team to operate the mesh
- Compliance requires service-level identity and encryption
- You want canary/blue-green deployments without application code changes
When it’s overkill:
- Small team, few services
- You don’t need all the features — if you just want mTLS, Linkerd is much simpler than Istio
- Your team will spend more time debugging the mesh than building features
The lightweight alternative: Linkerd (much simpler to operate than Istio), or just network policies + mutual TLS at the application level for critical paths.
The case for multi-cloud:
- Avoid vendor lock-in
- Regulatory requirements to use multiple clouds
- Different clouds have genuinely better services for different workloads (GCP for ML + AWS for primary)
- Negotiating leverage with cloud providers
The reality:
- Running workloads across multiple clouds requires abstraction layers (Terraform, Kubernetes) that add complexity
- Managed services (S3, RDS, BigQuery) are cloud-specific — true portability means avoiding them, which means missing significant managed service value
- Most teams who commit to multi-cloud spend significant engineering time on the portability layer, not the product
- Cloud vendor lock-in is real but overestimated — the cost of migration is high but so is the cost of operating two cloud environments
The honest EM answer: “Multi-cloud sounds strategic but is operationally expensive. I’d choose the right cloud for our workload, invest in infrastructure-as-code (Terraform/Pulumi) so we could migrate if forced, and avoid proprietary managed services only where the lock-in risk outweighs the operational simplicity. Using two clouds for genuinely different purposes (e.g., AWS for the product + GCP BigQuery for analytics) is reasonable and different from ‘everything runs on both clouds.’”