
Observability: Logs, Metrics, Traces, and Alerting

Observability is the ability to understand what’s happening inside your system from the outside — from its outputs. The three pillars (logs, metrics, traces) are complementary tools, each answering different questions. Getting the combination right is what separates systems that you can reason about from systems that require tribal knowledge to debug.


Logs vs Metrics vs Traces: What Each Gives You

Logs

Logs are the raw record of events — timestamped, structured or unstructured, per-request or system-level.

What logs answer: “What exactly happened at time T in service S?” Detailed, contextual, narrative.

Structured logs: JSON-formatted logs (vs plain text) make logs queryable and filterable at scale. With plain text, you need regex. With structured logs, you query fields: service=checkout AND user_id=123 AND level=ERROR.

The ideal log statement includes:

  • Timestamp (ISO 8601, UTC)
  • Service name, instance ID
  • Trace ID and Span ID (for correlation with traces)
  • Log level (DEBUG/INFO/WARN/ERROR)
  • Message
  • Contextual fields (user_id, order_id, request_id)
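A log statement with these fields serializes naturally to one JSON line. A minimal sketch using only the JDK (field values like `checkout-7d9f` are illustrative; a real service would use its logging framework's JSON encoder):

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class StructuredLog {
    // Build one JSON log line from ordered key/value pairs.
    static String jsonLine(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", Instant.now().toString()); // ISO 8601, UTC
        fields.put("service", "checkout");
        fields.put("instance_id", "checkout-7d9f");        // hypothetical instance
        fields.put("trace_id", "4bf92f3577b34da6");
        fields.put("span_id", "00f067aa0ba902b7");
        fields.put("level", "ERROR");
        fields.put("message", "payment authorization failed");
        fields.put("user_id", "123");
        System.out.println(jsonLine(fields));
    }
}
```

Every field here becomes queryable: `service=checkout AND user_id=123 AND level=ERROR` is a field lookup, not a regex.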

What logs don’t give you: Aggregated views, trends, performance over time. Searching logs at scale is slow and expensive.

Metrics

Metrics are numeric measurements over time — counters, gauges, histograms. Designed for aggregation and trending.

What metrics answer: “How is the system performing right now, and how does it compare to yesterday?” Quantitative, aggregatable, cheap to store (numbers, not text).

The four golden signals (Google SRE):

  1. Latency: Time to serve a request (differentiate successful vs error latency)
  2. Traffic: Volume of requests (rps, tps)
  3. Errors: Rate of failed requests
  4. Saturation: How “full” the service is (CPU %, queue depth, connection pool usage)

Histograms vs averages: Average latency hides the tail. P95 and P99 tell the real story. A system with p50 latency of 10ms and p99 of 2000ms has a serious problem the average doesn’t reveal. Always alert on and discuss percentiles.
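The point is easy to see numerically. A sketch with synthetic latencies (nearest-rank percentiles; the 95/5 split is an illustrative assumption):

```java
import java.util.Arrays;

public class Percentiles {
    // Nearest-rank percentile over a sorted sample.
    static long percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        // 95 fast requests at 10ms, 5 stragglers at 2000ms
        long[] latencies = new long[100];
        Arrays.fill(latencies, 10);
        for (int i = 95; i < 100; i++) latencies[i] = 2000;
        Arrays.sort(latencies);

        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.1fms p50=%dms p99=%dms%n",
                avg, percentile(latencies, 50), percentile(latencies, 99));
        // avg = 109.5ms looks tolerable; p99 = 2000ms exposes the tail
    }
}
```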

Micrometer: The standard metrics facade for Java/Spring Boot. Code emits metrics once; you plug in any backend (Prometheus, Datadog, CloudWatch) via a dependency. Never write System.out.println("count: " + count) for metrics — use a proper metrics library.

Traces

Traces follow a request across multiple services — a single logical operation broken into spans, each representing work in one service or component.

What traces answer: “Where in this multi-service chain did my request spend its time, and which service caused the latency?”

Request (total: 450ms)
├── API Gateway (5ms)
├── UserService (15ms)
├── OrderService (300ms)
│   ├── DB query (280ms) ← the bottleneck
│   └── Cache lookup (20ms)
└── NotificationService (130ms) ← async, not in critical path

Without traces, you’d know the overall request was slow (from metrics) but not which service or operation caused it.
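Conceptually, the tree above is just spans with durations, and finding the bottleneck means ranking them by time consumed. A toy sketch (span names taken from the example; real tracing backends do this per trace automatically):

```java
import java.util.Comparator;
import java.util.List;

public class TraceBottleneck {
    record Span(String name, long durationMs) {}

    // Return the span that consumed the most time.
    static Span slowest(List<Span> spans) {
        return spans.stream()
                .max(Comparator.comparingLong(Span::durationMs))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Span> children = List.of(
                new Span("API Gateway", 5),
                new Span("UserService", 15),
                new Span("OrderService/DB query", 280),
                new Span("OrderService/Cache lookup", 20),
                new Span("NotificationService", 130));
        System.out.println("bottleneck: " + slowest(children).name());
        // prints: bottleneck: OrderService/DB query
    }
}
```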

Implementation: OpenTelemetry is the standard — vendor-neutral instrumentation. Spring Boot 3 auto-instruments common operations (HTTP requests, JDBC queries, Redis calls). Export to Jaeger, Tempo, Zipkin, or commercial APMs (Datadog APM, New Relic).

When is tracing worth the cost? Almost always in production microservices. The instrumentation overhead is < 1% CPU/memory for typical workloads. The debugging time saved on the first production incident more than pays for the setup cost. The question isn’t whether to trace — it’s which backend to use.


Debugging a Slow Service When No Alerts Are Firing

This is a common interview question. Your systematic approach:

1. Is this a p50, p95, or p99 problem? Check latency percentiles. If p50 is fine but p99 is bad, it’s intermittent — probably GC pause, lock contention, or specific request patterns. If p50 is bad, it’s systematic.

2. Check the four golden signals for the service itself and its dependencies:

  • Is traffic volume normal?
  • Is error rate elevated (even slightly)?
  • Is saturation high (thread pool, DB connection pool, CPU)?

3. Look at traces for slow requests. Where is the time going? Which span is long?

4. Check downstream dependencies. Service is slow because the DB is slow? Check DB query time, lock waits, replication lag. Cache is slow? Check Redis latency and hit rate.

5. Correlate with deployments. Did someone deploy in the last hour? Check the diff.

6. Infrastructure-level signals. Is this one pod or all pods? (One pod points to an instance-specific problem — a bad node, or GC trouble on that JVM.) Is there a correlation with time of day or traffic pattern?

7. JVM-specific for Java services. GC logs — are there long pauses? Thread dump — are threads blocked on something? Heap profiler — is memory pressure causing thrashing?
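For step 7, a thread dump doesn't require external tooling — the JDK can produce one in-process via `ThreadMXBean`. A sketch that flags BLOCKED threads (purely illustrative; in a real incident you'd pair this with `jstack <pid>` for full stack traces):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpCheck {
    // Count threads currently blocked waiting to acquire a monitor.
    static long blockedCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = mx.dumpAllThreads(false, false);
        long blocked = 0;
        for (ThreadInfo info : infos) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("blocked threads: " + blockedCount());
    }
}
```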


What to Alert On: Good vs Bad Alerts

The alert quality test: If the alert fires at 3am, should a human wake up to handle it? If yes, it’s a good alert. If it can wait until morning or is often a false positive, it shouldn’t page.

Good alerts:

  • Error rate > 1% for > 5 minutes (user-visible impact)
  • P99 latency > SLO for > 5 minutes
  • Availability check fails (the service returns errors or is unreachable)
  • Queue consumer lag growing for > 10 minutes (work is backing up)
  • DLQ depth > 0 (poison messages need investigation)
  • Certificate expiry < 14 days (proactive, not reactive)
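The "for > 5 minutes" qualifier matters: a good alert fires only when the threshold is breached continuously, not on a single bad scrape. A sketch of that sustained-condition logic (window size and sample interval are assumptions):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SustainedAlert {
    private final double threshold;
    private final int requiredSamples; // e.g. 5 one-minute samples
    private final Deque<Double> window = new ArrayDeque<>();

    SustainedAlert(double threshold, int requiredSamples) {
        this.threshold = threshold;
        this.requiredSamples = requiredSamples;
    }

    // Feed one error-rate sample; return true only when every sample in
    // the full window breaches the threshold (i.e. the breach is sustained).
    boolean record(double errorRate) {
        window.addLast(errorRate);
        if (window.size() > requiredSamples) {
            window.removeFirst();
        }
        return window.size() == requiredSamples
                && window.stream().allMatch(r -> r > threshold);
    }

    public static void main(String[] args) {
        SustainedAlert alert = new SustainedAlert(0.01, 5);
        // One bad minute does not page...
        System.out.println(alert.record(0.05)); // false
        // ...five consecutive bad minutes do.
        boolean fired = false;
        for (int i = 0; i < 4; i++) fired = alert.record(0.05);
        System.out.println(fired); // true
    }
}
```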

Bad alerts:

  • CPU > 80% (resource metrics without user impact — just because CPU is high doesn’t mean users are affected)
  • “Server restarted” (if autoscaling or Kubernetes restarts are expected, this is noise)
  • Alerts without a clear remediation action (“what do I do if this fires?”)
  • Alerts that fire constantly and get ignored (alert fatigue — worse than no alerts)
  • Very tight thresholds that fire on minor blips

Symptom-based vs cause-based alerts:

  • Symptom-based (recommended): “Users can’t complete checkout” — fires when the user-observable outcome is broken
  • Cause-based: “DB connection pool > 90%” — may or may not mean users are affected

Alert on symptoms. Use cause-based metrics as diagnostic tools to investigate why the symptom alert fired.


Distributed Tracing: When Is It Worth It?

It’s almost always worth it for microservices. The specific scenarios where it’s indispensable:

  • Latency debugging — identifying which service in a 10-service chain caused a slowdown
  • Error propagation — understanding how an error in a downstream service surfaces to the user
  • Dependency mapping — discovering which services actually call which (as opposed to what the architecture diagram says)
  • SLO breakdown — attributing latency budget to specific services/operations

The cost:

  • Instrumentation time (~1 sprint to set up, less for Spring Boot 3 which auto-instruments)
  • Sampling strategy needed at scale — tracing every request is expensive. Sample 10% normally, 100% for errors and slow requests (tail-based sampling).
  • Storage cost for traces — traces are large compared to metrics. Retention is typically 7–30 days.
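The sampling rule above amounts to a decision made once the trace is complete. A sketch of that decision function (thresholds are assumptions; real collectors implement this as a tail-sampling processor, not hand-rolled code):

```java
public class TailSampler {
    private final double baseRate;      // e.g. 0.10 -> keep 10% of normal traces
    private final long slowThresholdMs; // traces slower than this are always kept

    TailSampler(double baseRate, long slowThresholdMs) {
        this.baseRate = baseRate;
        this.slowThresholdMs = slowThresholdMs;
    }

    // Decide after the trace completes: always keep errors and slow traces;
    // keep a deterministic fraction of the rest. Hashing the trace ID means
    // every span of a given trace gets the same decision.
    boolean keep(String traceId, long durationMs, boolean hadError) {
        if (hadError || durationMs > slowThresholdMs) {
            return true; // 100% for errors and slow requests
        }
        int bucket = Math.floorMod(traceId.hashCode(), 100);
        return bucket < (int) (baseRate * 100);
    }

    public static void main(String[] args) {
        TailSampler sampler = new TailSampler(0.10, 1000);
        System.out.println(sampler.keep("abc123", 5000, false)); // true: slow
        System.out.println(sampler.keep("abc123", 50, true));    // true: error
    }
}
```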

OpenTelemetry collector: The standard deployment pattern is to run an OTel Collector sidecar or DaemonSet in Kubernetes. Services emit spans to the collector; the collector batches and forwards to your backend. This decouples your application from the specific tracing backend.
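A minimal collector pipeline for that pattern might look like the following sketch (the `tempo:4317` endpoint is an assumption — substitute your backend's address):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}                # batch spans before export

exporters:
  otlp:
    endpoint: tempo:4317   # hypothetical backend address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Swapping backends then means changing the exporter block, with no change to application code.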


PII in Logs

This is a compliance and security issue that every EM should have a clear stance on.

Never log:

  • Passwords, tokens, API keys (even hashed — logging a hash of a password is still bad practice)
  • Full payment card numbers, CVVs
  • SSNs, government IDs
  • Health information

Be careful with:

  • Email addresses (PII in GDPR, CCPA, HIPAA contexts)
  • IP addresses (PII in some jurisdictions)
  • User IDs (if linked to a real person, they’re PII — but generally safer to log as a reference)
  • Full request/response bodies (may contain any of the above)

Practical patterns:

  1. Log field masking: Middleware that strips or masks known PII fields (password, creditCard, ssn) from structured logs
  2. Log level control: Don’t log request bodies at INFO — only at DEBUG, which should be disabled in production
  3. Data classification: Tag log fields by sensitivity. Only certain teams can access logs with PII-tagged fields.
  4. Correlation IDs, not user data: Log the user ID reference (a UUID), not the email or name. Join to user data only when necessary for debugging.
  5. Log retention limits: Keep DEBUG/INFO logs for 30 days, ERROR logs for 90 days. Don’t retain indefinitely.

The accidental logging of PII in a publicly accessible logging system has caused multiple high-profile security incidents. Make PII log hygiene a code review requirement.