Advanced Distributed Systems Concepts · Lesson 6.4

How to design for observability — metrics, logging, and tracing

three pillars of observability, structured logging, distributed tracing, RED metrics, SLO vs SLA, alerting on symptoms not causes

Three Pillars of Observability

You can't fix what you can't see. Observability rests on three pillars: metrics, logs, and traces. Together they tell you what your system is doing and why it is misbehaving, ideally before customers report it.

Metrics — The RED Method

For every service, track:

Rate: requests per second
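Errors: fraction of requests that fail (4xx and 5xx responses, ideally tracked separately, since 4xx usually points to a client mistake and 5xx to a server fault)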
Duration: latency (p50, p95, p99)
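As a rough illustration of the three RED signals, the sketch below computes them from a batch of request records collected over a fixed window. The record shape (`status`, `durationMs`), the function names, and the nearest-rank percentile method are assumptions made for this example; a real service would normally rely on a metrics library rather than compute these by hand.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Computes Rate, Errors, and Duration for one window of requests.
// `requests` is a hypothetical array of { status, durationMs } records.
function redMetrics(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Counts 4xx and 5xx as errors, matching the definition in this lesson.
  const errors = requests.filter((r) => r.status >= 400).length;
  return {
    rate: requests.length / windowSeconds, // requests per second
    errorRate: requests.length === 0 ? 0 : errors / requests.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Made-up sample data for a 60-second window.
const sample = [
  { status: 200, durationMs: 12 },
  { status: 200, durationMs: 15 },
  { status: 500, durationMs: 480 },
  { status: 200, durationMs: 20 },
  { status: 404, durationMs: 8 },
];

console.log(redMetrics(sample, 60));
```

Note how a single slow request dominates p95 and p99 while leaving p50 untouched, which is why the lesson tracks several percentiles rather than an average.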

Structured Logging

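// Minimal stand-in logger for this example (an assumption; in practice use a logging library)
const logger = {
  error: (fields) => console.error(JSON.stringify({ level: 'error', ...fields })),
}

// Bad: unstructured text is hard to search or aggregate
console.log('User 123 failed to login')

// Good: structured JSON with named, queryable fields
logger.error({
  event: 'login_failed',
  user_id: 123,
  reason: 'invalid_password',
  ip: '1.2.3.4',
  timestamp: new Date().toISOString()
})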
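The payoff of structured logs is that they can be queried like data. The sketch below assumes the logs are stored as one JSON object per line and counts login failures by reason; the sample lines and the `countByReason` helper are hypothetical and exist only for illustration.

```javascript
// Hypothetical log lines, one JSON object per line.
const lines = [
  '{"event":"login_failed","user_id":123,"reason":"invalid_password"}',
  '{"event":"login_succeeded","user_id":456}',
  '{"event":"login_failed","user_id":789,"reason":"account_locked"}',
  '{"event":"login_failed","user_id":123,"reason":"invalid_password"}',
];

// Counts entries with the given event name, grouped by their `reason` field.
function countByReason(logLines, eventName) {
  const counts = {};
  for (const line of logLines) {
    const entry = JSON.parse(line);
    if (entry.event !== eventName) continue;
    counts[entry.reason] = (counts[entry.reason] || 0) + 1;
  }
  return counts;
}

console.log(countByReason(lines, 'login_failed'));
```

Answering the same question from free-text messages such as 'User 123 failed to login' would require fragile string matching, and the failure reason would not be available at all.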

Distributed Tracing

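A trace follows a single request across multiple services. Each service contributes a span that records its start time, duration, and metadata, and all spans in a trace share one trace ID. Common tools include Jaeger, Zipkin, and OpenTelemetry. The trace ID propagates from service to service in HTTP headers, for example a custom X-Trace-ID header or the W3C Trace Context traceparent header that OpenTelemetry uses by default.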
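To make the propagation concrete, here is a minimal in-process sketch of two services sharing a trace ID through headers. It is not how Jaeger, Zipkin, or OpenTelemetry are implemented: the helper names, the `x-parent-span-id` header, and the in-memory `recordedSpans` array are all assumptions for illustration, and the service call is simulated with a plain function call instead of a real HTTP request.

```javascript
const { randomBytes } = require('node:crypto');

// Stand-in for a tracing backend: finished spans are simply collected here.
const recordedSpans = [];

// Reuse the caller's trace ID if one was sent, otherwise start a new trace.
function extractTraceId(headers) {
  return headers['x-trace-id'] || randomBytes(16).toString('hex');
}

// Runs `work` inside a span and records start time and duration when it ends.
function withSpan(service, traceId, parentSpanId, work) {
  const spanId = randomBytes(8).toString('hex');
  const startTime = Date.now();
  try {
    return work(spanId);
  } finally {
    recordedSpans.push({
      traceId, spanId, parentSpanId, service,
      startTime, durationMs: Date.now() - startTime,
    });
  }
}

// Hypothetical downstream service.
function handleInventory(headers) {
  const traceId = extractTraceId(headers);
  const parent = headers['x-parent-span-id'] || null;
  return withSpan('inventory', traceId, parent, () => ({ inStock: true }));
}

// Hypothetical upstream service: forwards the trace ID in outgoing headers.
function handleCheckout(headers) {
  const traceId = extractTraceId(headers);
  return withSpan('checkout', traceId, null, (spanId) => {
    const outgoing = { 'x-trace-id': traceId, 'x-parent-span-id': spanId };
    return handleInventory(outgoing); // simulated HTTP call
  });
}

handleCheckout({});
console.log(recordedSpans);
```

Both recorded spans carry the same trace ID, and the inventory span points at the checkout span as its parent, which is what lets a tracing UI reassemble the request into a tree.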

Alerting Philosophy

Alert on symptoms (error rate > 1%, p99 latency > 500ms), not causes (CPU > 80%). A CPU spike that doesn't affect user experience isn't an incident. An error rate spike that users experience immediately is.
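The rule can be sketched as a small evaluation function. The thresholds come from the examples in the paragraph above; the snapshot shape and the `evaluateAlerts` name are assumptions for this sketch, not the interface of any particular alerting tool.

```javascript
// Symptom thresholds taken from the examples in this lesson.
const thresholds = { maxErrorRate: 0.01, maxP99Ms: 500 };

// Returns the list of alerts to fire for one metrics snapshot.
// CPU usage is deliberately not checked: it is a cause, not a symptom,
// and is better kept on a dashboard for diagnosis.
function evaluateAlerts(snapshot) {
  const alerts = [];
  if (snapshot.errorRate > thresholds.maxErrorRate) {
    alerts.push(`error rate ${(snapshot.errorRate * 100).toFixed(1)}% exceeds 1%`);
  }
  if (snapshot.p99Ms > thresholds.maxP99Ms) {
    alerts.push(`p99 latency ${snapshot.p99Ms}ms exceeds 500ms`);
  }
  return alerts;
}

// CPU is high but users are unaffected: nothing fires.
console.log(evaluateAlerts({ errorRate: 0.002, p99Ms: 180, cpuPercent: 95 }));

// CPU looks fine but users see errors and slow responses: both alerts fire.
console.log(evaluateAlerts({ errorRate: 0.03, p99Ms: 620, cpuPercent: 40 }));
```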