Script Valley
System Design: APIs, Caching & Scalability
System Design End-to-End, Lesson 6.5

Observability in production: metrics, logging, and tracing

observability pillars, structured logging, distributed tracing, metrics vs logs, SLO and SLA, alerting on symptoms not causes, correlation IDs, OpenTelemetry

The Three Pillars

Metrics: numeric measurements over time such as requests per second, error rate, p99 latency, cache hit ratio. Used for dashboards, alerting, and capacity planning.

Logs: timestamped event records. Use structured JSON so logs are machine-parseable and queryable. Never log PII. Always include correlation IDs.

Traces: distributed traces follow a single request across multiple services, measuring time spent in each component. Essential for diagnosing latency in microservices.
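
To make the idea of "time spent in each component" concrete, here is a minimal sketch that takes the spans of one trace and works out how much time each service spent on its own work. The span shape (spanId, parentId, service, startMs, endMs) and the function name selfTimeByService are assumptions for illustration, not the schema of any particular tracing backend, and the calculation assumes child spans do not overlap.

```javascript
// Hypothetical spans for a single request. Field names are illustrative only.
const spans = [
  { spanId: 'a', parentId: null, service: 'api-gateway', startMs: 0, endMs: 120 },
  { spanId: 'b', parentId: 'a', service: 'user-service', startMs: 10, endMs: 100 },
  { spanId: 'c', parentId: 'b', service: 'postgres', startMs: 20, endMs: 90 },
];

// Self time = a span's duration minus the time covered by its direct children.
// Assumes children run one after another; parallel children would need interval merging.
function selfTimeByService(allSpans) {
  const childTime = new Map();
  for (const span of allSpans) {
    if (span.parentId !== null) {
      const duration = span.endMs - span.startMs;
      childTime.set(span.parentId, (childTime.get(span.parentId) || 0) + duration);
    }
  }

  const result = {};
  for (const span of allSpans) {
    const self = (span.endMs - span.startMs) - (childTime.get(span.spanId) || 0);
    result[span.service] = (result[span.service] || 0) + self;
  }
  return result;
}

const breakdown = selfTimeByService(spans);
console.log(breakdown); // { 'api-gateway': 30, 'user-service': 20, postgres: 70 }
```

In this made-up trace the database accounts for 70 of the 120 ms, which is the kind of answer a metric or a single service's logs cannot give on their own.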

Structured Logging

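// Assumes a logger that accepts a single object of fields; pino is used here as one example.
const pino = require('pino');
const logger = pino();

// One event per completed request; every field becomes a queryable key.
logger.info({
  event: 'request.completed',
  method: 'GET',
  path: '/users/42',
  statusCode: 200,
  durationMs: 23,
  requestId: 'req-abc-123', // correlation ID shared by every log line for this request
  userId: 42 // an opaque ID, not PII such as a name or email address
});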
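
What such a logger does under the hood can be sketched without any library. The following is a minimal illustration, not a production logger: formatLogLine and REDACTED_FIELDS are hypothetical names, the list of sensitive keys is an assumption, and redaction only covers top-level fields.

```javascript
// Hypothetical set of keys treated as PII; adjust to your own data model.
const REDACTED_FIELDS = new Set(['email', 'password', 'ssn']);

// Build one JSON log line: timestamp and level first, then the caller's fields.
// Only top-level keys are checked; nested objects are passed through unchanged.
function formatLogLine(level, fields, now = new Date()) {
  const entry = { timestamp: now.toISOString(), level };
  for (const [key, value] of Object.entries(fields)) {
    entry[key] = REDACTED_FIELDS.has(key) ? '[REDACTED]' : value;
  }
  return JSON.stringify(entry);
}

// A tiny logger with the same call shape as the example above.
const jsonLogger = {
  info: (fields) => console.log(formatLogLine('info', fields)),
  error: (fields) => console.error(formatLogLine('error', fields)),
};

jsonLogger.info({
  event: 'request.completed',
  statusCode: 200,
  requestId: 'req-abc-123',
  email: 'someone@example.com', // replaced with '[REDACTED]' in the output
});
```

Because each line is a single JSON object, a log pipeline can filter on statusCode or requestId directly instead of parsing free text.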

Correlation IDs and SLO Alerting

Generate a unique requestId at the API gateway. Pass it as a header through every downstream service call and log it everywhere. When a user reports a bug, one ID reconstructs the entire request flow across all services and logs.
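
The flow described above can be sketched in a few framework-free functions. The header name x-request-id is a common convention assumed here, not a standard, and newRequestId, readRequestId, withRequestId, and requestLogger are hypothetical helper names.

```javascript
const { randomUUID } = require('node:crypto');

// Assumed header name; pick one and use it consistently across services.
const REQUEST_ID_HEADER = 'x-request-id';

// At the API gateway: mint a fresh ID for every incoming request.
function newRequestId() {
  return `req-${randomUUID()}`;
}

// In a downstream service: reuse the ID from the incoming headers,
// falling back to a new one if the caller did not send it.
function readRequestId(incomingHeaders) {
  return incomingHeaders[REQUEST_ID_HEADER] || newRequestId();
}

// For outgoing calls: copy the ID into the headers sent to the next service.
function withRequestId(requestId, headers = {}) {
  return { ...headers, [REQUEST_ID_HEADER]: requestId };
}

// A logger bound to one request, so every line carries the same requestId.
function requestLogger(requestId) {
  return (fields) => console.log(JSON.stringify({ ...fields, requestId }));
}

// Gateway side.
const gatewayId = newRequestId();
const gatewayLog = requestLogger(gatewayId);
gatewayLog({ event: 'request.received', path: '/users/42' });
const outgoingHeaders = withRequestId(gatewayId, { accept: 'application/json' });

// Downstream side: the same ID is recovered from the headers.
const downstreamId = readRequestId(outgoingHeaders);
requestLogger(downstreamId)({ event: 'user.lookup', userId: 42 });
```

Both log lines share one requestId, so a single search for that value returns the request's path through every service.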

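Alert on symptoms, not causes: page on an error rate above 1% or a p99 latency above 500 ms, which users actually feel, rather than on CPU above 80%, which may or may not affect them. SLOs (service level objectives) define the reliability targets your service commits to; violating one is what pages an engineer. An SLA is the contractual form of that commitment, usually set looser than the internal SLO. OpenTelemetry provides a vendor-neutral standard for instrumentation across all three pillars.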
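
As a rough sketch of symptom-based alerting, the check below computes the error rate and p99 latency over a window of request records and compares them with the 1% and 500 ms thresholds mentioned above. The record shape (statusCode, durationMs), the function names, and the choice to count only 5xx responses as errors are assumptions for illustration; a real setup would normally express this as a rule in the monitoring system rather than in application code.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Thresholds taken from the text: 1% error rate, 500 ms p99 latency.
const THRESHOLDS = { maxErrorRate: 0.01, maxP99Ms: 500 };

// Evaluate one window of requests. Only 5xx responses count as errors here (an assumption).
function evaluateAlerts(requests, thresholds = THRESHOLDS) {
  const errors = requests.filter((r) => r.statusCode >= 500).length;
  const errorRate = requests.length === 0 ? 0 : errors / requests.length;
  const p99Ms = percentile(requests.map((r) => r.durationMs), 99);

  const alerts = [];
  if (errorRate > thresholds.maxErrorRate) alerts.push('error-rate');
  if (p99Ms > thresholds.maxP99Ms) alerts.push('p99-latency');
  return { errorRate, p99Ms, alerts };
}

// Made-up window: 98 fast successes and 2 slow failures.
const windowRequests = [
  ...Array.from({ length: 98 }, () => ({ statusCode: 200, durationMs: 20 })),
  ...Array.from({ length: 2 }, () => ({ statusCode: 503, durationMs: 900 })),
];

console.log(evaluateAlerts(windowRequests));
// { errorRate: 0.02, p99Ms: 900, alerts: [ 'error-rate', 'p99-latency' ] }
```

Note that nothing in the check looks at CPU or memory: it fires only when users are seeing failures or slow responses, whatever the underlying cause.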