Observability in production: metrics, logging, and tracing
The Three Pillars
Metrics: numeric measurements over time such as requests per second, error rate, p99 latency, cache hit ratio. Used for dashboards, alerting, and capacity planning.
Logs: timestamped event records. Use structured JSON so logs are machine-parseable and queryable. Never log PII. Always include correlation IDs.
Traces: distributed traces follow a single request across multiple services, measuring time spent in each component. Essential for diagnosing latency in microservices.
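To make the tracing pillar concrete, the following is a minimal hand-rolled sketch of what a trace records: one span per component, each with a duration and a link to its parent. It is not the OpenTelemetry API. The `Tracer` class, `withSpan`, `slowestChildSpan`, and the span field names are hypothetical, and the work inside each span is assumed to be synchronous to keep the example short.

```javascript
// Hypothetical minimal tracer (not the OpenTelemetry API): one span per unit
// of work, all sharing the trace ID of the request they belong to.
class Tracer {
  // `now` is injectable so the clock can be replaced in tests.
  constructor(traceId, now = () => performance.now()) {
    this.traceId = traceId;
    this.now = now;
    this.spans = [];
    this.nextSpanId = 1;
  }

  // Runs fn inside a new span and records its duration, even if fn throws.
  // parentSpanId links the span to its caller (null for the root span).
  withSpan(name, parentSpanId, fn) {
    const spanId = this.nextSpanId++;
    const start = this.now();
    try {
      return fn(spanId);
    } finally {
      this.spans.push({
        traceId: this.traceId,
        spanId,
        parentSpanId,
        name,
        durationMs: this.now() - start,
      });
    }
  }
}

// Returns the slowest non-root span, or null if there is none.
// This is usually the first place to look when a request is slow.
function slowestChildSpan(spans) {
  let slowest = null;
  for (const span of spans) {
    if (span.parentSpanId === null) continue;
    if (slowest === null || span.durationMs > slowest.durationMs) slowest = span;
  }
  return slowest;
}
```

A gateway handler would wrap itself in a root span (`parentSpanId` of `null`) and pass the root's span ID to each downstream call it wraps. In a real system the trace ID and parent span ID travel between services in request headers, which an instrumentation library normally handles.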
Structured Logging
logger.info({
  event: 'request.completed',
  method: 'GET',
  path: '/users/42',
  statusCode: 200,
  durationMs: 23,
  requestId: 'req-abc-123',
  userId: 42
});
Correlation IDs and SLO Alerting
Generate a unique requestId at the API gateway. Pass it as a header through every downstream service call and log it everywhere. When a user reports a bug, one ID reconstructs the entire request flow across all services and logs.
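The flow described above can be sketched in a few small helpers. This is an illustration under stated assumptions, not a prescribed implementation: the `x-request-id` header name and the functions `ensureRequestId`, `withRequestId`, and `createRequestLogger` are hypothetical, and reusing an ID that arrives on the request is a design choice suited to services behind the gateway.

```javascript
// Header name is an assumption; some teams use X-Correlation-ID instead.
// Lowercase is used because Node.js lowercases incoming header names.
const REQUEST_ID_HEADER = 'x-request-id';

// Reuse the ID set upstream if present, otherwise mint a new one.
// The default generator relies on the global Web Crypto object
// (available in Node.js 19+ and modern browsers).
function ensureRequestId(headers, generateId = () => globalThis.crypto.randomUUID()) {
  return headers[REQUEST_ID_HEADER] || generateId();
}

// For every downstream call: copy the ID into the outgoing headers.
function withRequestId(outgoingHeaders, requestId) {
  return { ...outgoingHeaders, [REQUEST_ID_HEADER]: requestId };
}

// A logger bound to one request, so no log line can omit the ID.
// `write` receives one JSON string per log line.
function createRequestLogger(requestId, write = console.log) {
  const log = (level, fields) =>
    write(JSON.stringify({ time: new Date().toISOString(), level, ...fields, requestId }));
  return {
    info: (fields) => log('info', fields),
    error: (fields) => log('error', fields),
  };
}
```

A request handler would call `ensureRequestId(req.headers)` once, build a logger with `createRequestLogger(requestId)`, and wrap the headers of each outgoing call with `withRequestId`. Searching the logs for that one ID then returns every line written for the request, in every service.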
Alert on symptoms, not causes: page on an error rate above 1% or a p99 latency above 500 ms, not on CPU above 80%. High CPU may or may not affect users, while failed or slow requests always do. SLOs define the reliability targets your service commits to, and violating one is what pages an engineer. OpenTelemetry provides a vendor-neutral standard for instrumentation across all three pillars.
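As a rough illustration of symptom-based alerting, the sketch below evaluates one window of requests against the two thresholds mentioned above. The function names, the shape of the request records, the nearest-rank percentile method, and the choice to count only 5xx responses as errors are all assumptions made for the example.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of all
// samples at or below it. Returns 0 for an empty window.
function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Thresholds from the text: error rate above 1% or p99 latency above 500 ms.
const THRESHOLDS = { errorRate: 0.01, p99LatencyMs: 500 };

// requests: array of { statusCode, durationMs } for one evaluation window.
// Assumption: only 5xx responses count as errors.
function evaluateAlerts(requests, thresholds = THRESHOLDS) {
  const total = requests.length;
  const errors = requests.filter((r) => r.statusCode >= 500).length;
  const errorRate = total === 0 ? 0 : errors / total;
  const p99LatencyMs = percentile(requests.map((r) => r.durationMs), 99);

  const alerts = [];
  if (errorRate > thresholds.errorRate) alerts.push('error-rate');
  if (p99LatencyMs > thresholds.p99LatencyMs) alerts.push('p99-latency');
  return { errorRate, p99LatencyMs, alerts };
}
```

CPU usage is deliberately absent from the inputs. It can still appear on a dashboard to help explain an alert, but in this sketch only user-visible symptoms can page someone.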
