Monitoring WebSocket servers in production
What to Measure in a WebSocket Server
These are the four metrics every WebSocket server must export:
Connected clients: Export wss.clients.size as a gauge. Spike = viral event or DDoS. Sudden drop = server crash.
Message rate: Increment a counter per message received and sent. Use a sliding window to get messages/sec.
Error rate: Count per-client errors and connection rejections. Rising error rate without rising connections = client bug or protocol mismatch.
Round-trip latency: Measure ping-to-pong latency in your heartbeat and record as a histogram. p99 above 500ms indicates overload.
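The sliding window mentioned for message rate can be sketched in plain JavaScript as below. This is a minimal illustration, not a prescribed implementation: the class name SlidingWindowRate, its methods, and the 10-second default window are all assumptions made for the example. It keeps one bucket per second and averages over the window.

```javascript
// Sliding-window rate meter (hypothetical helper, not part of ws or prom-client).
// Keeps one bucket per second for the last `windowSec` seconds.
class SlidingWindowRate {
  constructor(windowSec = 10) {
    this.windowSec = windowSec;
    this.buckets = new Map(); // epoch second -> message count
  }

  // Record `n` messages at time `nowMs` (defaults to the current time).
  record(n = 1, nowMs = Date.now()) {
    const sec = Math.floor(nowMs / 1000);
    this.buckets.set(sec, (this.buckets.get(sec) || 0) + n);
    this.prune(sec);
  }

  // Drop buckets that have fallen out of the window.
  prune(nowSec) {
    for (const sec of this.buckets.keys()) {
      if (sec <= nowSec - this.windowSec) this.buckets.delete(sec);
    }
  }

  // Average messages per second over the window ending at `nowMs`.
  rate(nowMs = Date.now()) {
    const nowSec = Math.floor(nowMs / 1000);
    this.prune(nowSec);
    let total = 0;
    for (const count of this.buckets.values()) total += count;
    return total / this.windowSec;
  }
}

// Usage: call meter.record() in each 'message' handler, then read meter.rate().
const meter = new SlidingWindowRate(10);
```

If the counter is already scraped by Prometheus, the same figure is usually computed at query time with the PromQL rate() function over the counter, so an in-process meter like this is mainly useful for local logging or load shedding.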
Exposing Metrics with Prometheus
// Assumes `wss` is an existing WebSocket server from the ws package
// and `app` is an existing Express app.
const client = require('prom-client');

// Gauge: current number of open connections.
const connectedClients = new client.Gauge({
  name: 'ws_connected_clients',
  help: 'Number of connected WebSocket clients'
});

// Counter: messages by direction ('in' or 'out').
const messagesTotal = new client.Counter({
  name: 'ws_messages_total',
  help: 'Total messages received and sent',
  labelNames: ['direction']
});

wss.on('connection', (ws) => {
  connectedClients.inc();
  ws.on('message', () => messagesTotal.inc({ direction: 'in' }));
  // Call messagesTotal.inc({ direction: 'out' }) wherever the server sends.
  ws.on('close', () => connectedClients.dec());
});

// Expose /metrics for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
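The example above covers the gauge and the counter. Round-trip latency could be recorded along the following lines. This is a sketch under stated assumptions: createLatencyProbe is a hypothetical helper, the socket is assumed to expose ping() and emit 'pong' as sockets from the ws package do, and the metric name and 30-second interval in the usage comments are illustrative only.

```javascript
// Hypothetical helper: measures ping-to-pong round-trip time on one socket.
// `ws` must expose ping() and on('pong', ...).
// `observe` receives each latency in seconds.
// `now` returns a timestamp in milliseconds (injectable for testing).
function createLatencyProbe(ws, observe, now = Date.now) {
  let sentAt = null; // timestamp of the ping still awaiting its pong

  ws.on('pong', () => {
    if (sentAt === null) return; // unsolicited pong, nothing to measure
    observe((now() - sentAt) / 1000);
    sentAt = null;
  });

  // Call this from the heartbeat timer to send one timed ping.
  return function sendPing() {
    sentAt = now();
    ws.ping();
  };
}

// Usage inside wss.on('connection', (ws) => { ... }), assuming prom-client:
//   const pingLatency = new client.Histogram({
//     name: 'ws_ping_latency_seconds', // assumed metric name
//     help: 'Ping-to-pong round-trip latency in seconds'
//   });
//   const sendPing = createLatencyProbe(ws, (s) => pingLatency.observe(s));
//   const timer = setInterval(sendPing, 30000);
//   ws.on('close', () => clearInterval(timer));
```

Recording seconds rather than milliseconds follows the Prometheus convention of base units; a p99 can then be derived from the histogram buckets at query time.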
