# Monitoring & Observability
Knowing what your application is doing in production.
## Three pillars of observability
### Logs
What happened, in sequence.
- Structured JSON logs with context
- Include: timestamp, level, message, request ID, user ID
- Aggregate with ELK, Loki, or CloudWatch
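A structured JSON log line with the fields above can be produced with nothing but the standard library. This is a minimal sketch: the field names (`request_id`, `user_id`) and logger name are illustrative, not a fixed schema.

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (field names are illustrative)."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
            # Context attached per-call via logging's `extra` dict:
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"request_id": str(uuid.uuid4()), "user_id": "u-42"})
```

One JSON object per line is what log aggregators like ELK or Loki expect, so this format needs no custom parsing downstream.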
### Metrics
Numerical measurements over time.
- Request rate, error rate, latency (RED method)
- CPU, memory, disk, connections (USE method)
- Business metrics (signups, orders, revenue)
- Collect with Prometheus, Datadog, or CloudWatch
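To make "collect metrics" concrete, here is a tiny in-process counter, loosely modeled on the conventions of clients like Prometheus's (monotonic counters, snake_case metric names). The metric names are illustrative; a real setup would use the actual client library.

```python
import threading

class Counter:
    """Monotonic counter, e.g. http_requests_total (name is illustrative)."""
    def __init__(self, name):
        self.name = name
        self._value = 0.0
        self._lock = threading.Lock()  # safe to increment from request threads

    def inc(self, amount=1.0):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value

requests_total = Counter("http_requests_total")
errors_total = Counter("http_errors_total")

# Simulate four requests, one of which failed:
for status in (200, 200, 500, 200):
    requests_total.inc()
    if status >= 500:
        errors_total.inc()
```

A scraper or reporter would then periodically read these values and ship them to the metrics backend.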
### Traces
Request flow across services.
- Distributed tracing with unique trace IDs
- Visualize with Jaeger, Tempo, or Datadog APM
- Essential for microservices debugging
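The core of distributed tracing is propagating one trace ID across every hop. A minimal sketch, assuming an `X-Trace-Id` header (a common convention; the W3C standard header is `traceparent`):

```python
import uuid

def ensure_trace_id(headers):
    """Reuse an incoming trace ID, or start a new trace at the edge."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    return {**headers, "X-Trace-Id": trace_id}

def call_downstream(headers):
    # Forward the same trace ID so the tracing backend can join the spans.
    return ensure_trace_id(headers)

incoming = ensure_trace_id({})          # edge service: no ID yet, mint one
downstream = call_downstream(incoming)  # inner service: reuses the existing ID
```

Because every service logs and reports the same ID, a tool like Jaeger can reassemble the full request path from the individual spans.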
## Key metrics to track
### RED method (for services)
- Rate – requests per second
- Errors – error rate (% of failed requests)
- Duration – latency (p50, p95, p99)
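The three RED numbers can be computed from raw request samples. A sketch, assuming each sample is a `(latency_ms, ok)` pair; the dict keys are illustrative:

```python
from statistics import quantiles

def red_summary(requests, window_seconds):
    """Rate, error %, and latency percentiles from (latency_ms, ok) samples."""
    latencies = sorted(lat for lat, _ok in requests)
    errors = sum(1 for _lat, ok in requests if not ok)
    cuts = quantiles(latencies, n=100)  # 99 cut points; cuts[k-1] ~ k-th percentile
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_pct": 100.0 * errors / len(requests),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Percentiles (rather than averages) are used for Duration because a mean hides the slow tail that p95/p99 expose.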
### USE method (for infrastructure)
- Utilization – % of the resource in use
- Saturation – queue depth, amount of waiting work
- Errors – hardware/system errors
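A USE-style snapshot for a single resource might look like the sketch below: disk utilization from the standard library, with 1-minute load average as a rough saturation proxy (a true saturation signal would be a queue depth; this is a stand-in, and `os.getloadavg` is Unix-only).

```python
import os
import shutil

def disk_use_snapshot(path="/"):
    """Utilization + a saturation proxy for one resource; field names are illustrative."""
    usage = shutil.disk_usage(path)
    return {
        "utilization_pct": 100.0 * usage.used / usage.total,
        # Real saturation needs per-resource queue depth; load average is a crude proxy.
        "load_avg_1m": os.getloadavg()[0],
    }
```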
## Alerting
| Severity | Response time | Example |
|---|---|---|
| P1 Critical | < 15 min | Service down, data loss |
| P2 High | < 1 hour | Error rate > 5% |
| P3 Medium | < 4 hours | Degraded performance |
| P4 Low | Next business day | Warning threshold |
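The severity table maps naturally to a classification rule in alert routing. A sketch mirroring the table; the degraded-performance threshold (>1%) is an assumption, not from the table:

```python
# Response-time SLA per severity, in minutes; None = next business day.
SEVERITY_SLA_MINUTES = {"P1": 15, "P2": 60, "P3": 240, "P4": None}

def classify(error_rate_pct, service_up):
    """Pick a severity from symptoms, per the table above."""
    if not service_up:
        return "P1"                # service down: critical
    if error_rate_pct > 5.0:
        return "P2"                # error rate > 5%: high
    if error_rate_pct > 1.0:
        return "P3"                # degraded performance (threshold is an assumption)
    return "P4"
```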
### Alerting best practices
- Alert on symptoms, not causes (high error rate, not CPU usage)
- Every alert must be actionable
- Include a runbook link in every alert
- Avoid alert fatigue – if an alert is routinely ignored, fix it or remove it
- Use escalation policies for critical alerts
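Two of these practices (symptom-based alerts, mandatory runbook links) can be enforced at alert-creation time. A sketch; the field names and runbook URL are hypothetical:

```python
def make_alert(name, summary, runbook_url, severity):
    """Construct an alert payload; refuses alerts that lack a runbook link."""
    if not runbook_url:
        raise ValueError("every alert must carry a runbook link")
    return {
        "name": name,
        "summary": summary,      # a symptom, e.g. "error rate > 5% for 10m"
        "severity": severity,
        "runbook": runbook_url,
    }

alert = make_alert(
    name="HighErrorRate",
    summary="error rate > 5% for 10m",  # alert on the symptom, not on CPU usage
    runbook_url="https://wiki.example.com/runbooks/high-error-rate",  # hypothetical URL
    severity="P2",
)
```

Validating at creation time keeps "every alert must be actionable" a hard rule rather than a convention.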
## Dashboards
- Service health: RED metrics per endpoint
- Infrastructure: CPU, memory, disk, network
- Business: Key product metrics
- Keep dashboards simple – 5-8 panels max
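The panel-count guideline can also be enforced in code when dashboards are generated. A sketch with a made-up schema (not any specific tool's format):

```python
def make_service_dashboard(endpoints):
    """One RED panel triplet per endpoint; rejects dashboards over 8 panels."""
    panels = [
        {"title": f"{endpoint} {metric}", "metric": metric}
        for endpoint in endpoints
        for metric in ("rate", "errors", "duration")
    ]
    if len(panels) > 8:
        raise ValueError("keep dashboards simple: 8 panels max")
    return {"title": "Service health (RED)", "panels": panels}
```

Generating dashboards from code (rather than hand-editing) is what makes a guideline like this enforceable at all.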