# Monitoring & Observability
Knowing what your application is doing in production.
## Three pillars of observability
### Logs
What happened, in sequence.
- Structured JSON logs with context
- Include: timestamp, level, message, request ID, user ID
- Aggregate with ELK, Loki, or CloudWatch
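A structured JSON log line with the fields above can be produced with nothing but the standard library. This is a minimal sketch: the field names (`request_id`, `user_id`) and logger name are illustrative, not a fixed schema.

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (field names are illustrative)."""
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
            # Context attached per-call via logging's `extra` dict:
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"request_id": str(uuid.uuid4()), "user_id": "u-42"})
```

One JSON object per line is what log aggregators like ELK or Loki expect, so this format needs no custom parsing downstream.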
### Metrics
Numerical measurements over time.
- Request rate, error rate, latency (RED method)
- CPU, memory, disk, connections (USE method)
- Business metrics (signups, orders, revenue)
- Collect with Prometheus, Datadog, or CloudWatch
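To make "collect metrics" concrete, here is a tiny in-process counter, loosely modeled on the conventions of clients like Prometheus's (monotonic counters, snake_case metric names). The metric names are illustrative; a real setup would use the actual client library.

```python
import threading

class Counter:
    """Monotonic counter, e.g. http_requests_total (name is illustrative)."""
    def __init__(self, name):
        self.name = name
        self._value = 0.0
        self._lock = threading.Lock()  # safe to increment from request threads

    def inc(self, amount=1.0):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value

requests_total = Counter("http_requests_total")
errors_total = Counter("http_errors_total")

# Simulate four requests, one of which failed:
for status in (200, 200, 500, 200):
    requests_total.inc()
    if status >= 500:
        errors_total.inc()
```

A scraper or reporter would then periodically read these values and ship them to the metrics backend.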
### Traces
Request flow across services.
- Distributed tracing with unique trace IDs
- Visualize with Jaeger, Tempo, or Datadog APM
- Essential for microservices debugging
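The core of distributed tracing is propagating one trace ID across every hop. A minimal sketch, assuming an `X-Trace-Id` header (a common convention; the W3C standard header is `traceparent`):

```python
import uuid

def ensure_trace_id(headers):
    """Reuse an incoming trace ID, or start a new trace at the edge."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    return {**headers, "X-Trace-Id": trace_id}

def call_downstream(headers):
    # Forward the same trace ID so the tracing backend can join the spans.
    return ensure_trace_id(headers)

incoming = ensure_trace_id({})          # edge service: no ID yet, mint one
downstream = call_downstream(incoming)  # inner service: reuses the existing ID
```

Because every service logs and reports the same ID, a tool like Jaeger can reassemble the full request path from the individual spans.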
## Key metrics to track
### RED method (for services)
- Rate – requests per second
- Errors – error rate (% of failed requests)
- Duration – latency (p50, p95, p99)
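The three RED numbers can be computed from raw request samples. A sketch, assuming each sample is a `(latency_ms, ok)` pair; the dict keys are illustrative:

```python
from statistics import quantiles

def red_summary(requests, window_seconds):
    """Rate, error %, and latency percentiles from (latency_ms, ok) samples."""
    latencies = sorted(lat for lat, _ok in requests)
    errors = sum(1 for _lat, ok in requests if not ok)
    cuts = quantiles(latencies, n=100)  # 99 cut points; cuts[k-1] ~ k-th percentile
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_pct": 100.0 * errors / len(requests),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Percentiles (rather than averages) are used for Duration because a mean hides the slow tail that p95/p99 expose.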
### USE method (for infrastructure)
- Utilization – % of the resource in use
- Saturation – queue depth, amount of waiting work
- Errors – hardware/system errors
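A USE-style snapshot for a single resource might look like the sketch below: disk utilization from the standard library, with 1-minute load average as a rough saturation proxy (a true saturation signal would be a queue depth; this is a stand-in, and `os.getloadavg` is Unix-only).

```python
import os
import shutil

def disk_use_snapshot(path="/"):
    """Utilization + a saturation proxy for one resource; field names are illustrative."""
    usage = shutil.disk_usage(path)
    return {
        "utilization_pct": 100.0 * usage.used / usage.total,
        # Real saturation needs per-resource queue depth; load average is a crude proxy.
        "load_avg_1m": os.getloadavg()[0],
    }
```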
## Alerting
| Severity | Response time | Example |
|---|---|---|
| P1 Critical | < 15 min | Service down, data loss |
| P2 High | < 1 hour | Error rate > 5% |
| P3 Medium | < 4 hours | Degraded performance |
| P4 Low | Next business day | Warning threshold |
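The severity table maps naturally to a classification rule in alert routing. A sketch mirroring the table; the degraded-performance threshold (>1%) is an assumption, not from the table:

```python
# Response-time SLA per severity, in minutes; None = next business day.
SEVERITY_SLA_MINUTES = {"P1": 15, "P2": 60, "P3": 240, "P4": None}

def classify(error_rate_pct, service_up):
    """Pick a severity from symptoms, per the table above."""
    if not service_up:
        return "P1"                # service down: critical
    if error_rate_pct > 5.0:
        return "P2"                # error rate > 5%: high
    if error_rate_pct > 1.0:
        return "P3"                # degraded performance (threshold is an assumption)
    return "P4"
```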
### Alerting best practices
- Alert on symptoms, not causes (high error rate, not CPU usage)
- Every alert must be actionable
- Include a runbook link in every alert
- Avoid alert fatigue – if an alert is routinely ignored, fix it or remove it
- Use escalation policies for critical alerts
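Two of these practices (symptom-based alerts, mandatory runbook links) can be enforced at alert-creation time. A sketch; the field names and runbook URL are hypothetical:

```python
def make_alert(name, summary, runbook_url, severity):
    """Construct an alert payload; refuses alerts that lack a runbook link."""
    if not runbook_url:
        raise ValueError("every alert must carry a runbook link")
    return {
        "name": name,
        "summary": summary,      # a symptom, e.g. "error rate > 5% for 10m"
        "severity": severity,
        "runbook": runbook_url,
    }

alert = make_alert(
    name="HighErrorRate",
    summary="error rate > 5% for 10m",  # alert on the symptom, not on CPU usage
    runbook_url="https://wiki.example.com/runbooks/high-error-rate",  # hypothetical URL
    severity="P2",
)
```

Validating at creation time keeps "every alert must be actionable" a hard rule rather than a convention.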
## Dashboards
- Service health: RED metrics per endpoint
- Infrastructure: CPU, memory, disk, network
- Business: Key product metrics
- Keep dashboards simple – 5-8 panels max
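The panel-count guideline can also be enforced in code when dashboards are generated. A sketch with a made-up schema (not any specific tool's format):

```python
def make_service_dashboard(endpoints):
    """One RED panel triplet per endpoint; rejects dashboards over 8 panels."""
    panels = [
        {"title": f"{endpoint} {metric}", "metric": metric}
        for endpoint in endpoints
        for metric in ("rate", "errors", "duration")
    ]
    if len(panels) > 8:
        raise ValueError("keep dashboards simple: 8 panels max")
    return {"title": "Service health (RED)", "panels": panels}
```

Generating dashboards from code (rather than hand-editing) is what makes a guideline like this enforceable at all.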