
📑 Monitoring & Observability

Knowing what your application is doing in production.

Three pillars of observability

Logs

What happened, in sequence.

  • Structured JSON logs with context
  • Include: timestamp, level, message, request ID, user ID
  • Aggregate with ELK, Loki, or CloudWatch
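A minimal sketch of structured JSON logging using only Python's standard library; the field names mirror the list above, and the `request_id`/`user_id` context keys are illustrative:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON line with request context."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Context fields attached via logger's `extra=` argument:
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"request_id": "req-123", "user_id": "u-42"})
```

One JSON object per line is what aggregators like ELK, Loki, or CloudWatch parse most easily.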

Metrics

Numerical measurements over time.

  • Request rate, error rate, latency (RED method)
  • CPU, memory, disk, connections (USE method)
  • Business metrics (signups, orders, revenue)
  • Collect with Prometheus, Datadog, or CloudWatch
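A toy in-process recorder for the RED measurements listed above; real services would use a client library (e.g. Prometheus's) rather than this hand-rolled sketch, and the 500-status error rule is an assumption for illustration:

```python
from collections import defaultdict

class Metrics:
    """Minimal in-process RED metrics: rate, errors, duration."""
    def __init__(self):
        self.requests = defaultdict(int)    # Rate: request count per endpoint
        self.errors = defaultdict(int)      # Errors: failed requests per endpoint
        self.latencies = defaultdict(list)  # Duration: latency samples (seconds)

    def observe(self, endpoint, status, seconds):
        self.requests[endpoint] += 1
        if status >= 500:  # assumption: 5xx responses count as errors
            self.errors[endpoint] += 1
        self.latencies[endpoint].append(seconds)

    def error_rate(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

m = Metrics()
m.observe("/orders", 200, 0.012)
m.observe("/orders", 503, 0.950)
print(m.error_rate("/orders"))  # 0.5
```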

Traces

Request flow across services.

  • Distributed tracing with unique trace IDs
  • Visualize with Jaeger, Tempo, or Datadog APM
  • Essential for microservices debugging
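The core of distributed tracing is propagating one trace ID across every hop. A sketch of that idea, assuming a hypothetical `X-Trace-Id` header (the W3C Trace Context standard uses `traceparent` instead):

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for illustration

def get_or_create_trace_id(headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex

def call_downstream(incoming_headers):
    """Build the headers to forward so the whole request shares one trace."""
    trace_id = get_or_create_trace_id(incoming_headers)
    return {TRACE_HEADER: trace_id}

# A request with no trace ID starts a new trace:
print(call_downstream({}))
# A request that already carries one keeps it end to end:
print(call_downstream({TRACE_HEADER: "abc123"}))  # {'X-Trace-Id': 'abc123'}
```

Tools like Jaeger, Tempo, or Datadog APM then join spans from every service by this shared ID.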

Key metrics to track

RED method (for services)

  • Rate β€” Requests per second
  • Errors β€” Error rate (% of failed requests)
  • Duration β€” Latency (p50, p95, p99)
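The Duration percentiles can be computed from raw latency samples with the standard library; a sketch using `statistics.quantiles`, with a uniform 1–100 ms sample set purely for illustration:

```python
import statistics

def latency_percentiles(samples_ms):
    """p50/p95/p99 latency from raw samples, as dashboards report them."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

samples = list(range(1, 101))  # 1..100 ms, uniform for illustration
print(latency_percentiles(samples))  # {'p50': 50.5, 'p95': 95.95, 'p99': 99.99}
```

p99 matters because averages hide tail latency: a handful of very slow requests can dominate user experience while the mean looks healthy.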

USE method (for infrastructure)

  • Utilization β€” % of resource used
  • Saturation β€” Queue depth, waiting work
  • Errors β€” Hardware/system errors
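The three USE numbers for a single resource can be captured in one snapshot; a sketch, using a worker pool with a queue as the example resource:

```python
def use_snapshot(in_use, capacity, queue_depth, error_count):
    """One USE reading for a single resource (e.g. a worker pool)."""
    return {
        "utilization": in_use / capacity,  # % of the resource busy
        "saturation": queue_depth,         # work waiting because it is full
        "errors": error_count,             # hardware/system errors seen
    }

# A pool of 8 workers, all busy, with 5 jobs queued behind them:
print(use_snapshot(in_use=8, capacity=8, queue_depth=5, error_count=0))
```

Note that saturation, not utilization, is the early warning: a resource at 100% utilization with an empty queue is busy but healthy, while a growing queue means demand exceeds capacity.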

Alerting

| Severity | Response time | Example |
| --- | --- | --- |
| P1 Critical | < 15 min | Service down, data loss |
| P2 High | < 1 hour | Error rate > 5% |
| P3 Medium | < 4 hours | Degraded performance |
| P4 Low | Next business day | Warning threshold |
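The table above can be encoded so paging tools route alerts consistently; a sketch in which the P3/P4 thresholds are invented for illustration (only the P1 and P2 conditions come from the table):

```python
RESPONSE_TIME_MINUTES = {  # mirrors the severity table above
    "P1": 15,
    "P2": 60,
    "P3": 240,
    "P4": None,  # next business day
}

def classify(error_rate, service_down):
    """Map observed symptoms to a severity level."""
    if service_down:
        return "P1"          # service down => critical
    if error_rate > 0.05:
        return "P2"          # error rate > 5% => high
    if error_rate > 0.01:    # hypothetical threshold for "degraded"
        return "P3"
    return "P4"

print(classify(error_rate=0.08, service_down=False))  # P2
```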

Alerting best practices

  • Alert on symptoms, not causes (high error rate, not CPU usage)
  • Every alert must be actionable
  • Include a runbook link in every alert
  • Avoid alert fatigue β€” if you ignore it, fix or remove it
  • Use escalation policies for critical alerts
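The first three practices can be sketched together: an alert rule that fires on a symptom (error rate, not CPU) and always carries a runbook link. The names, threshold, and runbook URL are placeholders:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    name: str
    condition: str
    runbook_url: str  # every alert links to its runbook

def evaluate(error_rate, threshold=0.05) -> Optional[Alert]:
    """Fire on the symptom (error rate), not an underlying cause like CPU."""
    if error_rate > threshold:
        return Alert(
            name="HighErrorRate",
            condition=f"error rate {error_rate:.1%} > {threshold:.0%}",
            runbook_url="https://wiki.example.com/runbooks/high-error-rate",  # placeholder
        )
    return None  # below threshold: nothing actionable, so no alert

print(evaluate(0.08))
```

Returning `None` below the threshold is the "actionable" rule in code form: if there is nothing for the on-call engineer to do, no alert should exist.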

Dashboards

  • Service health: RED metrics per endpoint
  • Infrastructure: CPU, memory, disk, network
  • Business: Key product metrics
  • Keep dashboards simple β€” 5-8 panels max

Pergame Knowledge Base