Monitoring and Observability: Building Confidence in Production

Monitoring vs. Observability

Monitoring and observability are related but distinct concepts. Monitoring is about checking known metrics against predefined thresholds — is the CPU above 90%? Is the site up? Observability goes further: it's the ability to ask arbitrary questions about your system's behavior without having anticipated every possible question in advance.

A common shorthand: monitoring tells you that something is wrong; observability helps you figure out why.

The Three Pillars

Modern observability rests on three pillars:

Metrics — Quantitative measurements over time (CPU usage, request latency, error rates).
Logs — Discrete, timestamped records of events (application errors, access records, deployment events).
Traces — End-to-end records of a request as it flows through distributed services.

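Together, they give you complementary views of system health and behavior: metrics show that something changed, logs record what happened, and traces show where in the request path it happened.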

Setting Up Prometheus and Grafana

Prometheus is a time-series metrics database that pulls (scrapes) metrics from your services over HTTP at a fixed interval. Grafana queries that data and renders it as dashboards.

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app:8080']

Instrument your application to expose a /metrics endpoint. For a Node.js app using the prom-client library:

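// Assumes an Express app; adapt the route registration to your framework.
const express = require('express');
const client = require('prom-client');

const app = express();

// Collect default Node.js process metrics, prefixed with myapp_.
client.collectDefaultMetrics({ prefix: 'myapp_' });

// Expose every registered metric in the Prometheus text format.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

// Port 8080 matches the app:8080 scrape target in prometheus.yml.
app.listen(8080);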
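To make the /metrics response less opaque, here is a rough, dependency-free sketch of the text exposition format that Prometheus scrapes. It is not how prom-client is implemented; the `inc` and `render` helpers and the in-memory `counts` map are hypothetical names used only for illustration, and label-value escaping is deliberately left out.

```javascript
// Hypothetical in-memory counter keyed by label set (illustration only).
const counts = new Map();

// Increment the counter for one combination of labels.
function inc(labels) {
  const key = JSON.stringify(labels);
  counts.set(key, (counts.get(key) || 0) + 1);
}

// Render the counter as Prometheus text exposition lines.
// Note: real label values need escaping; this sketch skips it.
function render(name, help) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const [key, value] of counts) {
    const labels = Object.entries(JSON.parse(key))
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    lines.push(`${name}{${labels}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

inc({ method: 'GET', status: '200' });
inc({ method: 'GET', status: '200' });
inc({ method: 'POST', status: '500' });

console.log(render('http_requests_total', 'Total HTTP requests.'));
```

Each distinct label combination becomes its own time series, which is why high-cardinality labels such as user IDs are usually kept out of metrics and left to logs.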

Key Metrics to Track

Not all metrics are created equal. Focus on what matters:

RED method (for services): Rate of requests, Errors, and Duration (latency).
USE method (for resources): Utilization, Saturation, and Errors.
Business metrics: Conversion rate, revenue per minute, active users.

A practical Grafana dashboard might include:

Panel	Metric	Alert Threshold
Request Rate	`rate(http_requests_total[5m])`	Below 100 req/s
Error Rate	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`	Above 1% (0.01)
P95 Latency	`histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))`	Above 500ms
CPU Usage	`100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))`	Above 80%
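The rate-based panels above work on counters, which only ever increase. As a simplified sketch of what a per-second rate and an error percentage mean, the snippet below derives both from two counter readings. The `simpleRate` helper and the sample values are hypothetical, and the calculation ignores the counter resets and window-boundary extrapolation that PromQL's `rate()` handles.

```javascript
// Each sample is a counter reading: { t: seconds, value: cumulative count }.
// Simplified per-second rate: (last - first) / elapsed seconds.
function simpleRate(samples) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.value - first.value) / (last.t - first.t);
}

// Hypothetical readings at the start and end of a 5-minute (300 s) window.
const allRequests = [{ t: 0, value: 10000 }, { t: 300, value: 70000 }];
const serverErrors = [{ t: 0, value: 50 }, { t: 300, value: 950 }];

const requestRate = simpleRate(allRequests); // requests per second
const errorRate = simpleRate(serverErrors);  // 5xx responses per second
const errorRatio = errorRate / requestRate;  // fraction of requests that failed

console.log({ requestRate, errorRate, errorRatio });
```

A percentage threshold such as "above 1%" only makes sense against the ratio, not against the raw error rate, because the raw rate grows with traffic even when the failure fraction is unchanged.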

Structured Logging with JSON

Unstructured logs are hard to search and analyze. Switch to structured JSON logging:

{
  "timestamp": "2026-05-21T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc-123-def",
  "message": "Payment processing failed",
  "error": "Gateway timeout",
  "user_id": "usr_456"
}

Tools like the ELK stack (Elasticsearch, Logstash, Kibana) or Grafana Loki excel at ingesting and querying structured logs at scale.
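As a minimal sketch of emitting records in that shape from Node.js without a logging library, the helper below writes one JSON object per line. The `formatLog` and `logError` names are hypothetical; in practice an established logger would usually replace them.

```javascript
// Build one structured log record as a single-line JSON string.
// Extra fields (trace_id, user_id, ...) are merged into the record.
function formatLog(level, service, message, fields = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service,
    message,
    ...fields,
  });
}

// Write an ERROR record to stderr, one JSON object per line.
function logError(service, message, fields) {
  console.error(formatLog('ERROR', service, message, fields));
}

logError('payment-api', 'Payment processing failed', {
  trace_id: 'abc-123-def',
  error: 'Gateway timeout',
  user_id: 'usr_456',
});
```

Carrying the same `trace_id` in every record for a request is what lets you jump from a log line to the matching trace.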

Setting Up Alerts

Alerts should be actionable, not noisy. Follow the rule: every alert must have a clear runbook.

# Example Prometheus alerting rule
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # sum() drops the status label so both sides of the division match.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 2 minutes."
          runbook: "https://wiki.internal/runbooks/high-error-rate"
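The `for: 2m` clause is what keeps a brief spike from paging anyone: the expression has to stay true across consecutive evaluations for the whole duration before the alert fires. The sketch below models that pending-to-firing behavior in plain JavaScript. It is a simplification, not Prometheus's implementation, and `createAlert` and the sample error ratios are hypothetical.

```javascript
// Model of a `for:` duration: the condition must hold on every evaluation
// for at least forSeconds before the state moves from pending to firing.
function createAlert(forSeconds) {
  let activeSince = null; // time the condition first became true
  return function evaluate(now, conditionTrue) {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (activeSince === null) activeSince = now;
    return now - activeSince >= forSeconds ? 'firing' : 'pending';
  };
}

const evaluate = createAlert(120); // for: 2m

// Hypothetical error ratios sampled every 15 s (the evaluation_interval):
// a short spike that recovers, then a sustained problem.
const errorRatios = [0.02, 0.08, 0.09, 0.03, 0.07, 0.07, 0.07, 0.07,
                     0.07, 0.07, 0.07, 0.07, 0.07];

const states = errorRatios.map((ratio, i) => evaluate(i * 15, ratio > 0.05));
console.log(states);
```

In this run the early spike only reaches pending and then resets, while the sustained breach eventually fires. A longer `for` trades detection speed for fewer false alarms.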

Conclusion

Observability isn't a one-time setup — it's an ongoing practice. Start with the basics: collect metrics, centralize logs, and set up sensible alerts. Then iterate based on what incidents reveal about your blind spots. The goal isn't to prevent every outage; it's to reduce the time between "something is wrong" and "we know what's wrong and how to fix it."
