Monitoring vs. Observability
Monitoring and observability are related but distinct concepts. Monitoring is about checking known metrics against predefined thresholds — is the CPU above 90%? Is the site up? Observability goes further: it's the ability to ask arbitrary questions about your system's behavior without having anticipated every possible question in advance.
A common way to put the difference: monitoring tells you that something is wrong; observability helps you figure out why.
The Three Pillars
Modern observability rests on three pillars:
- Metrics — Quantitative measurements over time (CPU usage, request latency, error rates).
- Logs — Discrete, timestamped records of events (application errors, access records, deployment events).
- Traces — End-to-end records of a request as it flows through distributed services.
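Together, they give you a much fuller picture of system health and behavior than any single signal can: metrics show that something changed, logs record what happened, and traces show where in the request path it happened.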
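To make the distinction between the three signals concrete, here is a minimal sketch of one request producing all of them. The in-memory `metrics`, `logs`, and `spans` objects and the `handleRequest` function are hypothetical stand-ins for real backends, not any library's API.

```javascript
// Hypothetical in-memory sinks standing in for a metrics backend,
// a log pipeline, and a tracing backend.
const metrics = { http_requests_total: 0 };
const logs = [];
const spans = [];

function handleRequest(traceId, work) {
  const start = Date.now();
  let status = 200;
  try {
    work();
  } catch (err) {
    status = 500;
    // Log: a discrete, timestamped event that carries the trace ID
    logs.push({
      timestamp: new Date().toISOString(),
      level: 'ERROR',
      trace_id: traceId,
      message: err.message,
    });
  }
  // Metric: an aggregate count with no per-request detail
  metrics.http_requests_total += 1;
  // Trace: one span describing this hop of the request
  spans.push({
    trace_id: traceId,
    name: 'handleRequest',
    duration_ms: Date.now() - start,
    status,
  });
  return status;
}

const status = handleRequest('abc-123-def', () => {
  throw new Error('Gateway timeout');
});
```

The shared `trace_id` is what lets you jump from an error log to the matching span, which is why the structured log example later in this article carries one too.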
Setting Up Prometheus and Grafana
Prometheus is a time-series database that pulls (scrapes) metrics from your services over HTTP at a fixed interval. Grafana queries Prometheus and turns that data into dashboards.
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'application'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app:8080']
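Instrument your application to expose a /metrics endpoint. For a Node.js app using Express and the prom-client library:
const express = require('express');
const client = require('prom-client');

const app = express();
// Collect default Node.js process metrics, prefixed with myapp_
client.collectDefaultMetrics({ prefix: 'myapp_' });

// Serve every registered metric in the Prometheus text format
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080); // matches the 'app:8080' scrape target in prometheus.yml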
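It helps to see what Prometheus actually reads when it scrapes that endpoint: a plain-text payload with one line per time series. The sketch below renders a single counter in that text format by hand. In practice prom-client generates this for you; `renderCounter` and the metric name are hypothetical, and the sketch does not escape special characters in label values.

```javascript
// Render one counter in the Prometheus text exposition format.
// Simplified: label values are assumed to need no escaping.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${val}"`)
      .join(',');
    // Series without labels are written as "name value"
    lines.push(labelText ? `${name}{${labelText}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const output = renderCounter('myapp_http_requests_total', 'Total HTTP requests.', [
  { labels: { method: 'GET', status: '200' }, value: 1027 },
  { labels: { method: 'POST', status: '500' }, value: 3 },
]);
console.log(output);
```

Each scrape stores the current value of every line as a new sample, which is why counters only ever need to go up: functions like rate() work out the per-second change afterwards.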
Key Metrics to Track
Not all metrics are created equal. Focus on what matters:
- RED method (for services): Rate of requests, Errors, and Duration (latency).
- USE method (for resources): Utilization, Saturation, and Errors.
- Business metrics: Conversion rate, revenue per minute, active users.
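As a rough illustration of the RED method, the sketch below derives rate, error ratio, and a latency percentile from a window of raw request records. The record shape and the `redMetrics` function are assumptions made for this example; a real setup would compute these in Prometheus from counters and histograms rather than in application code.

```javascript
// Compute RED metrics from a window of request records.
// Assumed record shape: { status: number, durationMs: number }.
function redMetrics(requests, windowSeconds) {
  const total = requests.length;
  const errors = requests.filter((r) => r.status >= 500).length;
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: smallest value with at least 95% of samples at or below it
  const p95Index = Math.max(0, Math.ceil(0.95 * total) - 1);
  return {
    ratePerSecond: total / windowSeconds,
    errorRatio: total === 0 ? 0 : errors / total,
    p95DurationMs: total === 0 ? null : durations[p95Index],
  };
}

// Four made-up requests observed over a 4-second window
const sample = [
  { status: 200, durationMs: 120 },
  { status: 200, durationMs: 80 },
  { status: 500, durationMs: 900 },
  { status: 200, durationMs: 150 },
];
const red = redMetrics(sample, 4);
console.log(red);
```

With so few samples the p95 is simply the slowest request, which is a reminder that percentiles only become meaningful with enough traffic in the window.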
A practical Grafana dashboard might include:
| Panel | Metric | Alert Threshold |
|---|---|---|
| Request Rate | sum(rate(http_requests_total[5m])) | Below 100 req/s |
| Error Rate | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) | Above 1% |
| P95 Latency | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) | Above 500ms |
| CPU Usage | 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) | Above 80% |
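The P95 latency panel relies on histogram_quantile, which estimates a percentile from cumulative bucket counts rather than from raw durations. The sketch below is a simplified re-implementation of that idea, assuming sorted buckets that end with a +Inf bucket and a quantile in the range 0 < q <= 1. The function name, bucket bounds, and counts are made up for illustration, and it omits edge cases that Prometheus handles.

```javascript
// Estimate a quantile from cumulative histogram buckets, in the spirit of
// PromQL's histogram_quantile. Simplified sketch: assumes buckets are sorted
// by upper bound (le), end with +Inf, and that 0 < q <= 1.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the rank
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  // Linear interpolation inside the bucket
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Cumulative counts of request durations in seconds (made-up data)
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.25, count: 80 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, buckets);
console.log(p95);
```

Here the 95th request out of 100 lands halfway through the 0.5s to 1s bucket, so the estimate is about 0.75 seconds. The result is only as precise as the bucket boundaries, so it pays to place bucket bounds near your latency thresholds.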
Structured Logging with JSON
Unstructured logs are hard to search and analyze. Switch to structured JSON logging:
{
  "timestamp": "2026-05-21T10:30:00Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc-123-def",
  "message": "Payment processing failed",
  "error": "Gateway timeout",
  "user_id": "usr_456"
}
Tools like the ELK stack (Elasticsearch, Logstash, Kibana) or Grafana Loki excel at ingesting and querying structured logs at scale.
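Producing log lines in that shape does not require much code. The following is a minimal sketch of a JSON logger with no dependencies; `createLogger` and its `write` parameter are hypothetical names for this example, and a production service would more likely use an established logging library.

```javascript
// Minimal structured logger: emits one JSON object per line.
// `write` is injectable so the output can be captured or redirected.
function createLogger(service, write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (message, fields = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
      ...fields, // extra context such as trace_id or user_id
    };
    write(JSON.stringify(entry));
    return entry;
  };
  return { info: log('INFO'), error: log('ERROR') };
}

// Capture lines in an array here; omit the second argument to write to stdout
const lines = [];
const logger = createLogger('payment-api', (line) => lines.push(line));
logger.error('Payment processing failed', {
  trace_id: 'abc-123-def',
  error: 'Gateway timeout',
  user_id: 'usr_456',
});
console.log(lines[0]);
```

Because every line is a self-contained JSON object, a log pipeline can index fields such as `trace_id` directly instead of parsing free text.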
Setting Up Alerts
Alerts should be actionable, not noisy. Follow the rule: every alert must have a clear runbook.
# Example Prometheus alerting rule
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        # sum() both sides so the ratio covers all requests, not just series with matching labels
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate has been above 5% for more than 2 minutes."
          runbook: "https://wiki.internal/runbooks/high-error-rate"
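The `for: 2m` clause is what keeps this rule from firing on a brief spike: the condition has to hold at every evaluation for two minutes before the alert moves from pending to firing. The sketch below models that behavior in simplified form. `evaluateAlert` and the sample values are illustrative assumptions, not Prometheus internals.

```javascript
// Simplified model of how a `for:` clause gates an alert.
// Each sample is one rule evaluation: { time (seconds), value }.
function evaluateAlert(samples, { threshold, forSeconds }) {
  let activeSince = null; // when the current streak of true evaluations began
  return samples.map(({ time, value }) => {
    if (value <= threshold) {
      activeSince = null; // any false evaluation resets the streak
      return { time, state: 'inactive' };
    }
    if (activeSince === null) activeSince = time;
    const state = time - activeSince >= forSeconds ? 'firing' : 'pending';
    return { time, state };
  });
}

// Error ratio evaluated every 60 seconds (made-up values)
const states = evaluateAlert(
  [
    { time: 0, value: 0.02 },
    { time: 60, value: 0.08 },
    { time: 120, value: 0.09 },
    { time: 180, value: 0.07 },
    { time: 240, value: 0.01 },
  ],
  { threshold: 0.05, forSeconds: 120 },
).map((s) => s.state);
console.log(states);
```

A longer `for` duration trades detection speed for fewer false pages, so it is worth tuning per alert rather than copying one value everywhere.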
Conclusion
Observability isn't a one-time setup — it's an ongoing practice. Start with the basics: collect metrics, centralize logs, and set up sensible alerts. Then iterate based on what incidents reveal about your blind spots. The goal isn't to prevent every outage; it's to reduce the time between "something is wrong" and "we know what's wrong and how to fix it."