May 21, 2026

SRE Practices for Small Teams: Reliability Without the Overhead

What Is SRE?

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations problems. The term was coined at Google by Ben Treynor Sloss. SRE treats reliability as a software problem: something you can measure, automate, and continuously improve.

A common misconception is that SRE requires a large, dedicated team. In reality, small teams can and should adopt SRE practices. The key is focusing on the principles, not the bureaucracy.

Error Budgets: The Core Concept

An error budget is the amount of unreliability you can "spend" before your service falls short of its reliability target. It's derived from your Service Level Objective (SLO).

If your SLO is 99.9% availability ("three nines"), your error budget is 0.1%, roughly 43 minutes of downtime in a 30-day month. When you've consumed the budget, pause feature releases and prioritize reliability work until the service recovers. When you have budget to spare, ship fast.

Error Budget = 1 - SLO
Example: SLO = 99.9% → Error Budget = 0.1% per month
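
To make the formula concrete, here is a minimal sketch that converts an availability SLO into allowed downtime. The function name `error_budget_minutes` and the 30-day default window are assumptions chosen for illustration, not part of any standard library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    # Error Budget = 1 - SLO, applied to the length of the window
    return (1.0 - slo) * window_minutes


budget = error_budget_minutes(0.999)
print(round(budget, 1))  # 43.2
```

Moving to a stricter SLO shrinks the budget quickly: under the same assumptions, 99.99% leaves only about 4.3 minutes per 30 days.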

This creates a natural tension between feature development and reliability — the same tension that drives good engineering decisions.

Service Level Indicators and Objectives

  • SLO (Service Level Objective) — A target reliability level for your service. This is the goal.
  • SLI (Service Level Indicator) — The actual measurement of reliability. This is the data.
  • SLA (Service Level Agreement) — A contractual commitment, usually with penalties. Avoid SLAs until you're ready for them.
# Example: Calculating an SLI in Python
def calculate_sli(total_requests: int, successful_requests: int) -> float:
    """
    SLI = successful requests / total requests

    Returns 0.0 when there were no requests, so callers should treat an
    empty window as "no data" rather than as an outage.
    """
    if total_requests <= 0:
        return 0.0
    return successful_requests / total_requests

# If you get 9992 successful requests out of 10000:
sli = calculate_sli(10000, 9992)  # Returns 0.9992
# This meets a 99.9% SLO, since 0.9992 >= 0.999
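
The same request counts can also show how much of the error budget has been used. The sketch below is one way to express that for a request-based SLO; `budget_consumed` is a hypothetical helper name, and it assumes failures are simply the requests that did not succeed.

```python
def budget_consumed(total_requests: int, failed_requests: int, slo: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if total_requests <= 0:
        return 0.0  # no traffic, nothing spent
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        # A 100% SLO leaves no budget: any failure overspends it
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures


# 10000 requests with 8 failures against a 99.9% SLO:
used = budget_consumed(10000, 8, 0.999)
print(round(used, 2))  # 0.8
```

Here the SLO allows about 10 failed requests, so 8 failures means roughly 80% of the budget is gone even though the SLI still meets the target.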

Toil: Measure and Reduce It

Google's SRE book defines toil as the kind of work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly with service size.

Small teams are especially vulnerable to toil because every engineer wears multiple hats. Track your toil as a percentage of engineering time:

Toil Percentage = (Toil Hours / Total Engineering Hours) × 100
Target: Below 50%

If your team spends 70% of their time on toil, you need to invest in automation. Even small wins — a script to restart a service, a dashboard to replace manual checks — compound over time.
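
As a rough sketch of the toil formula, the example below totals a week of logged hours. The task names, the hours, and the choice of which tasks count as toil are all made up for illustration; in practice that classification is a team judgment call.

```python
def toil_percentage(toil_hours: float, total_hours: float) -> float:
    """Toil as a percentage of total engineering time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours * 100


# Hypothetical week for one engineer: hours logged per task.
hours_by_task = {
    "manual deploys": 6,
    "restarting services": 3,
    "feature work": 25,
    "code review": 6,
}
# Illustrative classification of which tasks are toil.
toil_tasks = {"manual deploys", "restarting services"}

toil_hours = sum(h for task, h in hours_by_task.items() if task in toil_tasks)
total_hours = sum(hours_by_task.values())
pct = toil_percentage(toil_hours, total_hours)
print(f"Toil: {pct:.1f}% of {total_hours} hours")  # Toil: 22.5% of 40 hours
```

Even a coarse log like this is usually enough to spot the one or two tasks worth automating first.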

Blameless Postmortems

When something goes wrong, the goal is learning, not assigning blame. A blameless postmortem follows a simple structure:

  1. Timeline — What happened, in chronological order.
  2. Impact — How many users were affected, for how long.
  3. Root causes — The technical and process factors that led to the incident.
  4. Action items — Specific, assigned tasks to prevent recurrence.
## Incident Report: Payment API Outage — May 21, 2026

### Timeline
- 14:00 — Latency spikes detected on payment API
- 14:05 — Alerts fire, on-call engineer investigates
- 14:15 — Root cause identified: database connection pool exhaustion
- 14:30 — Connection pool increased, service recovers

### Root Cause
The recent deployment increased transaction volume by 3x, but the database
connection pool was sized for the previous traffic level.

### Action Items
- [ ] Set up auto-scaling for connection pools (assignee: @alice)
- [ ] Add connection pool utilization to dashboards (assignee: @bob)
- [ ] Review deployment thresholds for traffic-sensitive services (assignee: @carol)
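
A postmortem timeline can also be tied back to the error budget. The sketch below assumes the whole interval from the first latency spike (14:00) to recovery (14:30) counts as downtime, and that the service has a 99.9% availability SLO over a 30-day window; both are assumptions for illustration, and `incident_minutes` is a hypothetical helper.

```python
from datetime import datetime


def incident_minutes(start: str, end: str) -> float:
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


downtime = incident_minutes("14:00", "14:30")  # 30.0 minutes
monthly_budget = (1 - 0.999) * 30 * 24 * 60    # about 43.2 minutes
share = downtime / monthly_budget
print(f"{downtime:.0f} min of downtime used {share:.0%} of the monthly budget")
```

Under these assumptions a single 30-minute incident uses roughly 69% of the month's budget, which is a useful figure to record in the Impact section.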

Practical First Steps for Small Teams

  1. Define one SLO for your most critical service. Start simple.
  2. Build a single dashboard showing that SLO's SLI over time.
  3. Run a blameless postmortem after every significant incident.
  4. Automate one repetitive task each sprint.
  5. Track error budget consumption and make release decisions based on it.
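
For the last step, one possible way to turn budget tracking into a release decision is sketched below. The three outcomes and the rule of comparing budget spent against time elapsed are assumptions for illustration, not a prescribed policy; tune them to your own risk tolerance.

```python
def release_decision(budget_consumed: float, window_elapsed: float) -> str:
    """
    Suggest a release posture.

    budget_consumed: fraction of the error budget spent (1.0 = all of it)
    window_elapsed:  fraction of the SLO window that has passed (0.0 to 1.0)
    """
    if budget_consumed >= 1.0:
        return "freeze"   # budget is gone: reliability work only
    if window_elapsed > 0 and budget_consumed / window_elapsed > 1.0:
        return "caution"  # spending faster than the window allows
    return "ship"         # on or under pace: keep releasing


print(release_decision(0.2, 0.5))  # ship
print(release_decision(0.6, 0.3))  # caution
print(release_decision(1.2, 0.5))  # freeze
```

The ratio of budget spent to time elapsed is often called a burn rate: a value above 1 means the budget will run out before the window ends if nothing changes.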

Conclusion

SRE isn't about hiring a reliability army or building a massive monitoring platform. It's about cultivating a mindset: measure what matters, reduce toil, learn from incidents, and let data drive your trade-offs between speed and reliability. Small teams that adopt these practices early gain a compounding advantage in stability and developer happiness.