AI-Native Engineering
Agentic SRE — CultureTech Playbook
Context
“AIOps with LLMs” is a popular promise. The reality: most of the impressive demos are agents that correlate logs and produce a plausible response, with no guarantee of correctness, no rollback, and no audit trail.
For SRE that’s not enough. SRE operates critical infrastructure. An agent that “vibe-checks” a Kafka corruption and restarts the cluster without human confirmation is itself an incident.
This playbook describes how to apply agents to SRE while keeping the operational rigor the discipline demands.
3 fundamental rules
Rule 1: the agent’s state is legible
Every agent must have an explicit finite state machine. “The LLM reasons” is not enough. States like:
- idle: waiting for a signal.
- triaging: classifying an alert.
- executing-runbook: executing a step of runbook X.
- escalating-human: the case exceeded the agent's confidence.
The operating human must be able to answer “what is the agent doing now?” by looking at the state, not inferring from the last log.
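A minimal sketch of such a state machine, to make the rule concrete. The four states come from the list above; the allowed transitions, the `AgentFSM` class, and the `describe` method are assumptions made for illustration, since the playbook does not spell out the edges between states.

```python
from enum import Enum


class AgentState(Enum):
    IDLE = "idle"
    TRIAGING = "triaging"
    EXECUTING_RUNBOOK = "executing-runbook"
    ESCALATING_HUMAN = "escalating-human"


# Allowed transitions. These edges are an assumption for illustration;
# adapt them to the agent's real lifecycle.
TRANSITIONS = {
    AgentState.IDLE: {AgentState.TRIAGING},
    AgentState.TRIAGING: {
        AgentState.EXECUTING_RUNBOOK,
        AgentState.ESCALATING_HUMAN,
        AgentState.IDLE,
    },
    AgentState.EXECUTING_RUNBOOK: {AgentState.ESCALATING_HUMAN, AgentState.IDLE},
    AgentState.ESCALATING_HUMAN: {AgentState.EXECUTING_RUNBOOK, AgentState.IDLE},
}


class AgentFSM:
    def __init__(self):
        self.state = AgentState.IDLE
        self.detail = ""  # e.g. the alert being triaged or the runbook in progress

    def transition(self, new_state, detail=""):
        # Reject any move that is not declared in TRANSITIONS.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self.detail = detail

    def describe(self):
        """Answer 'what is the agent doing now?' from the state alone."""
        if self.detail:
            return f"{self.state.value}: {self.detail}"
        return self.state.value
```

Because every move goes through `transition`, an undeclared jump (for example from idle straight to executing-runbook) fails loudly instead of silently leaving the agent in an unknown state.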
Rule 2: runbooks are contracts, not suggestions
The agent must execute declared runbooks, not invent steps. Each runbook is a graph of steps with:
- Verifiable precondition.
- Concrete action.
- Verifiable postcondition.
- Rollback mechanism.
The LLM chooses which runbook to execute, not which individual steps to take. This is similar to how a human SRE consults their wiki — the difference is discipline, not creativity.
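The step contract above can be sketched as a small data structure plus an executor. This is an illustrative sketch, not a reference implementation: the `Step` and `run_runbook` names are hypothetical, and the runbook is simplified from a graph to a linear list of steps.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    precondition: Callable[[], bool]   # must hold before the action runs
    action: Callable[[], None]         # the concrete change
    postcondition: Callable[[], bool]  # must hold after the action runs
    rollback: Callable[[], None]       # undoes the action


def _rollback(done: List[Step]) -> None:
    # Undo completed steps in reverse order.
    for step in reversed(done):
        step.rollback()


def run_runbook(steps: List[Step]) -> bool:
    """Run steps in order; if any check fails, roll back what already ran."""
    done: List[Step] = []
    for step in steps:
        if not step.precondition():
            _rollback(done)
            return False
        step.action()
        done.append(step)
        if not step.postcondition():
            _rollback(done)
            return False
    return True
```

The executor never decides what to do next; it only walks declared steps and checks their conditions, which is what keeps a run testable and auditable. Handling an action that raises an exception is left out of this sketch.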
Rule 3: the agent is observable like any other system
Every agent action is logged with:
- Trigger that started it.
- LLM reasoning (prompt + output).
- Chosen runbook + execution state.
- Result (success, failure, escalation).
This goes to the same observability pipeline as the rest of the infra. When the agent fails, it fails with context.
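One way to capture the four fields listed above is a single structured record per agent action. The sketch below assumes JSON lines as the transport, and the field names are hypothetical; the playbook only says the records go to the same pipeline as the rest of the infra.

```python
import json
from datetime import datetime, timezone

VALID_RESULTS = {"success", "failure", "escalation"}


def build_action_record(trigger, prompt, llm_output, runbook, execution_state, result):
    """Assemble one audit record for a single agent action."""
    if result not in VALID_RESULTS:
        raise ValueError(f"unknown result: {result!r}")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,
        "llm": {"prompt": prompt, "output": llm_output},
        "runbook": runbook,
        "execution_state": execution_state,
        "result": result,
    }


def emit(record):
    # One JSON object per line; a log shipper can forward stdout unchanged.
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

Keeping the prompt and the model output in the same record as the chosen runbook means a failed run can be replayed from one log line rather than stitched together from several.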
Anti-patterns
Anti-pattern 1: letting the agent decide runbooks dynamically
This happens when the LLM not only chooses the runbook but also writes the steps on the fly. The result is impossible to audit and impossible to test.
Fix: runbooks are versioned artifacts in the repo. The agent picks from existing runbooks.
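A sketch of how the fix can be enforced in code: the LLM's output is treated as a lookup key into a registry of versioned runbooks, never as instructions. The runbook names are the ones used in this playbook's use case; the file paths and the `resolve_runbook` function are invented for illustration.

```python
from typing import Optional

# Hypothetical registry: runbook name -> path of the versioned file in the repo.
RUNBOOK_REGISTRY = {
    "restart-consumer-group": "runbooks/kafka/restart-consumer-group.yaml",
    "scale-consumer": "runbooks/kafka/scale-consumer.yaml",
    "check-broker-health": "runbooks/kafka/check-broker-health.yaml",
}


def resolve_runbook(llm_choice: str, registry=RUNBOOK_REGISTRY) -> Optional[str]:
    """Return the path of a declared runbook, or None if the LLM named
    something that does not exist in the registry."""
    name = llm_choice.strip().lower()
    return registry.get(name)
```

A `None` result is a signal to escalate to a human rather than to improvise: anything the model proposes outside the registry simply cannot be executed.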
Anti-pattern 2: unlimited trust
This happens when the agent can execute destructive actions (restarting clusters, deleting pods, etc.) without ever requiring human confirmation.
Fix: define a confidence threshold. Below X% confidence, the agent escalates before acting. For specific destructive actions (defined in an explicit list), it always escalates, regardless of confidence.
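The two-part policy in this fix can be sketched as a single check. The threshold value and the action names below are placeholders: the playbook leaves "X%" open, and the real list of always-confirm actions is specific to each team.

```python
# Placeholder for the "X%" threshold; tune per domain.
CONFIDENCE_THRESHOLD = 0.80

# Hypothetical list of actions that always need human confirmation.
ALWAYS_CONFIRM = frozenset({"restart-cluster", "delete-pod", "restart-broker"})


def requires_human(action, confidence,
                   threshold=CONFIDENCE_THRESHOLD,
                   always_confirm=ALWAYS_CONFIRM):
    """Return True if the agent must escalate before acting."""
    # Listed actions escalate no matter how confident the agent is.
    if action in always_confirm:
        return True
    # Everything else escalates only below the confidence threshold.
    return confidence < threshold
```

Checking the list first makes the ordering explicit: high confidence can never override the list.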
Anti-pattern 3: a single monolithic agent
This happens when there’s a single “SRE Agent” covering Kafka, Kubernetes, PostgreSQL, and everything else. Complexity explodes and audit becomes impossible.
Fix: one agent per domain (Themis follows this pattern). The Kafka domain has its own agent with its own FSM and its own runbooks. Kubernetes has another.
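A small sketch of the routing this implies: each alert is handed to exactly one domain agent, and alerts no domain claims go to a human. Routing by alert-name prefix and the agent names are assumptions for illustration; this is not a description of how Themis routes alerts.

```python
from typing import Optional

# Hypothetical mapping from alert-name prefix to the owning domain agent.
DOMAIN_BY_PREFIX = {
    "kafka-": "kafka-agent",
    "k8s-": "kubernetes-agent",
    "postgres-": "postgresql-agent",
}


def route_alert(alert_name: str) -> Optional[str]:
    """Return the domain agent that owns this alert, or None if none does."""
    for prefix, agent in DOMAIN_BY_PREFIX.items():
        if alert_name.startswith(prefix):
            return agent
    # Unowned alerts are not given to a catch-all agent.
    return None
```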
Typical use case
Agent receives alert kafka-consumer-lag > 1h on critical topic.
- State: triaging.
- Reasoning: “High lag on topic X. Candidate runbooks: restart-consumer-group, scale-consumer, check-broker-health.”
- Action: consult metrics to choose among the three. Detects broker has 95% CPU.
- State: executing-runbook: check-broker-health.
- Runbook action: process check, GC pause, network.
- Result: 2.5s GC pause detected.
- State: escalating-human. Reason: corrective action (adjust -XX:MaxGCPauseMillis) requires broker restart. Not destructive per se, but the explicit list marks it as “requires human confirmation”.
- Oncall human gets Slack with: original alert + full reasoning + proposed runbook. Approves or adjusts.
Total time from alert to proposal: 90 seconds. Without an agent, the same diagnosis takes an SRE 15-30 minutes.
When NOT to use agentic SRE
- Small SRE team (<5 people). Overhead of maintaining runbooks + agent observability exceeds savings.
- Simple infrastructure (one monolith + DB). Not enough alert density to justify.
- No pre-existing runbook culture. Agentic SRE amplifies good discipline, doesn’t create it.