Production incidents follow a predictable shape. An alert fires. Whoever is on call pulls up the dashboards, checks the logs, looks at the recent deploys, and starts forming a hypothesis about what is wrong. The first hypothesis is usually wrong. The second hypothesis takes another twenty minutes to test. By the time the actual root cause is identified, the customers who hit the bug have already filed support tickets, the team has spent forty minutes in the war room, and the on-call engineer is several espresso shots into a Saturday morning that was supposed to be quiet. Incident response is the most expensive engineering work a team does, measured in stress rather than tokens, and the cost compounds when diagnosis is slow: every extra minute spent diagnosing is another minute the incident keeps affecting customers.
The structural reason diagnosis is slow is that it requires correlating across data sources that are not connected. The logs are in one tool. The metrics are in another. The traces are in a third. The deploy history is in the CI tool. The application source is in the editor. Tying these together requires switching tabs, copying request IDs, scrolling through timelines, and holding the whole picture in working memory. The work is exactly the kind of correlation that an AI tool with structured access to the data sources can do faster than a human, *if* the tool is built for it. Generalist coding assistants are not built for it; they cannot pull data from observability tools, they have no notion of a deploy history, and they treat the codebase as the only data source. The /vigil-incident skill is built for the correlation: it reads the logs, the metrics, the traces, and the deploy timeline, holds them together, and produces a hypothesis with the evidence behind it.
Why generalist AI is the wrong tool for incidents
Ask Cursor or ChatGPT "why is the API returning 503s" and you get a list of generic causes: maybe the database is down, maybe a dependency is failing, maybe the rate limiter is misconfigured. The list is not wrong. It is also useless because it is not grounded in your data. The actual cause is one of those, or a combination, or none of them, and the only way to know is to look at the logs, the metrics, and the deploys for *this* incident at *this* time. A generalist tool cannot do that. It can produce a checklist, and it can suggest commands to run, but it cannot see the data and it cannot correlate across sources. The on-call engineer is doing all of the actual diagnosis work; the tool is providing a checklist that the engineer mostly already knew.
The other failure mode is the codebase-only mental model. Generalist tools are excellent at reasoning about code: "this function does X, this caller passes Y, here is the bug." They are weak at reasoning about deployed systems, where the code is one input but the runtime state, the dependency health, the upstream service status, and the recent configuration changes are equally important. An incident is rarely a pure code bug. It is usually code that worked in isolation interacting with a runtime condition that was not anticipated: a slow upstream, a saturated cache, a deploy that landed at the same time as a traffic spike. Diagnosing that requires the system view, not the code view. /vigil-incident operates on the system view by reading the observability data alongside the codebase.
What incident response actually requires
The standard incident response loop has four steps. First, characterize the symptom: what is broken, who is affected, when did it start, what is the blast radius. Second, form a hypothesis: based on the symptom and the recent changes, what is the most likely cause. Third, test the hypothesis: pull the logs, metrics, or traces that would confirm or refute it. Fourth, propose a fix: revert the suspect deploy, scale the saturated service, patch the code, fail over the unhealthy region. The loop is iterative; the first hypothesis is rarely right, but the testing step narrows the search quickly when it is grounded in real data. Fast diagnosis is not a matter of intuition; it is a matter of cycling through the loop quickly, with each cycle informed by the evidence the previous cycle produced.
The deploy timeline is often the most important data source and the one most easily overlooked. Most production incidents trace back to a recent change, and the change is often visible in the deploy log if the on-call engineer thinks to look. As a rough rule of thumb, the discipline of "check the deploy timeline first" catches about a third of incidents in the first minute, and the discipline of correlating the symptom timeline with the deploy timeline catches about another third in the first ten minutes. The remaining third are the genuinely hard cases: an upstream dependency degraded, a slow data corruption, a security event. Those require the deeper correlation across logs, metrics, and traces. /vigil-incident is built to apply both disciplines automatically: the deploy timeline is the first input, and the cross-source correlation is the second.
How /vigil-incident works
Step one: gather the symptom and the timeline
When /vigil-incident is invoked, it asks for the symptom in concrete terms (what is broken, when did it start, who is affected) and pulls the deploy timeline for the last several hours. The deploy timeline is correlated with the symptom: did the symptom start within minutes of a deploy, did the affected service change recently, did a configuration change land that touches the suspect path. If the timeline points at a clear suspect, that becomes the first hypothesis without further work. If the timeline is clean, the skill moves to the data sources.
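The timeline check described above can be illustrated with a short sketch. This is not the skill's actual implementation; the `Deploy` type, the `suspects` function, and the 30-minute default window are assumptions made for illustration, and the deploy entries are borrowed from the worked example later in this article with an arbitrary date attached.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:  # hypothetical record type, not part of the skill
    service: str
    ref: str
    at: datetime


def suspects(deploys, symptom_start, affected_service, window=timedelta(minutes=30)):
    """Return deploys that landed within `window` before the symptom.

    Deploys to the affected service sort first, then the most recent ones.
    """
    candidates = [
        d for d in deploys
        if timedelta(0) <= symptom_start - d.at <= window
    ]
    return sorted(
        candidates,
        key=lambda d: (d.service != affected_service, symptom_start - d.at),
    )


day = datetime(2024, 1, 1)  # arbitrary date, only the times matter here
deploys = [
    Deploy("api", "d3a2f1", day.replace(hour=14, minute=18)),
    Deploy("workers", "9f7c2e", day.replace(hour=12, minute=5)),
    Deploy("infra", "terraform apply", day.replace(hour=9, minute=30)),
]
found = suspects(deploys, day.replace(hour=14, minute=23), "api")
print([d.ref for d in found])
```

The window is a judgment call. A narrow window keeps the suspect list short but would miss a slow-burning change such as the 09:30 parameter change in the worked example, so a result from this kind of filter is a starting hypothesis, not a verdict.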
Step two: read logs, metrics, and traces
The skill pulls the relevant data from the observability tools the project uses: error logs filtered to the affected service in the affected window; metrics for the symptoms (latency, error rate, throughput, saturation) plus the leading indicators (upstream latency, queue depth, cache hit rate); and traces for representative failed requests, with the spans inspected for the actual failure point. The data is summarized in the output: "latency p99 spiked from 200ms to 1.5s starting at 14:23, traces show 80% of failures originating in the database query for X, and the schema shows that query is missing an index." The summary is the input to the hypothesis.
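To make the summarization step concrete, here is a minimal sketch of two of the calculations such a summary relies on: finding when a metric left its baseline, and attributing failed traces to the span where they failed. The function names, the 3x threshold, the span names, and the sample numbers are all assumptions chosen to mirror the quoted summary; they do not come from the skill or from any observability vendor's API.

```python
from collections import Counter
from statistics import median


def spike_onset(series, factor=3.0, baseline_points=3):
    """Find the first point that exceeds `factor` times the baseline.

    `series` is a time-ordered list of (timestamp, value) pairs. The baseline
    is the median of the first `baseline_points` values.
    Returns (timestamp, baseline, value), or None if nothing spikes.
    """
    baseline = median(v for _, v in series[:baseline_points])
    for ts, value in series[baseline_points:]:
        if value > factor * baseline:
            return ts, baseline, value
    return None


def failure_share(failing_spans):
    """Fraction of failed traces attributed to each failing span name."""
    counts = Counter(failing_spans)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}


# Made-up p99 latency samples in milliseconds, one per minute.
p99_ms = [("14:20", 200), ("14:21", 210), ("14:22", 195), ("14:23", 1500), ("14:24", 1480)]
onset = spike_onset(p99_ms)

# Made-up sample: the span where each of ten failed traces broke.
shares = failure_share(["db.query"] * 8 + ["http.upstream"] * 2)

print(onset, shares)
```

A fixed multiple of a short baseline is the simplest possible detector. It is enough to show the idea, but real metric series are noisier and would likely need a longer baseline or a different rule.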
Step three: form and test the hypothesis
Based on the data summary, /vigil-incident proposes a hypothesis with the evidence behind it. The hypothesis is structured: "the suspect cause is X because the symptom timeline matches X's deploy at 14:20 and the trace data shows the failure inside the code paths X changed." The skill also lists the alternative hypotheses ranked by likelihood and what would distinguish them. The on-call engineer reads the hypothesis, decides whether the evidence is convincing, and either accepts it or asks for an alternative to be tested. The discipline is the same as a senior SRE applies under pressure: do not commit to a hypothesis until the evidence supports it, but do not stall in analysis when the evidence is conclusive.
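One way to picture a structured hypothesis is as a record that carries its own evidence. The sketch below is an assumption about shape only: the article does not say how the skill ranks likelihood, so the evidence-counting score is a deliberately naive placeholder, and the `Hypothesis` class and the `distinguishing_check` strings are invented for illustration. The evidence strings are taken from the worked example later in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:  # hypothetical structure, not the skill's internal format
    cause: str
    supporting: list = field(default_factory=list)
    contradicting: list = field(default_factory=list)
    distinguishing_check: str = ""  # what would confirm or refute this cause

    @property
    def score(self):
        # Placeholder scoring: each piece of evidence counts once.
        return len(self.supporting) - len(self.contradicting)


def rank(hypotheses):
    """Order hypotheses from best supported to least supported."""
    return sorted(hypotheses, key=lambda h: h.score, reverse=True)


ranked = rank([
    Hypothesis(
        cause="External provider outage",
        contradicting=["status pages clean", "traces show local DB failure"],
        distinguishing_check="look for failing spans on outbound calls",
    ),
    Hypothesis(
        cause="max_connections lowered to 100 while the api pool allows 200",
        supporting=[
            "DB logs show 'remaining connection slots reserved' from 14:23",
            "api connection pool saturated at 14:21",
            "cache hit rate fell from 92% to 41% after d3a2f1",
        ],
        distinguishing_check="compare pool size against the DB connection limit",
    ),
    Hypothesis(
        cause="Pure regression in d3a2f1",
        supporting=["symptom began minutes after the deploy"],
        contradicting=["errors are 'no connection available', not wrong results"],
        distinguishing_check="check whether errors persist with spare DB connections",
    ),
])
for h in ranked:
    print(f"{h.score:+d}  {h.cause}")
```

Keeping the contradicting evidence and the distinguishing check on the record is the useful part: it lets the operator see why an alternative was ranked lower and what to pull next if the leading hypothesis does not hold.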
Step four: propose a fix with rollback
Once the hypothesis is confirmed, /vigil-incident proposes a fix and a rollback path. The fix is calibrated to the severity: a hot fix in code, a configuration change, a service scale-up, a region failover, a deploy revert. The rollback path is the steps to take if the fix makes the situation worse. Both are surfaced before any action is taken, so the operator decides what risk to accept. For deploy reverts, the skill writes the revert commit. For configuration changes, the skill writes the change. For scale-ups, the skill produces the command. The operator runs the action; the skill watches the metrics for the recovery.
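The "both are surfaced before any action is taken" rule can be expressed as a data constraint: a proposal without a rollback path is not a valid proposal. The following sketch assumes hypothetical `ProposedFix`, `render`, and `approved` names and reuses the fix from the worked example later in this article; it illustrates the principle and is not the skill's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedFix:  # hypothetical type for illustration
    summary: str
    steps: tuple
    rollback: tuple

    def __post_init__(self):
        # Refuse to build a proposal that has no way back.
        if not self.rollback:
            raise ValueError("a fix must ship with a rollback path")


def render(fix):
    """Show the fix and its rollback together, before anything is run."""
    lines = [f"Proposed fix: {fix.summary}"]
    lines += [f"  {i}. {step}" for i, step in enumerate(fix.steps, 1)]
    lines.append("Rollback path:")
    lines += [f"  - {step}" for step in fix.rollback]
    return "\n".join(lines)


def approved(answer):
    """Only an explicit 'yes' from the operator counts as approval."""
    return answer.strip().lower() == "yes"


fix = ProposedFix(
    summary="revert max_connections to 200",
    steps=(
        "Apply terraform to revert max_connections to 200",
        "Verify the api connection pool drops below saturation",
    ),
    rollback=("Fail over to standby and restore the previous instance",),
)
print(render(fix))

operator_answer = "no"  # stand-in for the operator's decision
print("run" if approved(operator_answer) else "hold")
```

Treating anything other than an explicit "yes" as a refusal keeps the default safe, which matches the article's point that the operator, not the tool, decides what risk to accept.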
After the incident is resolved, /vigil-incident produces the post-mortem draft from the timeline it has already gathered. The post-mortem is the artifact that catches the systemic issue (the missing alert, the missing index, the missing test) so the next incident is shorter or does not happen at all.
Tonone's /vigil-incident skill leads production incident response by correlating logs, metrics, traces, and deploy history, proposing a hypothesis with evidence and a fix with rollback path.
When to use /vigil-incident, and when not to
/vigil-incident is the right call any time something is broken in production and the team needs structured diagnosis. The typical signals are an alert firing, a customer report that names a specific symptom, or a spike in latency or error rate. The skill is also the right call after the incident is resolved to write the post-mortem; the timeline and evidence are already gathered, so the post-mortem is a structured pass over data the skill already has.
Skip the skill for performance problems that are not active incidents (use /spine-perf for backend hot paths, /flux-query for slow queries). For pre-incident work, /vigil-instrument adds the observability that makes future incidents faster to diagnose, and /vigil-alert writes the SLO-based alerts that catch problems before customers do. For audits of the existing observability posture, /vigil-check produces the gap report.
| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
|---|---|---|---|
| Reads logs, metrics, traces from the project's tools | Yes, integrates with Datadog/Honeycomb/Grafana/Sentry | No data source access | No data source access |
| Correlates symptom with deploy timeline | Yes, deploy history pulled and matched | Generic checklist | Not in scope |
| Hypothesis grounded in actual data | Yes, evidence cited per hypothesis | Generic causes listed | Suggests within current line |
| Fix with rollback path | Yes, both surfaced before action | Fix only, no rollback | Not applicable |
| Post-mortem draft from incident timeline | Yes, generated from data already gathered | Boilerplate template | Not in scope |
A worked example: 503s on the API
Suppose the alert fires: "API error rate above 5% for 3 minutes." Run /vigil-incident and the output looks like the following abbreviated form.
    Symptom: API error rate at 18% (baseline 0.4%), started 14:23 UTC.
    Affected: all customers, all endpoints. Severity: SEV-2.

    Deploy timeline (last 6 hours):
      14:18 - api: deploy d3a2f1 (PR #4521, 'cache invalidation refactor')
      12:05 - workers: deploy 9f7c2e (PR #4519, 'metric tagging fix')
      09:30 - infra: terraform apply (RDS parameter change: max_connections 200 -> 100)

    First hypothesis (HIGH confidence):
      RDS max_connections lowered from 200 to 100 at 09:30.
      Connection pool in api defaults to max 200.
      At 14:23, traffic ramp + new cache miss pattern from d3a2f1 saturated
      the pool and caused 'no available connection' errors.

    Evidence:
      - Database error logs show 'remaining connection slots reserved'
        starting at 14:23.
      - api connection pool metric saturated at 14:21 (within margin of
        the deploy that increased read load).
      - Cache hit rate dropped from 92% to 41% after d3a2f1 deployed.

    Alternative hypotheses (lower likelihood):
      - Pure regression in d3a2f1: less likely because the symptom is
        'no connection available', not 'wrong query result'.
      - External provider outage: status pages clean, traces show local
        DB failure not external.

    Proposed fix (recommended):
      1. Apply terraform to revert max_connections to 200 (5 min).
      2. Verify connection pool drops below saturation in api metrics.
      3. After recovery, investigate cache miss pattern from d3a2f1
         in a follow-up.

    Rollback path:
      If max_connections increase causes RDS instability, fail over
      to standby and restore previous instance.

    Post-mortem captures:
      - terraform change should have triggered a connection-pool review.
      - cache hit rate is not currently alerted; should be.
      - next on-call rotation: add cache hit rate alert.

The output is exactly what a senior SRE would produce after twenty minutes of focused work, except it arrived in two minutes and the operator can immediately act on it. The post-mortem captures section feeds into the follow-up work that prevents the next incident, which is the part that often gets skipped when the team is exhausted at the end of an incident.
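The first post-mortem capture, that a database limit change should trigger a connection-pool review, comes down to simple arithmetic. The sketch below shows that check under stated assumptions: the `pool_overcommit` function is hypothetical, and the single api instance and zero reserved slots are assumptions, since the example does not say how many instances run or how many connections the database reserves.

```python
def pool_overcommit(pool_max_per_instance, instances, db_max_connections, reserved=0):
    """Connections the clients may try to open beyond what the DB accepts.

    Returns 0 when the combined pools fit within the database limit.
    """
    demand = pool_max_per_instance * instances
    capacity = db_max_connections - reserved
    return max(0, demand - capacity)


# Before the 09:30 change: pool max 200 against max_connections 200.
before = pool_overcommit(200, instances=1, db_max_connections=200)

# After the change: the same pool against max_connections 100.
after = pool_overcommit(200, instances=1, db_max_connections=100)

print(before, after)
```

A check like this only flags the potential for exhaustion. Whether the pool actually fills depends on load, which is why in the example the lowered limit stayed latent from 09:30 until the traffic ramp and the cache miss pattern arrived together at 14:23.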
Related skills
/vigil-incident is the reactive skill. The proactive skills that reduce future incident frequency are /vigil-instrument (adds the observability so problems are visible) and /vigil-alert (writes the SLO-based alerts so problems are caught before customers report them). For a structured audit of the current observability posture, /vigil-check produces the gap report.
Install
/vigil-incident ships with the Vigil agent in the Tonone for Claude Code package. Install Tonone, configure the observability tool integrations the project uses, and the skill is available in any Claude Code session when an alert fires.
1. Add to marketplace
2. Install Vigil
Incidents end faster when the diagnosis is grounded in evidence rather than guesses. The skill is built for the structured loop a senior SRE runs under pressure, and it runs that loop in minutes.
Frequently asked questions
- What does /vigil-incident do?
- It leads production incident response by reading logs, metrics, traces, and deploy history, correlating them to form a hypothesis grounded in evidence, and proposing a fix with a rollback path. After resolution, it produces a post-mortem draft.
- How is /vigil-incident different from a generalist AI helping with debugging?
- A generalist returns generic checklists without data access. /vigil-incident integrates with the project's observability tools, pulls the actual data for the incident window, and grounds hypotheses in evidence the operator can verify.
- What observability tools does /vigil-incident support?
- Datadog, Honeycomb, Grafana (with Loki, Tempo, Mimir), Sentry, New Relic, and OpenTelemetry-based stacks are supported. The skill reads from whichever is configured.
- When should I use /vigil-incident?
- When something is broken in production and you need structured diagnosis. Also use it after the incident to write the post-mortem, while the timeline and evidence it gathered are still available.
- Does /vigil-incident execute the fix?
- No. The skill proposes the fix and the rollback path. The operator runs the action so they retain control over the production system. The skill watches the metrics for recovery after the operator has acted.
- How do I install /vigil-incident?
- Install Tonone for Claude Code via the get-started guide at tonone.ai/get-started. /vigil-incident ships with the Vigil agent and requires the project's observability tool credentials configured. Tonone is free and MIT-licensed.
- Is /vigil-incident free?
- Yes. The skill is part of Tonone, which is MIT-licensed. The only cost is Claude Code token usage during the work.
- Does /vigil-incident replace an SRE?
- No. It speeds up the diagnosis loop a senior SRE runs by handling the cross-tool correlation that is most time-consuming. The operator still owns the decision to act and the judgment about severity.