Vigil

Observability + Reliability

Know when something's wrong before your customers do.

SRE and observability specialist who instruments services with structured logging, RED metrics, and distributed tracing. Builds alerting rules paired with runbooks so every alert has a clear remediation path, not just a notification. Leads production incident response from initial diagnosis to root cause to fix. Defines SLOs with error budgets and audits monitoring coverage to find blind spots before launch.

Read the field guide: The AI Observability Engineer for SLOs and Alerts

Install Vigil

1. Add to marketplace

$ claude plugin marketplace add tonone-ai/tonone

2. Install Vigil

$ claude plugin install vigil@tonone-ai

5 skills included.

Install the Engineering team

1. Add to marketplace

$ claude plugin marketplace add tonone-ai/tonone

2. Install the team

$ claude plugin install engineering-team@tonone-ai

15 agents included.

Full installation guide

5 Skills

Everything Vigil can do in your project

Vigil · Observability

/vigil-instrument

Instruments a service with production-grade observability: structured logging with consistent JSON fields and trace context, RED metrics (request rate, error rate, duration histograms) exported to Prometheus or your metrics stack, distributed tracing spans via OpenTelemetry, and health check and readiness endpoints.

When a service has no observability and you cannot…
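
As a rough illustration of the first two outputs described above (structured JSON logs carrying trace context, and RED metrics), here is a minimal standard-library sketch. It is not Vigil's generated code: `JsonFormatter`, `RedMetrics`, and the field names `trace_id` and `span_id` are hypothetical, and a real setup would likely export through a Prometheus client and OpenTelemetry rather than hold counters in memory.

```python
import json
import logging
from collections import defaultdict


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical trace fields, attached via `extra=`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


class RedMetrics:
    """In-memory RED counters: request Rate, Errors, and Durations per route."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.durations = defaultdict(list)

    def observe(self, route: str, status: int, seconds: float) -> None:
        """Record one finished request; 5xx responses count as errors."""
        self.requests[route] += 1
        if status >= 500:
            self.errors[route] += 1
        self.durations[route].append(seconds)

    def error_rate(self, route: str) -> float:
        """Fraction of requests on this route that ended in an error."""
        total = self.requests[route]
        return self.errors[route] / total if total else 0.0
```

To try it, attach `JsonFormatter()` to a `logging.StreamHandler` and log with `extra={"trace_id": ..., "span_id": ...}`; call `RedMetrics.observe` once per request from middleware.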

/vigil-alert

Builds alerting rules grounded in SLOs with paired runbooks for every alert. Defines error budgets, creates alert conditions at the right thresholds to minimize false positives, and writes runbooks with investigation steps and specific remediation options so the on-call engineer knows exactly what to do when paged.

When setting up alerting on a new service and want…
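
To make the error-budget idea concrete, the sketch below computes a budget from an SLO target and a burn rate from observed traffic. It assumes a request-based availability SLO; the function names are hypothetical, and the 14.4 paging threshold is an assumption borrowed from common multi-window burn-rate practice (roughly 2% of a 30-day budget consumed in one hour), not a value taken from Vigil.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / error_budget(slo_target)


def should_page(short_window_burn: float, long_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast, which filters out brief spikes.

    The default threshold is an illustrative assumption, not a Vigil setting.
    """
    return short_window_burn >= threshold and long_window_burn >= threshold
```

Requiring both a short and a long window to exceed the threshold is one way to reduce false positives: a brief blip trips the short window but not the long one.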

/vigil-incident

Leads production incident response: reads logs, metrics, traces, and recent deploy history to diagnose the problem systematically, identifies root cause rather than just symptoms, proposes a fix with rollback options, and documents findings in a structured post-mortem format.

When something is broken in production and you nee…
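
One early diagnostic step implied above, lining up the onset of errors against recent deploy history, can be sketched as follows. The `Deploy` type and both function names are hypothetical, and the data shapes are assumptions. A deploy that precedes the onset is only a lead to investigate, not a confirmed root cause.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Deploy:
    version: str
    at: datetime


def first_breach(samples: Sequence[Tuple[datetime, float]],
                 threshold: float) -> Optional[datetime]:
    """Return the earliest timestamp whose error rate exceeds the threshold."""
    for ts, rate in sorted(samples, key=lambda s: s[0]):
        if rate > threshold:
            return ts
    return None


def suspect_deploy(deploys: Sequence[Deploy],
                   onset: datetime) -> Optional[Deploy]:
    """Return the most recent deploy at or before the error onset, if any."""
    prior = [d for d in deploys if d.at <= onset]
    return max(prior, key=lambda d: d.at) if prior else None
```

If `suspect_deploy` returns a version, its diff and a rollback are reasonable first things to examine; if it returns `None`, the cause is more likely outside the deploy pipeline (traffic, dependencies, infrastructure).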

/vigil-check

Audits monitoring coverage against actual service behavior: checks whether every critical user path has a corresponding alert, identifies monitoring gaps in background jobs, async processors, and queues, and produces a prioritized action plan for closing those gaps before problems occur.

Before a launch to verify the service is observabl…
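
At its core, the coverage audit described above is a comparison between the paths that matter and the paths some alert watches. The sketch below shows that comparison under simple assumptions: `critical_paths` maps a hypothetical path name to an impact score, and `alerts` maps an alert name to the paths it covers. Neither structure comes from Vigil itself.

```python
def coverage_gaps(critical_paths: dict, alerts: dict) -> list:
    """List critical paths that no alert covers, highest impact first.

    critical_paths: path name -> impact score (higher is worse to miss).
    alerts: alert name -> list of path names that alert watches.
    """
    covered = {path for paths in alerts.values() for path in paths}
    uncovered = [p for p in critical_paths if p not in covered]
    # Sort by impact descending, then by name so the order is deterministic.
    return sorted(uncovered, key=lambda p: (-critical_paths[p], p))
```

For example, with `{"checkout": 10, "login": 8, "email-queue": 5}` as critical paths and a single alert covering `checkout`, the result is `["login", "email-queue"]`.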

/vigil-recon

Observability reconnaissance: inventories all monitoring, alerting, logging, and tracing infrastructure. Maps what is covered, what has no coverage at all, and which alerts exist with their current thresholds. Produces a monitoring coverage report that lists the highest-risk gaps, sorted by impact.

When taking over a service and need to understand …