A service without observability is a service the team is operating blind. The team can see whether the process is running. They cannot see how often each endpoint is being called, what the latency distribution looks like, what the error rate is per endpoint, or where time goes inside a slow request. When something is wrong, diagnosis is guesswork: increase the log level, deploy, wait for the next failure, hope the new logs say something useful, repeat. That loop is the work of operating a black box, and it is exactly the work that observability is meant to eliminate. Most services skip instrumentation because its value is invisible until the first incident, at which point the team is fixing the incident instead of adding instrumentation.
Instrumentation done well looks the same across services: structured JSON logs with trace context, RED metrics (request rate, error rate, duration histograms) per endpoint, distributed tracing spans that show where time goes inside a request and across service boundaries, and health checks that actually check dependencies. The discipline is well known and rarely applied, because adding it after the service exists is more work than adding it at the start. The /vigil-instrument skill produces the full set of layers as the default so the service is operable on day one rather than after the first incident.
Why generalist AI ships under-instrumented services
Ask Cursor or ChatGPT for observability on a service. You get a console.log and maybe a Prometheus counter for total requests. The output is technically observability and operationally insufficient. The total request counter cannot be broken down by endpoint, status, or method, so the team cannot see where the errors are. The console.log strings are not structured, so they cannot be queried. There are no traces, so a slow request is a black box. The instrumentation passes the prompt's bar ("add observability") and fails the operational bar.
The other failure mode is inconsistent fields. Logs from one service have a userId field; logs from another have user_id. Logs from one service include a trace ID; logs from another do not. A request that crosses three services produces logs in three different shapes that cannot be correlated. The team adds a query layer to normalize the differences and pays the maintenance cost of that layer indefinitely. The fix is to standardize the fields at instrumentation time, which requires a pattern the team has agreed to and applied consistently. /vigil-instrument produces that consistency.
What production observability requires
A production-instrumented service has four layers:
- Structured logs: JSON output with consistent fields per entry (timestamp, level, message, service, trace_id, span_id, request_id, user_id when applicable), routed to stdout for the platform's log shipper.
- RED metrics: request rate (counter), error rate (counter, broken down by status), duration (histogram with the right buckets), all labeled by endpoint and method.
- Distributed tracing: spans for the request, child spans for downstream calls (database queries, external APIs, queue publishes), context propagation via W3C Trace Context headers so spans correlate across services.
- Health checks: /healthz for liveness (is the process responsive), /readyz for readiness (are dependencies reachable), with the right semantics for each.
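To make the context-propagation layer concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header that carries the trace and span IDs between services. In practice the OpenTelemetry SDK reads and writes this header; the function names `parseTraceParent` and `formatTraceParent` are illustrative assumptions, and the parser handles only the four-field layout of version 00.

```typescript
// Sketch of the W3C `traceparent` header: version-traceId-parentId-flags.
// Helper names are hypothetical; an OpenTelemetry propagator does this for you.
interface TraceParent {
  version: string;  // 2 hex chars, "00" in the current specification
  traceId: string;  // 32 hex chars, shared by every span in the trace
  parentId: string; // 16 hex chars, the caller's span ID
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  // The specification forbids version "ff" and all-zero IDs.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

function formatTraceParent(
  traceId: string,
  spanId: string,
  sampled: boolean,
): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}
```

A downstream service that receives this header keeps the `traceId` and uses the `parentId` as the parent of its own span, which is what lets spans from different services line up in one trace.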
OpenTelemetry is the standard for the metrics and tracing layers because it is vendor-neutral. The same instrumentation produces output that Datadog, Honeycomb, Grafana Tempo, New Relic, and others can ingest, so the team is not locked in. The discipline is to use OpenTelemetry as the instrumentation API and the project's chosen vendor as the destination, with the configuration determining where the data flows.
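One way to picture "configuration determines where the data flows" is the endpoint resolution that OTLP/HTTP exporters perform from environment variables. The sketch below follows the rules described in the OpenTelemetry exporter specification (`OTEL_EXPORTER_OTLP_ENDPOINT` as a base URL, per-signal overrides used as-is, `http://localhost:4318` as the default). The function name `resolveOtlpHttpUrl` is an assumption for illustration; the real SDK exporters do this internally.

```typescript
// Sketch of OTLP/HTTP endpoint resolution from environment variables.
// `resolveOtlpHttpUrl` is a hypothetical helper, not an SDK function.
type Env = Record<string, string | undefined>;
type Signal = 'traces' | 'metrics' | 'logs';

function resolveOtlpHttpUrl(env: Env, signal: Signal): string {
  // A signal-specific variable wins and is used exactly as given.
  const specific = env[`OTEL_EXPORTER_OTLP_${signal.toUpperCase()}_ENDPOINT`];
  if (specific) return specific;

  // Otherwise the base endpoint gets the per-signal path appended.
  const base = env['OTEL_EXPORTER_OTLP_ENDPOINT'] ?? 'http://localhost:4318';
  return `${base.replace(/\/+$/, '')}/v1/${signal}`;
}
```

Switching vendors then amounts to changing the environment, for example pointing the base endpoint at a different collector, while the instrumentation code stays untouched.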
How /vigil-instrument works
Step one: detect the stack and target
When invoked, /vigil-instrument reads the project to detect the language, framework, and existing observability stack (Datadog, Honeycomb, Grafana, New Relic, OpenTelemetry collector). The detection drives the output: the instrumentation API is OpenTelemetry, the export targets are configured for the project's existing tools.
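The source does not describe how detection is implemented. As a purely hypothetical illustration of the idea for a Node.js project, one approach is to scan `package.json` dependencies for well-known vendor packages. The mapping table and the function name `detectVendors` below are assumptions, not the skill's actual logic.

```typescript
// Hypothetical sketch: infer the existing observability stack from
// package.json dependency names. Not the skill's real detection code.
type Vendor = 'datadog' | 'honeycomb' | 'newrelic' | 'opentelemetry';

interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

// Assumed hints: package name prefixes that suggest a vendor is in use.
const VENDOR_HINTS: ReadonlyArray<{ prefix: string; vendor: Vendor }> = [
  { prefix: 'dd-trace', vendor: 'datadog' },
  { prefix: '@honeycombio/', vendor: 'honeycomb' },
  { prefix: 'newrelic', vendor: 'newrelic' },
  { prefix: '@opentelemetry/', vendor: 'opentelemetry' },
];

function detectVendors(pkg: PackageJson): Vendor[] {
  const names = Object.keys({ ...pkg.dependencies, ...pkg.devDependencies });
  const found = new Set<Vendor>();
  for (const name of names) {
    for (const hint of VENDOR_HINTS) {
      if (name.startsWith(hint.prefix)) found.add(hint.vendor);
    }
  }
  // Sorted so the result is stable regardless of dependency order.
  return Array.from(found).sort();
}
```

A result such as `['datadog', 'opentelemetry']` would suggest keeping OpenTelemetry as the instrumentation API and configuring the exporter for the vendor already in place.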
Step two: structured logging with trace context
The skill produces a logger configuration with the standardized fields and trace context propagation. Every log entry includes the active trace and span IDs so logs correlate with traces in the observability tool. The fields are consistent across services so a query that filters by user_id works across the whole system. Logs route to stdout for the platform's log shipper rather than to a custom destination, so the existing shipping infrastructure handles them.
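A minimal sketch of what a standardized entry could look like, assuming the field names listed earlier in this article. The `buildLogLine` helper and the `LogContext` type are illustrative assumptions; a real setup would normally get the same result from a logging library configured once and shared across services.

```typescript
// Sketch of a standardized structured log entry. Field names follow the
// article; `buildLogLine` and `LogContext` are hypothetical helpers.
interface LogEntry {
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error';
  message: string;
  service: string;
  trace_id?: string;
  span_id?: string;
  request_id?: string;
  user_id?: string;
}

interface LogContext {
  traceId?: string;
  spanId?: string;
  requestId?: string;
  userId?: string;
}

function buildLogLine(
  service: string,
  level: LogEntry['level'],
  message: string,
  ctx: LogContext = {},
  now: Date = new Date(),
): string {
  const entry: LogEntry = {
    timestamp: now.toISOString(),
    level,
    message,
    service,
  };
  // Optional fields are added only when present, so every service emits
  // the same snake_case keys and never a null placeholder.
  if (ctx.traceId !== undefined) entry.trace_id = ctx.traceId;
  if (ctx.spanId !== undefined) entry.span_id = ctx.spanId;
  if (ctx.requestId !== undefined) entry.request_id = ctx.requestId;
  if (ctx.userId !== undefined) entry.user_id = ctx.userId;
  return JSON.stringify(entry);
}
```

Writing the returned line to stdout (for example with `console.log`) leaves shipping to the platform, and a query on `trace_id` or `user_id` then matches entries from every service that follows the same shape.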
Step three: RED metrics and tracing
RED metrics are added per endpoint with the right cardinality (label by route template, not by full path with IDs). The duration histogram uses buckets calibrated to the service's expected latency profile. Distributed tracing spans wrap the request handler, with child spans for downstream calls (database, HTTP, queue). Context propagation uses W3C Trace Context headers so spans flow across services without custom plumbing.
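The labeling and bucketing rules can be sketched with a small in-memory recorder. This is not the OpenTelemetry metrics API; it only illustrates what gets counted and under which key. The bucket boundaries in `BUCKETS_MS` and the class name `RedMetrics` are assumptions for illustration; real boundaries would be tuned to the service's latency profile.

```typescript
// In-memory sketch of RED metrics keyed by method, route template, status.
// Bucket bounds (milliseconds) are hypothetical.
const BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500];

interface Series {
  count: number;          // request rate is derived from this counter
  sumMs: number;          // total duration, for averages
  bucketCounts: number[]; // per-bucket (non-cumulative); last slot is overflow
}

class RedMetrics {
  private readonly series = new Map<string, Series>();

  private key(method: string, routeTemplate: string, status: number): string {
    return `${method.toUpperCase()} ${routeTemplate} ${status}`;
  }

  // `routeTemplate` must be the pattern ("/users/:id"), never the concrete
  // path ("/users/42"), or every ID would create a new series.
  record(method: string, routeTemplate: string, status: number, durationMs: number): void {
    const k = this.key(method, routeTemplate, status);
    let s = this.series.get(k);
    if (s === undefined) {
      s = { count: 0, sumMs: 0, bucketCounts: new Array<number>(BUCKETS_MS.length + 1).fill(0) };
      this.series.set(k, s);
    }
    s.count += 1;
    s.sumMs += durationMs;
    // First bucket whose upper bound covers the duration, else overflow.
    const idx = BUCKETS_MS.findIndex((upper) => durationMs <= upper);
    s.bucketCounts[idx === -1 ? BUCKETS_MS.length : idx] += 1;
  }

  get(method: string, routeTemplate: string, status: number): Series | undefined {
    return this.series.get(this.key(method, routeTemplate, status));
  }

  seriesCount(): number {
    return this.series.size;
  }
}
```

Because status is part of the key, the error rate for an endpoint falls out of the same counters: sum the series whose status is 5xx and divide by the total for that route and method.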
Step four: health checks calibrated to dependencies
The /healthz endpoint returns 200 if the process is responsive, so a liveness probe that gets no answer has caught a deadlock. The /readyz endpoint checks dependencies (database, message broker, downstream services) and returns 503 if any required dependency is unreachable. The split lets Kubernetes restart deadlocked pods via liveness while routing traffic correctly via readiness. Each dependency check has a timeout shorter than the readiness probe interval, so a slow check finishes before the next probe starts and checks do not pile up.
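The liveness and readiness split, including the per-check timeout, can be sketched as follows. The `DependencyCheck` shape and the `withTimeout` helper are assumptions for illustration; how the results are wired into HTTP handlers depends on the framework.

```typescript
// Sketch of liveness vs readiness. Types and helper names are hypothetical.
interface DependencyCheck {
  name: string;
  run: () => Promise<void>; // resolves when the dependency is reachable
}

interface Readiness {
  status: 200 | 503;
  failed: string[];
}

// Reject if `p` has not settled within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Readiness: every required dependency must answer within the timeout.
async function readyz(checks: DependencyCheck[], timeoutMs: number): Promise<Readiness> {
  const outcomes = await Promise.all(
    checks.map((check) =>
      // Promise.resolve().then(...) also captures synchronous throws.
      withTimeout(Promise.resolve().then(() => check.run()), timeoutMs, check.name).then(
        () => ({ name: check.name, ok: true }),
        () => ({ name: check.name, ok: false }),
      ),
    ),
  );
  const failed = outcomes.filter((o) => !o.ok).map((o) => o.name);
  return { status: failed.length === 0 ? 200 : 503, failed };
}

// Liveness: answering at all proves the event loop is not stuck.
function healthz(): { status: 200 } {
  return { status: 200 };
}
```

Running the checks in parallel keeps the worst-case readiness latency at one timeout rather than the sum of all of them, which is what keeps the probe comfortably inside its interval.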
High-cardinality metric labels (user_id, request_id) blow up metrics storage. /vigil-instrument labels metrics by route template, status code, and method only; per-request data lives in traces, where the cost is bounded. This single decision keeps metrics infrastructure affordable as the service grows.
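A rough worked calculation shows why this matters. The worst-case number of time series is the product of each label's distinct values, multiplied by the series a histogram emits per label combination. All numbers below are hypothetical, and `worstCaseSeries` is an illustrative helper rather than part of any metrics library.

```typescript
// Back-of-the-envelope cardinality estimate with hypothetical numbers.
interface Label {
  name: string;
  distinctValues: number;
}

function worstCaseSeries(labels: Label[], seriesPerCombination: number): number {
  // Every combination of label values can become its own set of series.
  return labels.reduce((acc, l) => acc * l.distinctValues, seriesPerCombination);
}

const boundedLabels: Label[] = [
  { name: 'route', distinctValues: 40 },
  { name: 'method', distinctValues: 5 },
  { name: 'status', distinctValues: 10 },
];

// Assume 12 series per combination (histogram buckets plus sum and count).
const bounded = worstCaseSeries(boundedLabels, 12);

// Adding a user_id label with 100,000 users multiplies everything by 100,000.
const withUserId = worstCaseSeries(
  [...boundedLabels, { name: 'user_id', distinctValues: 100000 }],
  12,
);

console.log(bounded);    // 24000
console.log(withUserId); // 2400000000
```

The bounded label set tops out at a few tens of thousands of series no matter how many users the service has, while a single per-user label makes the count scale with the user base.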
Tonone's /vigil-instrument skill instruments services with structured logs, RED metrics, distributed tracing, and health checks using OpenTelemetry as the vendor-neutral standard.
When to use /vigil-instrument, and when not to
/vigil-instrument is the right call when a service has no observability and the team cannot tell what it is doing in production, before going on-call for a service the team did not write, or before a launch when the team needs the ability to diagnose problems immediately. The skill is also the right call when an existing service has partial observability (logs but no traces, metrics but with the wrong cardinality) and the team is consolidating.
Skip the skill for one-off scripts where observability is overhead. For SLO-based alerting that uses the metrics this skill produces, /vigil-alert is the right call. For incident response when an alert fires, /vigil-incident leads the diagnosis.
| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
|---|---|---|---|
| Structured logs with trace context | Yes, JSON with trace IDs | console.log strings | Framework default |
| RED metrics per endpoint | Yes, with calibrated cardinality | Total counter only | Vendor-specific |
| Distributed tracing spans | Yes, with W3C context propagation | Not in scope | Vendor-specific |
| OpenTelemetry-based (vendor-neutral) | Yes, switchable destinations | Vendor-specific output | Vendor-specific |
| Liveness vs readiness split | Yes, /healthz and /readyz | Single /health | Often missing |
A worked example: instrumenting a Node.js API
Suppose the brief is: instrument a Node.js API that ships to Datadog. Run /vigil-instrument.
// src/observability/index.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes as A } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

export const sdk = new NodeSDK({
  // Resource attributes identify the service on every span and metric.
  resource: new Resource({
    [A.SERVICE_NAME]: 'billing-api',
    [A.SERVICE_VERSION]: process.env.GIT_SHA ?? 'unknown',
    [A.DEPLOYMENT_ENVIRONMENT]: process.env.ENVIRONMENT ?? 'dev',
  }),
  // With no arguments the exporters read OTEL_EXPORTER_OTLP_ENDPOINT,
  // e.g. the OTLP intake of a Datadog Agent.
  traceExporter: new OTLPTraceExporter(),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 10_000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // skip fs auto-instrument; too noisy
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

// src/observability/logger.ts
import pino from 'pino';
import { trace, context } from '@opentelemetry/api';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    // Attach the active trace and span IDs to every log entry.
    log(obj) {
      const span = trace.getSpan(context.active());
      if (span) {
        const { traceId, spanId } = span.spanContext();
        return { ...obj, trace_id: traceId, span_id: spanId };
      }
      return obj;
    },
  },
});
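
// src/observability/health.ts
import { checkDb, checkRedis } from './deps';

// Reject if the check has not settled within `ms` milliseconds.
function timeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Readiness: fails if any required dependency is unreachable or too slow.
export async function readyz() {
  const checks = await Promise.allSettled([
    timeout(checkDb(), 1500),
    timeout(checkRedis(), 1500),
  ]);
  const failed = checks.filter((c) => c.status === 'rejected');
  if (failed.length > 0) return { ok: false, failed };
  return { ok: true };
}

// Liveness: responding at all shows the process is not deadlocked.
export function healthz() {
  return { ok: true };
}

The result is tracing, metrics, structured logs with trace correlation, and a proper health split. The instrumentation is OpenTelemetry-based so the team can switch destinations without rewriting the service. Datadog ingests OTLP today; if the team moves to Honeycomb tomorrow, the destination changes and the instrumentation does not.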
Related skills
/vigil-instrument produces the observability layer. For SLO-based alerts that use the metrics, /vigil-alert is the right call. For incident response that uses the traces and logs to diagnose, /vigil-incident leads the response. For a check on existing observability gaps, /vigil-check produces the audit.
Install
/vigil-instrument ships with the Vigil agent in the Tonone for Claude Code package. Install Tonone, invoke /vigil-instrument from any Claude Code session, and the skill produces the OpenTelemetry-based instrumentation calibrated to the project's stack and observability vendor.
1. Add to marketplace
2. Install Vigil
Services that operate well are the ones that are observable from day one. The skill is built so the observability layer is the default rather than the post-incident cleanup.
Frequently asked questions
- What does /vigil-instrument do?
- It instruments a service with OpenTelemetry-based observability: structured JSON logs with trace context, RED metrics per endpoint, distributed tracing spans with W3C context propagation, and health checks (liveness and readiness).
- What observability vendors does /vigil-instrument support?
- Datadog, Honeycomb, Grafana (Loki/Tempo/Mimir), New Relic, and any backend that accepts OTLP. The skill uses OpenTelemetry so the destination is configurable per project.
- How is /vigil-instrument different from a generalist adding observability?
- A generalist adds basic logging or a single counter. /vigil-instrument adds the four layers that production services need (logs with trace context, RED metrics, traces, health checks) using consistent patterns that work across services.
- When should I use /vigil-instrument?
- When a service has no observability, before going on-call for an unfamiliar service, or before a launch when the team needs diagnostic capability immediately.
- Does /vigil-instrument cap metric cardinality?
- Yes. Metrics are labeled by route template, status code, and method only; high-cardinality data (user_id, request_id) lives in traces where storage cost is bounded.
- How do I install /vigil-instrument?
- Install Tonone for Claude Code via the get-started guide at tonone.ai/get-started. /vigil-instrument ships with the Vigil agent and is invoked as a slash command in any Claude Code session. Tonone is free and MIT-licensed.
- Is /vigil-instrument free?
- Yes. The skill is part of Tonone, which is MIT-licensed. The only cost is Claude Code token usage during the work plus the observability vendor cost for ingesting the data.
- Does /vigil-instrument support languages other than Node.js?
- Yes. Python (FastAPI, Flask, Django), Go, Rust, Java (Spring Boot), Ruby, and Bun are all supported. OpenTelemetry has SDKs for each, and the skill produces the equivalent setup.