Models decay silently. The accuracy that was 92% at launch is 87% three months later. The latency was 200ms; it is 350ms now. The token cost was $0.003 per request; it crept to $0.012 because input distribution shifted toward longer prompts. Each of these is invisible without an evaluation routine that runs against a reference dataset and tracks the deltas.
The /cortex-eval skill evaluates a deployed model or LLM integration across four dimensions: accuracy regression against a reference dataset, distribution drift on inputs and outputs, latency regression compared to baseline, and cost shifts. The output is a health report with recommended actions: retrain the model, refresh the prompt, switch the provider, or address the upstream data shift.
What the eval covers
- Accuracy: held-out reference set scored against the deployed model with the metrics calibrated to the task.
- Distribution drift: KS test on input features, histogram comparison on output predictions.
- Latency: p99 of the deployed endpoint vs the baseline at launch.
- Cost: token-per-request and total spend trends with breakdown by user or feature.
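To make the drift and latency checks concrete, here is a minimal sketch in plain Python. It is not /cortex-eval's actual implementation. The function names `ks_statistic` and `percentile` are hypothetical, the prompt-length and latency samples are invented for illustration, and the nearest-rank percentile is just one common definition of p99.

```python
from bisect import bisect_right
import math


def ks_statistic(baseline, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs:
    0.0 for identical samples, 1.0 for samples that do not overlap at all.
    """
    a, b = sorted(baseline), sorted(current)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data only: prompt lengths (tokens) at launch vs. today.
launch_prompt_lengths = [120, 150, 180, 200, 220, 250]
recent_prompt_lengths = [400, 450, 500, 520, 600, 650]
input_drift = ks_statistic(launch_prompt_lengths, recent_prompt_lengths)

# Illustrative data only: endpoint latencies in milliseconds.
launch_latencies_ms = list(range(101, 201))
recent_latencies_ms = list(range(251, 351))
p99_delta_ms = percentile(recent_latencies_ms, 99) - percentile(launch_latencies_ms, 99)
```

The statistic alone says how far apart the two samples are. Turning it into a pass/fail decision needs either a p-value (for example from a statistics library) or a threshold chosen for the task.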
How /cortex-eval works
The skill connects to the model serving layer and the production logs, runs the reference set, computes drift, and pulls the latency and cost metrics. It produces the health report with a severity per dimension. Recommended actions are scoped: a small accuracy drop with a stable distribution suggests a refresh; a big distribution shift suggests retraining or a prompt update.
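The severity-per-dimension idea can be sketched as a small function. This is an illustration under stated assumptions, not the skill's real logic: the `Snapshot` type, the `severity` and `health_report` names, and every threshold are hypothetical. Only the two action rules (refresh on an accuracy drop with stable distribution, retrain or update the prompt on a big distribution shift) come from the description above. "Refresh the prompt" is one reading of "a refresh", and the example numbers reuse the figures from the introduction.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    accuracy: float          # fraction correct on the reference set
    p99_latency_ms: float
    cost_per_request: float  # dollars


def severity(change, warn, crit):
    """Map a degradation amount to a label. Thresholds are assumptions."""
    if change >= crit:
        return "critical"
    if change >= warn:
        return "warning"
    return "ok"


def health_report(baseline, current, drift_stat):
    """Build a per-dimension severity report plus scoped actions."""
    report = {
        # Absolute drop in accuracy (hypothetical thresholds: 1 and 3 points).
        "accuracy": severity(baseline.accuracy - current.accuracy, 0.01, 0.03),
        # Drift statistic in [0, 1], e.g. a KS statistic.
        "drift": severity(drift_stat, 0.1, 0.3),
        # Relative increases over the launch baseline.
        "latency": severity(current.p99_latency_ms / baseline.p99_latency_ms - 1, 0.25, 0.5),
        "cost": severity(current.cost_per_request / baseline.cost_per_request - 1, 0.5, 1.0),
    }
    actions = []
    if report["drift"] == "critical":
        actions.append("retrain the model or update the prompt")
    elif report["accuracy"] != "ok" and report["drift"] == "ok":
        actions.append("refresh the prompt")
    # Latency and cost carry a severity only; the source does not say
    # which action maps to them, so none is guessed here.
    return report, actions


launch = Snapshot(accuracy=0.92, p99_latency_ms=200, cost_per_request=0.003)
today = Snapshot(accuracy=0.87, p99_latency_ms=350, cost_per_request=0.012)

stable_report, stable_actions = health_report(launch, today, drift_stat=0.05)
shifted_report, shifted_actions = health_report(launch, today, drift_stat=0.4)
```

Keeping severity separate from actions means thresholds can be tuned per dimension without touching the rules that decide what to recommend.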
Tonone's /cortex-eval skill evaluates deployed models for accuracy regression, distribution drift, latency regression, and cost shifts.
Install
/cortex-eval ships with the Cortex agent in Tonone for Claude Code.
1. Add to marketplace
2. Install Cortex
Frequently asked questions
- What does /cortex-eval do?
- It evaluates a deployed model or LLM integration for accuracy regression, distribution drift, latency regression against baseline, and cost shifts.
- How do I install /cortex-eval?
- Install Tonone for Claude Code via tonone.ai/get-started.