Most LLM features in production today were prompt-engineered through a sequence of edit-paste-eyeball loops. An engineer wrote a prompt, ran it on a few examples, tweaked a sentence, ran it again, eyeballed the output, shipped it. The feature went live. A week later, somebody realized the model was now refusing certain requests, or producing slightly worse summaries, or hallucinating in a corner case nobody had tested. The team did not know whether the regression came from a prompt change, a model upgrade, or a shift in the input distribution. They could not roll back because the prompt history was not versioned. They started edit-paste-eyeball again. That loop is how most prompts in production are maintained, and it is why most LLM features in production are quietly worse than the demo that got them shipped.
Prompt engineering done well looks like the rest of software engineering done well: prompts are versioned, evaluated against representative inputs, and changes are gated by the eval results. A regression in eval scores fails the build the same way a unit test failure does. A model upgrade triggers a rerun of the eval suite to confirm the prompts still hold. The discipline is not exotic; it is the same kind of discipline applied to model code, transferred to the part of the codebase that happens to be natural language. The /cortex-prompt skill is built to apply that discipline: it produces the eval suite, the prompt version, and the documentation of what changed and why, so the team can ship LLM features without flying blind.
Why ad hoc prompt iteration is the wrong default
The eyeball-driven prompt loop has two failure modes that compound. First, the eyeball is a small sample. The handful of examples the engineer ran is not representative; the prompt that wins on those five examples may lose on the next fifty. Second, the eyeball cannot see drift. A change that improves output on the test cases the engineer remembers may regress on cases the engineer has not thought about in months. The two combine into a regression rate that nobody is measuring, which means nobody is fixing it, which means the LLM feature decays slowly until the support tickets force someone to look. Generalist AI tools encourage this loop because they make iterative editing fast; they do not provide the eval scaffolding that would catch the regressions.
The deeper problem is that prompts are part of the system's behavior, and behavior changes need the same review process as code changes. A prompt change that ships without an eval is a behavior change shipping without a test. The fix is not more vigilance; it is the eval suite that runs on every prompt change automatically. Once the suite exists, the iteration loop is fast again, but it is now safe: the engineer can iterate freely, and the suite tells them whether the iteration helped or hurt. That is the loop /cortex-prompt is built around.
What production prompt engineering requires
A production-grade prompt has four parts. First, the versioned prompt itself, with the model name, the model parameters, and any system instructions, all checked into the repository. Second, the eval suite: a representative set of inputs paired with the criteria for a good output, ranging from exact-match assertions for structured tasks to LLM-as-judge or human-rated rubrics for open-ended tasks. Third, the metrics: the rolled-up scores from the eval suite, with regression thresholds that fail the build below the bar. Fourth, the version log: a changelog of prompt edits with the reason for each change and the eval delta it produced. With these in place, the prompt is a first-class artifact in the codebase rather than a string somebody pastes into the call site.
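One way to picture the four parts as a single checked-in artifact is the sketch below. It is illustrative only: the class names `PromptArtifact` and `ChangelogEntry`, their fields, and the example values are assumptions made for this sketch, not the format /cortex-prompt actually emits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangelogEntry:
    """One row of the version log: what changed, why, and the eval delta."""
    version: str
    reason: str
    eval_before: float
    eval_after: float

    @property
    def delta(self) -> float:
        # Rounded so float noise does not show up in the log.
        return round(self.eval_after - self.eval_before, 4)


@dataclass
class PromptArtifact:
    """Hypothetical container for the four parts described above."""
    model: str                 # part 1: versioned prompt (model name)
    params: dict               # part 1: model parameters
    system: str                # part 1: system instructions
    eval_suite_path: str       # part 2: where the eval suite lives
    threshold: float           # part 3: regression threshold for the metric
    changelog: list = field(default_factory=list)  # part 4: version log

    def latest_score(self) -> Optional[float]:
        return self.changelog[-1].eval_after if self.changelog else None

    def passes(self) -> bool:
        # With no recorded score there is nothing to gate on, so fail closed.
        score = self.latest_score()
        return score is not None and score >= self.threshold


artifact = PromptArtifact(
    model="example-model",
    params={"temperature": 0},
    system="You classify support tickets.",
    eval_suite_path="evals/triage.yaml",
    threshold=0.92,
)
artifact.changelog.append(ChangelogEntry("v2", "Removed chain-of-thought", 0.86, 0.89))
artifact.changelog.append(ChangelogEntry("v3", "Added explicit JSON schema", 0.89, 0.94))
```

Keeping the threshold next to the prompt means the build can decide pass or fail from the artifact alone.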
The eval suite is the part teams skip first because it demands the most upfront work. Skipping it is also how most regressions reach production unnoticed, which is why the discipline is to build the suite before shipping the feature. A small representative suite (twenty to fifty cases per task, chosen to cover the input distribution) typically catches more regressions than a thousand random samples, and it costs less to run. The point of the suite is not exhaustive coverage; it is a signal-to-noise ratio high enough that the team can tell whether a change is good or bad.
How /cortex-prompt works
Step one: characterize the task
Before designing the prompt, /cortex-prompt asks for the task in concrete terms: what input the prompt receives, what output it should produce, what counts as success, and what the failure modes are. The task definition becomes the input to the eval design. A summarization task has different evaluation criteria from a classification task, which has different criteria from an open-ended generation task. The skill is opinionated about not writing prompts for underspecified tasks; if the success criteria are vague, it surfaces them as questions before any prompt is written.
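The idea of surfacing gaps as questions can be sketched as a small task specification that reports what is still missing. The `TaskSpec` class and its question wording are hypothetical, chosen only to illustrate the step; the skill's real prompts may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task definition mirroring the four questions above."""
    input_description: str = ""
    output_description: str = ""
    success_criteria: list = field(default_factory=list)
    failure_modes: list = field(default_factory=list)

    def open_questions(self) -> list:
        """Return one question per field that is still unspecified."""
        questions = []
        if not self.input_description.strip():
            questions.append("What input does the prompt receive?")
        if not self.output_description.strip():
            questions.append("What output should the prompt produce?")
        if not self.success_criteria:
            questions.append("What counts as success?")
        if not self.failure_modes:
            questions.append("What are the known failure modes?")
        return questions


spec = TaskSpec(
    input_description="Support ticket body and sender email",
    output_description="JSON with category and customer_name",
)
pending = spec.open_questions()
```

In this sketch, prompt design would only start once `open_questions()` returns an empty list.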
Step two: build the eval suite
The eval suite is generated from a representative input set: real or synthetic examples that cover the expected distribution. Each example is paired with the evaluation method appropriate to the task. For structured tasks (classification, extraction), the method is exact match or a rubric. For free-form tasks, the method is an LLM-as-judge prompt with a calibration sample, or a human rating rubric for cases where automated judging is not reliable. The suite is checked into the repository alongside the prompts so the eval is reproducible and the cost of running it is bounded.
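For the structured case, exact-match scoring can be as small as the sketch below. The example records, the `keyword_stub` predictor, and the function name are illustrative assumptions; in practice `predict` would wrap the real model call.

```python
from typing import Callable


def exact_match_accuracy(examples: list, predict: Callable[[str], str]) -> float:
    """Fraction of examples where the prediction equals the expected label."""
    if not examples:
        raise ValueError("eval suite is empty")
    correct = sum(1 for ex in examples if predict(ex["input"]) == ex["expected"])
    return correct / len(examples)


def keyword_stub(text: str) -> str:
    # Stand-in for a model call so the sketch runs offline.
    return "billing" if "charged" in text.lower() else "other"


examples = [
    {"input": "My card was charged twice last Tuesday", "expected": "billing"},
    {"input": "I cannot log in to my account", "expected": "account"},
    {"input": "Just saying hello", "expected": "other"},
]

accuracy = exact_match_accuracy(examples, keyword_stub)
```

Because the examples and the scoring function are plain data and plain code, the same run can be repeated on any commit.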
Step three: design the prompt with techniques
The prompt is designed using systematic techniques calibrated to the task: structured output formatting with explicit schemas where the downstream code parses results, few-shot examples drawn from the eval suite for tasks where the model benefits from demonstrations, chain-of-thought instructions for tasks that require reasoning, and explicit failure-mode handling for cases where the model would otherwise produce a wrong-shape output. The prompt is run against the eval suite immediately so the baseline scores are recorded as the starting point.
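Explicit failure-mode handling usually pairs the schema in the prompt with a strict parser in the calling code. The sketch below assumes the ticket-triage schema from the worked example later in this article (a `category` from five allowed values plus a `customer_name`). The function and exception names are hypothetical, and allowing a null `customer_name` is an assumption of this sketch.

```python
import json

ALLOWED_CATEGORIES = {"billing", "technical", "account", "sales", "other"}
REQUIRED_FIELDS = {"category", "customer_name"}


class OutputShapeError(ValueError):
    """Raised when the model output does not match the expected schema."""


def parse_triage_output(raw: str) -> dict:
    """Parse and validate a model response; raise instead of guessing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputShapeError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputShapeError("top-level value must be a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise OutputShapeError(f"missing fields: {sorted(missing)}")
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise OutputShapeError(f"unknown category: {category!r}")
    name = data["customer_name"]
    # Assumption: a null name is acceptable when the ticket does not state one.
    if name is not None and not isinstance(name, str):
        raise OutputShapeError("customer_name must be a string or null")
    return {"category": category, "customer_name": name}


parsed = parse_triage_output('{"category": "billing", "customer_name": "Jane"}')
```

Counting how often `OutputShapeError` is raised across the suite gives a parse-failure rate that can be tracked alongside accuracy.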
Step four: version and evaluate every change
Once the baseline is in place, every prompt change is run through the suite before merging. The version log records the change, the reason, and the eval delta. If the delta is positive, the change ships; if it is negative, the change is rejected or refined. Model upgrades are treated as prompt changes for the purposes of the suite: a new model version is run against the same suite to confirm the prompts still hold. The discipline is what separates a prompt that improves over time from one that decays.
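The merge gate described above reduces to a comparison of two scores against a threshold. This is a minimal sketch: the `gate_change` name, the `tolerance` parameter, and the message strings are assumptions, and a real CI job would read the scores from an eval run rather than take them as literals.

```python
def gate_change(baseline: float, candidate: float, threshold: float,
                tolerance: float = 0.0) -> tuple:
    """Return (allowed, reason) for a prompt or model change.

    A change is blocked if the candidate score falls below the absolute
    threshold, or if it regresses from the baseline by more than `tolerance`.
    """
    delta = candidate - baseline
    if candidate < threshold:
        return False, f"score {candidate:.2f} is below threshold {threshold:.2f}"
    if delta < -tolerance:
        return False, f"regression of {-delta:.2f} against baseline {baseline:.2f}"
    return True, f"delta {delta:+.2f}, score {candidate:.2f}"


allowed, reason = gate_change(baseline=0.93, candidate=0.95, threshold=0.92)
```

A CI step could exit non-zero whenever `allowed` is false, which is what makes an eval regression fail the build like a unit test. A model upgrade goes through the same function, with the old model's score as the baseline.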
LLM-as-judge is reliable when the judge is calibrated against a small human-rated sample. /cortex-prompt produces both the judge prompt and the calibration step, so the eval signal is trustworthy rather than vibes-based.
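One common way to run such a calibration is to compare the judge's verdicts with human ratings on the same items and compute raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The article does not say which statistic /cortex-prompt uses, so the metric choice and the sample labels below are assumptions.

```python
from collections import Counter


def agreement_rate(judge: list, human: list) -> float:
    """Share of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list, human: list) -> float:
    """Chance-corrected agreement between two raters."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Probability that both raters pick the same label by chance.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "pass", "fail", "pass"]
observed = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

A team would pick a minimum kappa before trusting the judge on the full suite. Where that bar sits depends on the task, and a real calibration sample would be far larger than four items.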
Tonone's /cortex-prompt skill designs, versions, and evaluates prompts for LLM features with eval suites that catch regressions before they reach users.
When to use /cortex-prompt, and when not to
/cortex-prompt is the right call when building or maintaining an LLM-powered feature where prompt quality affects user experience. The signal is when prompts are checked in as strings without versioning, when changes happen without eval, or when a model upgrade is causing unexplained quality changes. The skill is also the right call when an existing feature has degraded silently and the team needs to establish a quality baseline before further iteration.
Skip the skill for one-off prompts that are not in production (research scripts, internal automation that runs once a quarter). For LLM integration concerns beyond prompts (caching, fallbacks, cost controls), /cortex-integrate is the right call. For evaluating ML model performance more broadly, /cortex-eval is calibrated to that work.
| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
|---|---|---|---|
| Prompt versioned in the repository | Yes, with model and parameters | String literal in code | Single-line autocomplete |
| Eval suite generated alongside prompt | Yes, representative inputs with rubrics | No, eyeball iteration | No |
| Regression thresholds in CI | Yes, build fails below bar | No | No |
| Calibrated LLM-as-judge | Yes, with human-rated calibration sample | Vibes-based | Not in scope |
| Version log with eval deltas | Yes, every change documented | Untracked iteration | Not in scope |
A worked example: prompt for ticket triage
Suppose the task is: classify incoming support tickets into one of five categories and extract the customer name. Run /cortex-prompt and the output is the prompt plus the eval suite plus the version log.
```yaml
# prompts/triage_v3.yaml
model: claude-sonnet-4-6
temperature: 0
max_tokens: 200
system: |
  You classify support tickets into one of five categories and
  extract the customer name. Return JSON with fields {category,
  customer_name}. Categories: billing, technical, account, sales,
  other.
user_template: |
  Ticket: {{ticket_body}}
  Sender email: {{sender_email}}
```

```yaml
# evals/triage.yaml
metric: accuracy_and_f1
threshold: 0.92
examples:
  - ticket_body: "My card was charged twice last Tuesday"
    sender_email: "[email protected]"
    expected:
      category: billing
      customer_name: "Jane"
  # ...30 more cases covering the distribution
```

```markdown
# prompts/CHANGELOG.md
## v3 (2026-03-27)
Added explicit JSON schema in system prompt.
Eval delta: accuracy 0.89 -> 0.94, parse failures 8% -> 0%.
## v2 (2026-03-12)
Reverted COT in v1 because it slowed parse and lost 3pt accuracy.
Eval delta: 0.86 -> 0.89.
```

The prompt, the eval suite, and the changelog are checked in together. Any future change runs the suite and updates the changelog. A model upgrade triggers a rerun. The team always knows whether the feature is getting better or worse, and they have an artifact to point at when they need to explain why.
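At call time, the `{{ticket_body}}` and `{{sender_email}}` placeholders in the user template have to be filled from the incoming ticket. The article does not say which template engine is used, so the sketch below assumes simple double-brace substitution; the `render` helper and the sample address are hypothetical. It fails loudly on a missing variable rather than sending a half-filled prompt.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Fill {{name}} placeholders; raise KeyError if a value is missing."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing template variable: {name}")
        return str(values[name])

    return _PLACEHOLDER.sub(substitute, template)


user_template = "Ticket: {{ticket_body}}\nSender email: {{sender_email}}\n"
message = render(
    user_template,
    {
        "ticket_body": "My card was charged twice last Tuesday",
        "sender_email": "customer@example.com",  # placeholder address
    },
)
```

Running every eval example through the same `render` call keeps the suite exercising the exact string the production call site would send.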
Related skills
/cortex-prompt covers the prompt itself. For the LLM integration in the surrounding service (caching, fallbacks, cost controls), /cortex-integrate is the right call. For broader ML model evaluation, /cortex-eval produces the drift and accuracy reports.
Install
/cortex-prompt ships with the Cortex agent in the Tonone for Claude Code package. Install Tonone, invoke /cortex-prompt from any Claude Code session, and the skill produces a versioned prompt with an eval suite calibrated to the task.
1. Add to marketplace
2. Install Cortex
Prompts are part of the codebase even if they look like prose. The skill treats them that way: versioned, evaluated, and changed only when the eval supports it.
Frequently asked questions
- What does /cortex-prompt do?
- It designs, versions, and evaluates prompts for LLM features. The output includes a versioned prompt file, an eval suite with representative inputs and evaluation rubrics, and a version log that records every change with its eval delta.
- How is /cortex-prompt different from generalist AI helping me iterate on a prompt?
- A generalist suggests rewrites without measurement. /cortex-prompt builds the eval suite that tells you whether a rewrite is actually better, and gates changes in CI on the eval threshold.
- When should I use /cortex-prompt?
- When building or maintaining an LLM-powered feature in production. Skip it for one-off scripts or research notebooks where prompt regressions do not affect users.
- What evaluation methods does /cortex-prompt support?
- Exact match and rubrics for structured tasks (classification, extraction), LLM-as-judge with human-rated calibration for free-form generation, and human rating workflows for cases where automated judging is not reliable.
- Does /cortex-prompt work with multiple model providers?
- Yes. The skill is provider-agnostic and works with Claude (Anthropic), GPT (OpenAI), Gemini (Google), open-source models via vLLM or Ollama, and Vercel AI Gateway. Model name and parameters are part of the versioned prompt.
- How do I install /cortex-prompt?
- Install Tonone for Claude Code via the get-started guide at tonone.ai/get-started. /cortex-prompt ships with the Cortex agent and is invoked as a slash command in any Claude Code session. Tonone is free and MIT-licensed.
- Is /cortex-prompt free?
- Yes. The skill is part of Tonone, which is MIT-licensed. The only cost is Claude Code token usage during the work plus the LLM tokens used to run the eval suite.
- Can /cortex-prompt detect prompt regressions on a model upgrade?
- Yes. Model upgrades are treated as prompt changes; the eval suite is rerun against the new model and any regressions surface in the version log so the team can decide to update the prompt or pin the model.