The first version of an LLM-powered feature is almost always a prototype dressed in production clothing. Somebody wired the OpenAI client to a route handler, the route returns the model's output, and the feature works. The feature also has no retry logic when the API is slow, no fallback when the API is down, no caching when the same query repeats, no token budget when a malicious user starts hitting it in a loop, and no streaming when the user is waiting on a slow response. The prototype meets the demo bar. It does not meet the production bar, and the gap between the two becomes visible the first time the API has a bad day, the first invoice arrives, or the first user discovers they can extract free tokens by repeating the same prompt.
A production-ready LLM integration has structure that the prototype skipped. Retry logic with exponential backoff that distinguishes retryable from non-retryable errors. Streaming response handling that pushes tokens to the user as they arrive rather than buffering the full response. Semantic caching that returns cached results for queries similar to ones already answered. Fallback provider configuration so the feature still works when the primary provider has an outage. Token budgets per user and per request so cost cannot run away. Rate limiting per identity so abuse is bounded. The /cortex-integrate skill is built to wrap an LLM call with these layers so the feature ships ready for the second user, not just the first.
Why prototype LLM integrations are fragile
Generalist tools produce LLM integrations that mirror the SDK examples. The example code calls the API, returns the result, and ends there because that is what the SDK example demonstrates. The example is correct as a starter; it is not the integration that survives contact with real traffic. The first time the provider has elevated latency, the prototype's request hangs. The first time the provider returns a 5xx, the prototype propagates the error to the user. The first time a user enters a prompt similar to one already answered, the prototype pays for the call again because it has no cache to consult. None of these are subtle bugs; they are the standard production concerns that the prototype skipped.
The other failure mode is cost. LLM calls are not free, and a feature without token budgets can be turned into an attack vector with very little effort. A user who discovers that a prompt injection bypasses content filters and produces large outputs can rack up significant cost on the team's account before the team notices. A bug that causes a route to call the LLM in a tight loop can produce the same outcome by accident. The discipline of token budgets and rate limits is the discipline of bounding the worst case, and it is exactly what the prototype omits because the demo did not need it.
What a production LLM integration requires
A production-grade integration has six layers. First, the API client with retry: exponential backoff on retryable errors (5xx, 429), no retry on non-retryable errors (4xx other than 429), with a maximum retry count and a maximum total wait. Second, streaming response support so the user sees tokens as they are generated, with the option to abort if the request is canceled client-side. Third, semantic cache: an embedding-based cache that returns prior responses for queries similar enough to satisfy the use case, with a similarity threshold tuned to the task. Fourth, fallback provider configuration: a second provider that takes over when the primary fails, with the prompt format normalized so the fallback works without rewrites. Fifth, token budget enforcement: a per-user daily limit, a per-request limit, and a hard ceiling that triggers an alert before the team's bill runs away. Sixth, rate limiting per identity: a request-rate limit so a single user cannot dominate the queue.
Each layer is small on its own and considerable in aggregate. The integration that has all six is the one that the team can leave running over a holiday weekend without losing sleep. The integration that has none of them is the one that produces the postmortem. The discipline of including all six is what /cortex-integrate makes routine, by treating the integration as the artifact rather than the SDK call.
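The fourth layer described above can be sketched as a thin wrapper, assuming each provider is exposed as a function that takes the same normalized request. The names `ChatRequest`, `ChatResponse`, `Provider`, and `withFallback` are hypothetical and belong to no SDK. This sketch falls back per request; switching for the duration of an outage would additionally need a health check, which is left out here.

```typescript
// Hypothetical normalized shapes shared by both providers (assumptions).
interface ChatRequest {
  system: string;
  prompt: string;
  maxOutputTokens: number;
}

interface ChatResponse {
  text: string;
  provider: string; // name of the provider that actually answered
}

interface Provider {
  name: string;
  call: (req: ChatRequest) => Promise<string>;
}

function describeError(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

// Try the primary provider first; on any failure, try the fallback once.
// If both fail, throw a single error that reports both failures.
async function withFallback(
  primary: Provider,
  fallback: Provider,
  req: ChatRequest,
): Promise<ChatResponse> {
  try {
    return { text: await primary.call(req), provider: primary.name };
  } catch (primaryError) {
    try {
      return { text: await fallback.call(req), provider: fallback.name };
    } catch (fallbackError) {
      throw new Error(
        `primary ${primary.name} failed (${describeError(primaryError)}); ` +
          `fallback ${fallback.name} failed (${describeError(fallbackError)})`,
      );
    }
  }
}
```

Because both providers receive the same `ChatRequest` and the caller always gets a `ChatResponse`, the application code does not need to know which provider answered.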
How /cortex-integrate works
Step one: detect the stack and provider
Before generating any code, /cortex-integrate reads the project to detect the language, the framework, and any existing AI provider integration. The detection drives the output: a Next.js project gets the AI SDK pattern, a FastAPI project gets the Python equivalent, an existing Anthropic SDK integration is extended rather than replaced. The provider preference defaults to Vercel AI Gateway with provider-prefixed model strings, falling back to direct provider SDKs when the project uses them. The skill never assumes a specific vendor without checking.
Step two: wrap the call with retry and streaming
The integration wraps the LLM call with retry logic (exponential backoff on retryable errors, max 3 retries, jitter) and streaming response handling. Streaming is the default for any user-facing route because the perceived latency improvement is meaningful. The wrapper also handles client-side cancellation: if the user aborts, the wrapper aborts the upstream call rather than letting it complete and waste tokens.
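A minimal sketch of the retry policy described above, assuming errors carry a numeric `status` field (as many HTTP client errors do). The helper names `statusOf`, `isRetryable`, `backoffDelay`, and `withRetry` are illustrative, not taken from any SDK, and the jitter strategy shown (full jitter) is one common choice rather than necessarily the one the skill generates.

```typescript
interface RetryOptions {
  maxRetries: number; // retries allowed after the first attempt
  baseDelayMs: number; // upper bound of the delay before the first retry
  maxDelayMs: number; // cap applied to every delay
  signal?: { readonly aborted: boolean }; // an AbortSignal satisfies this shape
}

// Read an HTTP status off an unknown error, if it carries one.
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// 429 and 5xx are treated as transient. Other 4xx responses, and errors
// without a status, are not retried in this sketch.
function isRetryable(error: unknown): boolean {
  const status = statusOf(error);
  if (status === undefined) return false;
  return status === 429 || (status >= 500 && status <= 599);
}

// Full jitter: a random delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  opts: RetryOptions,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `fn`, retrying transient failures until the retry count is exhausted
// or the caller's signal reports cancellation.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    if (opts.signal && opts.signal.aborted) throw new Error("aborted by caller");
    try {
      return await fn();
    } catch (error) {
      if (attempt >= opts.maxRetries || !isRetryable(error)) throw error;
      await sleep(backoffDelay(attempt, opts));
    }
  }
}
```

With `maxRetries: 3`, a call that keeps returning 503 is attempted four times in total, while a 400 is surfaced immediately after the first attempt.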
Step three: semantic cache and fallback
The semantic cache is configured against the project's existing cache layer (Redis if available, otherwise the cache that the framework provides). The cache key is an embedding of the prompt; lookups use cosine similarity with a configurable threshold. The fallback provider is configured for graceful degradation: if the primary provider is unhealthy, the wrapper switches to the fallback for the duration of the outage. The prompt format is normalized so both providers receive equivalent input; the response format is normalized so the application sees the same shape regardless of which provider answered.
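The lookup described above can be illustrated with a small in-memory version. This is a sketch under stated assumptions: embeddings are computed elsewhere by whatever embedding model the project uses, the store is a plain array scanned linearly rather than Redis or a vector index, and `InMemorySemanticCache` is a hypothetical name.

```typescript
type Embedding = number[];

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry {
  embedding: Embedding;
  value: string;
  expiresAt: number; // epoch milliseconds
}

class InMemorySemanticCache {
  private entries: CacheEntry[] = [];

  constructor(private readonly threshold: number) {}

  write(embedding: Embedding, value: string, ttlSeconds: number, now: number = Date.now()): void {
    this.entries.push({ embedding, value, expiresAt: now + ttlSeconds * 1000 });
  }

  // Return the most similar unexpired entry whose similarity is at or
  // above the configured threshold, or undefined on a miss.
  lookup(embedding: Embedding, now: number = Date.now()): string | undefined {
    let bestValue: string | undefined;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      if (entry.expiresAt <= now) continue;
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score >= bestScore) {
        bestValue = entry.value;
        bestScore = score;
      }
    }
    return bestValue;
  }
}
```

The threshold is the tuning knob: set too low, users receive answers to questions they did not ask; set too high, the cache rarely hits and the saving disappears.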
Step four: token budget and rate limit
Token budgets are enforced per user (daily limit, configurable), per request (max input plus output, calibrated to the task), and globally (a hard ceiling that triggers an alert). Rate limiting is per identity: a request-rate limit per user so a single user cannot saturate the queue. Both budgets and rate limits are surfaced as 429 responses to the client with the right headers (Retry-After, X-RateLimit-Remaining), so the client can back off without a redesign.
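To make the enforcement concrete, here is a hedged sketch of a per-identity rate limiter and a per-user token budget. It assumes a fixed-window counter and in-memory maps; `FixedWindowRateLimiter`, `DailyTokenBudget`, and `Decision` are hypothetical names, a multi-instance deployment would need a shared store, and the daily reset of the budget counters is omitted for brevity.

```typescript
interface Decision {
  allowed: boolean;
  status: 200 | 429;
  headers: Record<string, string>;
}

// Fixed-window request counter per identity (in-memory, single process).
class FixedWindowRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();

  constructor(
    private readonly limit: number,
    private readonly windowMs: number,
  ) {}

  check(identity: string, now: number = Date.now()): Decision {
    let current = this.windows.get(identity);
    // Start a fresh window on first sight or once the old window has elapsed.
    if (!current || now - current.start >= this.windowMs) {
      current = { start: now, count: 0 };
      this.windows.set(identity, current);
    }
    if (current.count >= this.limit) {
      const retryAfterSec = Math.ceil((current.start + this.windowMs - now) / 1000);
      return {
        allowed: false,
        status: 429,
        headers: { "Retry-After": String(retryAfterSec), "X-RateLimit-Remaining": "0" },
      };
    }
    current.count += 1;
    return {
      allowed: true,
      status: 200,
      headers: { "X-RateLimit-Remaining": String(this.limit - current.count) },
    };
  }
}

// Per-user token budget with a per-request ceiling.
class DailyTokenBudget {
  private used = new Map<string, number>();

  constructor(
    private readonly dailyLimit: number,
    private readonly perRequestLimit: number,
  ) {}

  // Called before the LLM call with an estimate of the tokens it will use.
  check(userId: string, estimatedTokens: number): boolean {
    if (estimatedTokens > this.perRequestLimit) return false;
    return (this.used.get(userId) || 0) + estimatedTokens <= this.dailyLimit;
  }

  // Called after the LLM call with the actual usage.
  commit(userId: string, actualTokens: number): void {
    this.used.set(userId, (this.used.get(userId) || 0) + actualTokens);
  }
}
```

A route handler would copy `status` and `headers` from a denied `Decision` onto its response, so a well-behaved client can wait for the `Retry-After` interval before trying again.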
The prompt-injection vector that turns into a cost attack is real and underestimated. /cortex-integrate sets a hard per-request output ceiling so a malicious prompt cannot generate a 100k-token completion. The ceiling is calibrated to the task and surfaced as a configurable parameter.
Tonone's /cortex-integrate skill wraps LLM calls with the layers that make integrations production-ready: retry, streaming, semantic cache, fallback provider, token budgets, rate limits.
When to use /cortex-integrate, and when not to
/cortex-integrate is the right call when adding an LLM-powered feature to an existing product or when an existing prototype needs to be hardened to production standards. The signal is when the integration is missing any of the six layers (no caching, no fallback, no rate limit, no token budget, no streaming, no retry) and the feature is going to a real user. The skill is also the right call when an existing integration is expensive or unreliable; the diagnostic step identifies which layer is missing or misconfigured.
Skip the skill for one-off scripts and internal automation that runs once a week (the production layers are overhead). For prompt design specifically, /cortex-prompt is the right call. For broader ML model evaluation and drift detection, /cortex-eval is calibrated to that work.
| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
|---|---|---|---|
| Retry with exponential backoff | Yes, with retryable/non-retryable distinction | Often missing | SDK example only |
| Streaming response handling | Yes, with client-side cancellation | Often buffered | Whatever the SDK example does |
| Semantic cache for similar queries | Yes, embedding-based with tunable threshold | Not in scope | Not in scope |
| Fallback provider configuration | Yes, normalized prompt and response | Single provider | Single provider |
| Token budgets and rate limits | Yes, per-user and global ceilings | Not in scope | Not in scope |
A worked example: integrating an AI summarizer
Suppose the brief is: add a meeting-notes summarizer to an existing Next.js application. Run /cortex-integrate and the output is the wrapper plus the supporting middleware.

    // lib/ai/summarize.ts
    import { streamText } from 'ai';
    import { semanticCache } from '@/lib/ai/cache';
    import { tokenBudget } from '@/lib/ai/budget';
    import { providerFallback } from '@/lib/ai/fallback';

    // Placeholder: the listing references this constant without defining it.
    const SUMMARIZE_SYSTEM_PROMPT = 'Summarize the meeting notes concisely.';

    export async function summarizeMeetingNotes(
      userId: string,
      notes: string,
      signal?: AbortSignal,
    ) {
      // Check the budget before spending anything
      // (rough heuristic: about 4 characters per token).
      await tokenBudget.check(userId, {
        estimatedInputTokens: Math.ceil(notes.length / 4),
      });

      // Return a prior summary if a sufficiently similar one is cached.
      const cached = await semanticCache.lookup({
        namespace: 'meeting-summary',
        query: notes,
        threshold: 0.93,
      });
      if (cached) return cached;

      const result = streamText({
        model: providerFallback('anthropic/claude-sonnet-4-6', {
          fallback: 'openai/gpt-4o-mini',
        }),
        system: SUMMARIZE_SYSTEM_PROMPT,
        prompt: notes,
        maxOutputTokens: 800, // hard ceiling per request
        abortSignal: signal,
        experimental_telemetry: { isEnabled: true },
      });

      // Consume the stream and accumulate the text for the cache write.
      const accumulator: string[] = [];
      for await (const chunk of result.textStream) {
        accumulator.push(chunk);
      }
      const final = accumulator.join('');

      // `usage` is a promise that resolves once the stream has finished.
      const usage = await result.usage;
      await tokenBudget.commit(userId, {
        inputTokens: usage.inputTokens ?? 0,
        outputTokens: usage.outputTokens ?? 0,
      });

      await semanticCache.write({
        namespace: 'meeting-summary',
        query: notes,
        value: final,
        ttl: 86400, // 24 hours, assuming the TTL is in seconds
      });
      return final;
    }

The wrapper enforces the budget before the call, checks the semantic cache, runs through the fallback-aware provider, caps the output tokens, reads the response as a stream, and commits the actual token usage to the budget after the call completes. As written, it accumulates the chunks and returns the full text; a user-facing route would forward the chunks to the client while accumulating them for the cache write. Each layer is testable in isolation and operationally observable. The wrapper is the integration; the rest of the application calls summarizeMeetingNotes and treats it like any other async function.
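For a user-facing route, the accumulation step can be combined with forwarding so that the client sees chunks as they arrive. The following is a minimal sketch, not part of the generated wrapper: `forwardAndCollect` is a hypothetical helper that works with any `AsyncIterable<string>`, such as a text stream from an SDK.

```typescript
// Yield each chunk to the consumer as it arrives, and hand the full text to
// `onComplete` once the source is exhausted (for example, to write a cache).
// If the consumer stops early, `onComplete` is not called, so a partial
// response is never cached.
async function* forwardAndCollect(
  source: AsyncIterable<string>,
  onComplete: (fullText: string) => Promise<void> | void,
): AsyncGenerator<string, void, void> {
  const chunks: string[] = [];
  for await (const chunk of source) {
    chunks.push(chunk);
    yield chunk;
  }
  await onComplete(chunks.join(""));
}
```

The design choice worth noting is the early-exit behavior: when the client aborts mid-stream, the generator is closed at the `yield`, the code after the loop never runs, and neither the cache write nor any other completion step fires on a truncated response.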
Related skills
/cortex-integrate covers the integration layer. For prompt design, versioning, and evaluation, /cortex-prompt is the right call. For ML model evaluation and drift detection, /cortex-eval produces the analysis. For an inventory of an existing ML/AI stack, /cortex-recon is the right entry point.
Install
/cortex-integrate ships with the Cortex agent in the Tonone for Claude Code package. Install Tonone, invoke /cortex-integrate from any Claude Code session, and the skill produces a production-grade LLM integration calibrated to the project's stack and provider.
1. Add to marketplace
2. Install Cortex
Production LLM features are the prototype plus six layers. The skill is built so the layers are the default, not the cleanup pass that gets scheduled after the bill arrives.
Frequently asked questions
- What does /cortex-integrate do?
- It wraps an LLM call with the layers that make the integration production-ready: retry with exponential backoff, streaming response handling, semantic cache, fallback provider, token budgets per user and globally, and rate limiting per identity.
- How is /cortex-integrate different from using an SDK directly?
- SDKs provide the call. The production concerns (retry, cache, fallback, budgets, rate limits) are the integration's responsibility. /cortex-integrate produces all of those as the default rather than leaving them as future work.
- When should I use /cortex-integrate?
- When adding an LLM-powered feature to a real product, or when hardening an existing prototype to production standards. Skip it for one-off scripts where the production layers are overhead.
- What providers does /cortex-integrate support?
- Vercel AI Gateway is the default with provider-prefixed model strings (Anthropic, OpenAI, Google, etc.). Direct provider SDKs are supported when the project already uses them.
- Does /cortex-integrate handle streaming?
- Yes. Streaming is the default for user-facing routes because the perceived latency is meaningfully better. The wrapper also handles client-side cancellation so an aborted request does not waste tokens.
- How do I install /cortex-integrate?
- Install Tonone for Claude Code via the get-started guide at tonone.ai/get-started. /cortex-integrate ships with the Cortex agent and is invoked as a slash command in any Claude Code session. Tonone is free and MIT-licensed.
- Is /cortex-integrate free?
- Yes. The skill is part of Tonone, which is MIT-licensed. The only cost is Claude Code token usage during the work plus the LLM tokens used by the feature in production.
- Does /cortex-integrate prevent prompt-injection cost attacks?
- It bounds the impact: hard per-request output ceilings prevent a malicious prompt from generating very large completions, token budgets prevent a single user from racking up cost, and rate limits prevent a single user from dominating the queue.