Most ML models that get prototyped never make it to production, and most that do make it to production are running off scripts that nobody can reproduce. The reason is the gap between a notebook and a pipeline. A notebook is a sequence of cells that ran once on the analyst's laptop. A pipeline is a versioned, scheduled, reproducible system: ingestion that handles new data, feature engineering that runs the same way at training and inference, training that produces an artifact you can compare to previous artifacts, evaluation that catches regressions, and a serving layer that exposes the model with the same monitoring everything else in the stack has. Bridging that gap is the work that ML practitioners call MLOps, and it is the work that gets skipped because the notebook produced a result and the result felt like the deliverable.
That shortcut comes due the first time the model needs to be retrained, the first time the team wants to compare two model versions, and the first time the model's accuracy degrades silently in production because nobody is watching. Each of these recovery moments costs more than building the pipeline correctly would have, and the cumulative cost is why teams that take ML seriously build the pipeline early. The /cortex-model skill is built to produce that pipeline as the default rather than the cleanup pass: data ingestion with validation, feature engineering with a feature store, training with cross-validation and hyperparameter tuning, evaluation against a held-out test set, and deployment to a serving endpoint with monitoring.
Why generalist AI ships notebooks instead of pipelines
Ask Cursor or ChatGPT to build a classifier for your data. You get a notebook. The notebook reads the data with pandas, splits it into train and test, trains a model, and prints accuracy. The notebook is correct as a prototype; it is not a pipeline. The data ingestion is hardcoded to a local CSV path. The feature engineering is inlined into the training cell, so it cannot run at inference time without copy-pasting. The model is saved as an unversioned serialized blob. The evaluation is a single accuracy number from one random split, with no cross-validation and no comparison to a previous version. There is no serving layer. There is no monitoring. The notebook works for the demo and breaks the moment somebody tries to use the model for anything real.
The deeper issue is that the notebook represents only about ten percent of the work of running ML in production. The other ninety percent is the engineering around the model: the feature store that ensures train/serve consistency, the experiment tracking that lets the team compare model versions, the deployment pipeline that ships the model to a serving endpoint, the monitoring that alerts when accuracy or input distribution drifts. A generalist tool produces the ten-percent prototype because that is what the prompt asks for. /cortex-model produces the ninety percent that makes the prototype useful.
What an end-to-end pipeline requires
A useful ML pipeline has six layers:
1. Data ingestion: the pipeline reads from the source of truth (data warehouse, event stream, application database) and validates the data before training.
2. Feature engineering with a feature store: features are computed once and reused at both training and inference, so the model is not silently broken by train/serve skew.
3. Training: cross-validation and hyperparameter search, with experiment tracking so the team can compare model versions deliberately.
4. Evaluation: a held-out test set, with the metrics calibrated to the actual business outcome (precision/recall trade-offs that match the cost of false positives and false negatives).
5. Deployment: a serving layer that exposes the model with versioning, with the option to A/B test new versions against the current one.
6. Monitoring: input distribution drift detection, prediction distribution monitoring, and accuracy tracking against ground truth as it becomes available.
Each layer is its own discipline. Skipping any of them creates a specific failure mode: skip ingestion validation and the model gets trained on bad data, skip the feature store and you get train/serve skew, skip evaluation rigor and you ship regressions, skip monitoring and you discover the silent decay only when a customer reports a bad prediction. Building all six together is the discipline; doing it cheaply is what /cortex-model is built for.
How /cortex-model works
Step one: characterize the problem
When invoked, /cortex-model asks for the problem in concrete terms: what is being predicted, what input data is available, what success looks like, and what a false positive costs relative to a false negative. The answers determine the model type (classification, regression, ranking, anomaly detection), the evaluation metrics, and the architectural decisions (whether a feature store is needed, whether inference must be sub-100ms, whether the model needs retraining weekly or quarterly).
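One way to make the error-cost question concrete is to turn the answers into a decision threshold. The sketch below is an illustration only, not the skill's actual output: `ProblemSpec` and its fields are hypothetical names, the cost figures are made up, and the formula assumes the model emits calibrated probabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical record of the answers gathered in step one."""
    target: str                 # what is being predicted
    task: str                   # "classification", "regression", ...
    cost_false_positive: float  # cost of acting on a wrong positive
    cost_false_negative: float  # cost of missing a true positive

    def decision_threshold(self) -> float:
        """Probability at or above which predicting positive is cheaper.

        Predicting positive costs (1 - p) * cost_fp in expectation and
        predicting negative costs p * cost_fn; the two are equal at
        p = cost_fp / (cost_fp + cost_fn).
        """
        total = self.cost_false_positive + self.cost_false_negative
        if total <= 0:
            raise ValueError("error costs must be positive")
        return self.cost_false_positive / total


# Illustrative numbers: a wasted retention offer vs. a lost customer.
spec = ProblemSpec(
    target="churned",
    task="classification",
    cost_false_positive=10.0,
    cost_false_negative=90.0,
)
threshold = spec.decision_threshold()
print(f"flag a customer when P(churn) >= {threshold:.2f}")
```

When a missed positive is much more expensive than a false alarm, the threshold drops well below 0.5, which is why the cost question is asked before any metric is chosen.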
Step two: ingestion and feature engineering
The skill produces the data ingestion layer (Airflow, Dagster, Prefect, or the project's existing orchestrator) with validation rules that catch malformed inputs before training. Feature engineering is implemented in a feature store (Feast, Tecton, or a project-specific approach) so the same feature definitions run at training and inference. The discipline is to make the features reusable across model versions rather than reimplementing them per-model.
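As a rough illustration of what "validation rules that catch malformed inputs" can mean, here is a minimal, framework-free sketch. The rule set, the allowed `plan_tier` values, and the `validate_rows` helper are all assumptions made for this example; a real pipeline would more likely express the same checks in its orchestrator or a dedicated validation library.

```python
from typing import Any, Callable

# Hypothetical rule set: column name -> predicate every row must satisfy.
RULES: dict[str, Callable[[Any], bool]] = {
    "customer_id": lambda v: isinstance(v, str) and v != "",
    "days_since_signup": lambda v: isinstance(v, int) and v >= 0,
    "plan_tier": lambda v: v in {"free", "pro", "enterprise"},
}


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return one message per violated rule; an empty list means a clean batch."""
    errors = []
    for i, row in enumerate(rows):
        for column, is_valid in RULES.items():
            if column not in row:
                errors.append(f"row {i}: missing column {column!r}")
            elif not is_valid(row[column]):
                errors.append(f"row {i}: bad value for {column!r}: {row[column]!r}")
    return errors


batch = [
    {"customer_id": "c-1", "days_since_signup": 42, "plan_tier": "pro"},
    {"customer_id": "c-2", "days_since_signup": -3, "plan_tier": "pro"},
    {"customer_id": "", "days_since_signup": 7},
]
errors = validate_rows(batch)
for message in errors:
    print(message)
```

The point of the design is that the ingestion step fails (or quarantines the batch) when `errors` is non-empty, so training never sees the bad rows.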
Step three: training and evaluation
Training uses cross-validation with a fold count chosen for the dataset size. Hyperparameter search uses grid search when the search space is small and discrete, and Bayesian optimization when it is large or continuous. Experiment tracking (MLflow, Weights & Biases, or the project's existing tool) records every run with the hyperparameters, the dataset version, the metrics, and the artifact. Evaluation uses a held-out test set with metrics calibrated to the cost of errors: precision-at-k for ranking, F1 for classification with imbalanced classes, and MAE or RMSE for regression, chosen by how heavily large errors should be penalized.
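To see why accuracy alone is a poor metric on imbalanced classes, consider the small sketch below. The data is synthetic and the helper `precision_recall_f1` is written out by hand for illustration; in practice a library implementation would normally be used.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Synthetic, imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that never predicts the positive class.
always_negative = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_negative)
print(f"accuracy={accuracy:.2f} f1={f1:.2f}")
```

The degenerate model scores high on accuracy while finding none of the positives, which is the failure a single accuracy number hides.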
Step four: deployment and monitoring
The model is deployed to a serving endpoint with versioning so a new model can be deployed alongside the current one and traffic can be shifted gradually. Monitoring covers input distribution (Kolmogorov-Smirnov test on each feature), prediction distribution (changes in the histogram of predictions), and accuracy as ground truth arrives. Alerts fire when any metric crosses a threshold so the team catches the silent decay before the customer reports it.
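The input-drift check can be sketched with a hand-written two-sample Kolmogorov-Smirnov statistic. This is a minimal illustration under stated assumptions: the `DRIFT_THRESHOLD` value and the `has_drifted` helper are hypothetical, the samples are synthetic, and the statistic is only meaningful for numeric features (categorical features need a different test).

```python
from bisect import bisect_right


def ks_statistic(reference: list[float], live: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref_sorted, live_sorted = sorted(reference), sorted(live)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in ref_sorted + live_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_live = bisect_right(live_sorted, x) / len(live_sorted)
        max_gap = max(max_gap, abs(cdf_ref - cdf_live))
    return max_gap


DRIFT_THRESHOLD = 0.2  # hypothetical alert level; tune per feature


def has_drifted(reference: list[float], live: list[float]) -> bool:
    return ks_statistic(reference, live) > DRIFT_THRESHOLD


# Synthetic feature values: training-time sample vs. a shifted live sample.
training_sample = [float(i) for i in range(100)]
shifted_sample = [x + 50.0 for x in training_sample]

stat_same = ks_statistic(training_sample, training_sample)
stat_shifted = ks_statistic(training_sample, shifted_sample)
print(stat_same, stat_shifted, has_drifted(training_sample, shifted_sample))
```

A production monitor would more likely call a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value; the pure-Python version here just keeps the example self-contained.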
Train/serve skew is the most common production ML bug and the hardest to debug. /cortex-model uses a feature store so the same feature code runs in both contexts; this single decision prevents most of the bugs that catch teams the first time they put a model in production.
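The principle behind that decision can be shown without any feature-store library: define each feature once and call the same definition from both paths. The sketch below is not Feast's or Tecton's API; the function names, record layout, and dates are assumptions made for illustration, with feature names borrowed from the churn example later in this article.

```python
from datetime import date

WINDOW_DAYS = 30  # window for the "_30d" feature


def customer_features(signup_date: date, failed_invoices: list[date], as_of: date) -> dict[str, int]:
    """Single definition of the features, shared by training and serving."""
    return {
        "days_since_signup": (as_of - signup_date).days,
        "invoice_failure_count_30d": sum(
            1 for d in failed_invoices if 0 <= (as_of - d).days < WINDOW_DAYS
        ),
    }


def training_row(record: dict, label: int) -> dict:
    # Training computes features as of the label date, not "now".
    feats = customer_features(record["signup_date"], record["failed_invoices"], record["label_date"])
    return {**feats, "churned": label}


def serving_features(record: dict, today: date) -> dict:
    # Serving calls the same function; only the as-of date differs.
    return customer_features(record["signup_date"], record["failed_invoices"], today)


record = {
    "signup_date": date(2025, 1, 1),
    "failed_invoices": [date(2025, 1, 15), date(2025, 3, 1), date(2025, 3, 20)],
    "label_date": date(2025, 3, 25),
}
train = training_row(record, label=1)
serve = serving_features(record, today=date(2025, 3, 25))
print(train, serve)
```

Because both paths go through `customer_features`, a change to the window or the counting logic cannot land in training without also landing in serving. A feature store adds storage, point-in-time joins, and low-latency lookup on top of this same idea.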
Tonone's /cortex-model skill builds end-to-end ML pipelines: data ingestion with validation, feature engineering with a feature store, training with cross-validation and hyperparameter tuning, evaluation, deployment, and monitoring.
When to use /cortex-model, and when not to
/cortex-model is the right call when building a prediction, classification, or regression model from labeled data for the first time and the team wants a complete pipeline rather than a notebook. The skill is also the right call when an existing model is running as a script and needs proper versioning, evaluation infrastructure, and a serving layer.
Skip the skill for LLM-powered features (use /cortex-prompt for prompt design and /cortex-integrate for production integration). For pure exploratory data analysis without a deployment target, a notebook is fine. For evaluation of an existing production model (drift detection, accuracy tracking), /cortex-eval is the right call.
| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
|---|---|---|---|
| Reproducible data ingestion | Yes, orchestrated and validated | Hardcoded local paths | Not in scope |
| Feature store for train/serve consistency | Yes, prevents skew | Inline feature code | Not in scope |
| Cross-validation and hyperparameter search | Yes, calibrated to dataset | Single train/test split | Not in scope |
| Experiment tracking | Yes, MLflow / W&B integration | No tracking | Not in scope |
| Serving + monitoring with drift detection | Yes, by default | Unversioned blob in a folder | Not in scope |
A worked example: churn prediction pipeline
Suppose the brief is: build a churn prediction model. Run /cortex-model and the output is the pipeline plus the supporting artifacts.
```python
# pipelines/churn/train.py (excerpt)
# read_warehouse, optuna_search, cross_validate_with_calibration, and
# evaluate_on_holdout are project helpers defined elsewhere in the pipeline.
from feast import FeatureStore
import mlflow
import mlflow.sklearn
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier


def train(run_name: str, label_query: str):
    fs = FeatureStore(repo_path='features')

    # 1. Load training data + features from the feature store
    labels = read_warehouse(label_query)
    features = fs.get_historical_features(
        entity_df=labels,
        features=[
            'customer_features:days_since_signup',
            'customer_features:invoice_failure_count_30d',
            'customer_features:plan_tier',
            # ...12 more
        ],
    ).to_df()
    # Entity key and timestamp columns carried over from entity_df would
    # also need dropping here; their names depend on the label query.
    X, y = features.drop(columns=['churned']), features['churned']

    # 2. Cross-validation with hyperparameter search (Optuna)
    best_params = optuna_search(X, y, n_trials=50)

    # 3. Cross-validate, then train the final model on the full set with best params
    model = GradientBoostingClassifier(**best_params)
    cv_scores = cross_validate_with_calibration(model, X, y, cv=StratifiedKFold(5))
    model.fit(X, y)

    # 4. Held-out test set evaluation
    test_metrics = evaluate_on_holdout(model, holdout_set='2026-Q1')

    # 5. Track to MLflow with versioned artifact
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(best_params)
        mlflow.log_metrics(test_metrics)
        mlflow.sklearn.log_model(model, 'model', registered_model_name='churn-v3')
    return model


# pipelines/churn/serve.py (excerpt)
# Serves the registered model via FastAPI with feature lookup
# from the same feature store, so train/serve features match.
# Monitoring sidecar logs feature distributions for drift detection.
```
The pipeline is reproducible. The features come from the same feature store at training and inference. The training is tracked in MLflow so the team can compare runs. Drift monitoring runs alongside serving. When the model needs to be retrained, the pipeline runs the same way as the original training; when a new version needs to ship, the model registry handles versioning. That is what crossing from notebook to production looks like.
Related skills
/cortex-model builds the pipeline. For LLM-powered features, /cortex-prompt covers prompt design and /cortex-integrate covers production integration. For evaluation of an existing model, /cortex-eval produces the drift and accuracy reports.
Install
/cortex-model ships with the Cortex agent in the Tonone for Claude Code package. Install Tonone, invoke /cortex-model from any Claude Code session, and the skill produces the end-to-end pipeline calibrated to the project's data and serving infrastructure.
1. Add to marketplace
2. Install Cortex
ML pipelines that survive contact with production are the ones that did the engineering work upfront. The skill is built so that work is the default, not the cleanup.
Frequently asked questions
- What does /cortex-model do?
- It builds an end-to-end ML pipeline: data ingestion with validation, feature engineering with a feature store, training with cross-validation and hyperparameter search, evaluation against a held-out test set, deployment to a serving endpoint, and monitoring with drift detection.
- What model types does /cortex-model support?
- Classification, regression, ranking, and anomaly detection. The skill picks the right model family based on the problem characterization (data shape, success criteria, error costs).
- How is /cortex-model different from a generalist building a model?
- A generalist produces a notebook. /cortex-model produces the engineering layer around the model: orchestrated ingestion, feature store, experiment tracking, deployment with versioning, and monitoring with drift detection.
- When should I use /cortex-model?
- When building a prediction or classification model from labeled data for the first time and you want a production pipeline rather than a notebook. Also when an existing model needs proper versioning, evaluation, and serving infrastructure.
- What feature stores does /cortex-model support?
- Feast (open-source, recommended for greenfield), Tecton (managed), and project-specific approaches when those are already in use. The skill matches the existing tool rather than imposing a new one.
- How do I install /cortex-model?
- Install Tonone for Claude Code via the get-started guide at tonone.ai/get-started. /cortex-model ships with the Cortex agent and is invoked as a slash command in any Claude Code session. Tonone is free and MIT-licensed.
- Is /cortex-model free?
- Yes. The skill is part of Tonone, which is MIT-licensed. The only cost is Claude Code token usage during the work plus the compute cost of training and serving.
- Does /cortex-model handle drift detection?
- Yes. Deployed models include input distribution monitoring (Kolmogorov-Smirnov test on each feature) and prediction distribution monitoring, with alerts when distributions shift beyond a threshold.