A field guide to the /proof-e2e skill

AI End-to-End Test Builder (Playwright)

Most AI tools write E2E tests that break on the first UI refactor. /proof-e2e covers user journeys with stable selectors, proper waits, and CI integration that runs on every merge.

Proof · QA & Testing · 11 min read · March 28, 2026

End-to-end tests have a reputation problem. Many engineering teams that tried them eventually gave up and removed them. The reason is almost always the same: the tests were brittle, the selectors broke on every UI tweak, the suite took twenty minutes to run, the CI signal was unreliable, and after a few months of false alarms the team started adding .skip to the failing tests instead of fixing them. The tests stopped catching real regressions because nobody trusted them, and the suite got deleted in a cleanup PR with the apologetic title "remove flaky e2e tests." That sequence is so common that many teams treat E2E coverage as a hazard rather than an asset, which means the most important user flows in the product, the ones that have to work for the company to make money, are the ones running without automated coverage.

The reputation problem is fixable. The teams who run E2E suites successfully do a few specific things differently: they test user journeys rather than implementation details, they use selectors that survive UI refactors, they wait for the right things rather than sleeping for arbitrary durations, and they run the suite on every merge so failures are caught immediately rather than discovered weeks later. The discipline is well-known. It is also the kind of discipline that mainstream AI coding tools fail to apply, because they generate tests that mirror the implementation instead of the user behavior. The /proof-e2e skill is built around the discipline: it writes tests that survive UI changes because the tests describe what the user is doing, not how the page is currently structured.

Why generalist AI ships flaky tests

Ask Cursor or ChatGPT to write a test for a sign-up form, and you get a test that selects the form by class name or by the structure of the DOM tree, fills in the inputs by their position, and asserts on the visible text. The test passes when the page is exactly as it was when the test was written. It fails the moment a designer changes a class name, the moment the form gets wrapped in an additional container div, the moment the assertion text gets a typo fix. The brittleness is not in the model's understanding; it is in the heuristic the model uses for selectors. Generalist tools default to the most specific selector available because that is what is most visible in the DOM at the moment of generation. Specific selectors are the ones that break first.

The other failure mode is the wait pattern. A test that needs to wait for an async operation to complete should wait for the *result* of the operation: a specific element appearing, a network response landing, a state transition completing. Generalist tools generate await page.waitForTimeout(2000) because that is the most visible pattern in their training data. The arbitrary timeout works on the developer's laptop, fails on a slow CI runner, and produces the kind of flake that erodes trust faster than any other single bug. A test that randomly fails one in twenty runs is worse than no test at all, because it teaches the team to ignore failures.

What durable E2E testing actually requires

Durable E2E tests have four properties. First, they describe user journeys, not implementation. A sign-up test verifies that a new user can create an account, see the welcome screen, and reach the first feature; it does not verify that the email field is the second input or that the submit button has class btn-primary. Second, they use selectors that survive UI refactors. The two reliable strategies are role-based selectors (Playwright's getByRole('button', { name: 'Sign up' })) and explicit data-testid attributes added to the markup. Both are stable because they describe semantic intent rather than current structure. Third, they wait for the right things, never for arbitrary durations. Fourth, they run on every merge, with CI integration that fails the build on a real failure but is robust to the kind of network blips that should not block a deploy.

The harder part of durable E2E is choosing what to test. Not every flow needs E2E coverage; the cost-benefit only works for the flows that are critical to the product. The right shape of an E2E suite is a small number of tests covering the highest-value journeys: sign-up, log-in, the core feature activation, the checkout flow if there is one, the data export if it exists. Each of those tests is a complete journey from a user's first action to a verifiable outcome. Tests that cover edge cases of individual components belong in unit or integration tests; tests that cover "the homepage renders" belong in smoke tests. E2E is for the journeys, and the suite stays maintainable because the journeys are few.

How /proof-e2e works

Step one: identify the critical journeys

Before writing any tests, /proof-e2e reads the application to identify the critical user journeys. It looks for the entry points (sign-up, log-in, OAuth flows), the activation moment (the first time a user reaches the core feature), the conversion moment (checkout, paid plan upgrade), and the data flows that carry product value (data export, integrations). The list of journeys becomes the test plan. The skill is opinionated about keeping this list small: between five and fifteen journeys for most products, with explicit reasoning for why each one is on the list.
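One way to picture the resulting test plan is as a small typed list. This is only a sketch of the idea, not the skill's actual output format: the `Journey` type, its field names, the sample journeys, and the `checkPlan` helper are all hypothetical.

```typescript
// Hypothetical shape for a journey plan; the skill's real output may differ.
type JourneyKind = 'entry' | 'activation' | 'conversion' | 'data-flow';

interface Journey {
  name: string;
  kind: JourneyKind;
  // Why this journey earns a slot in the E2E suite.
  reason: string;
}

// Sample plan for an imaginary product.
const journeys: Journey[] = [
  { name: 'sign-up', kind: 'entry', reason: 'No account means no usage.' },
  { name: 'log-in', kind: 'entry', reason: 'Returning users must get back in.' },
  { name: 'first project created', kind: 'activation', reason: 'First contact with the core feature.' },
  { name: 'upgrade to paid plan', kind: 'conversion', reason: 'Direct revenue impact.' },
  { name: 'data export', kind: 'data-flow', reason: 'Customers rely on getting data out.' },
];

// Returns a list of problems. An empty list means the plan fits the
// five-to-fifteen guideline and every journey states its reason.
function checkPlan(plan: Journey[]): string[] {
  const problems: string[] = [];
  if (plan.length < 5 || plan.length > 15) {
    problems.push(`expected 5-15 journeys, got ${plan.length}`);
  }
  for (const journey of plan) {
    if (journey.reason.trim() === '') {
      problems.push(`journey "${journey.name}" has no reason`);
    }
  }
  return problems;
}

console.log(checkPlan(journeys)); // prints [] for the five-journey plan above
```

Keeping the reason next to each journey makes it harder for the list to grow without someone arguing for the addition.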

Step two: write tests with stable selectors

Each journey gets a test written with role-based selectors as the primary strategy. Where a role-based selector is ambiguous, the skill adds a data-testid attribute to the markup and uses that. The selector strategy is documented in the test file so the next person to extend the suite knows the convention. Tests that need to interact with elements that have no obvious role (custom components without ARIA roles) get the data-testid treatment plus a comment explaining why. The discipline of selector choice is what makes the suite survive UI refactors; without it, every CSS change becomes a test maintenance task.
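A role-first convention with a test-id fallback might look roughly like the helper below. It is a sketch, not the skill's generated code: the `signUpForm` helper, the strength-meter widget, and the `password-strength` test id are assumptions made for illustration. Only Playwright's `getByLabel`, `getByRole`, and `getByTestId` locator methods are taken as given.

```typescript
import type { Locator, Page } from '@playwright/test';

// Selector convention for this suite:
//   1. Prefer user-facing locators: getByRole, getByLabel.
//   2. Fall back to data-testid only where no unambiguous role exists,
//      and leave a comment explaining why.
export function signUpForm(page: Page): {
  email: Locator;
  password: Locator;
  submit: Locator;
  strengthMeter: Locator;
} {
  return {
    email: page.getByLabel('Email'),
    password: page.getByLabel('Password'),
    submit: page.getByRole('button', { name: 'Create account' }),
    // Hypothetical custom widget built from plain <div>s with no ARIA role,
    // so a data-testid="password-strength" attribute is assumed in the markup.
    strengthMeter: page.getByTestId('password-strength'),
  };
}
```

A spec would then call `signUpForm(page).email.fill(...)` and never mention a class name or DOM position, which is why a change to the CSS alone does not touch the test.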

Step three: the right waits

Every async operation in a /proof-e2e test waits for a specific event. A form submit waits for the network response to complete and the success state to render. A navigation waits for the new page's primary content to appear. A loading state waits for the loading indicator to disappear, not for an arbitrary number of milliseconds. The skill never generates waitForTimeout calls; the wait is always for a verifiable condition. This is the difference between a test that passes consistently and a test that flakes once a week.
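The three waits described above can be sketched as a single helper. The `/api/sign-up` path, the `progressbar` loading indicator, and the helper name are assumptions; `waitForResponse` and `locator.waitFor` are standard Playwright calls.

```typescript
import type { Page } from '@playwright/test';

// Submits the sign-up form and waits for verifiable conditions only.
// '/api/sign-up' and the progressbar role are hypothetical details of the app.
export async function submitAndWaitForSuccess(page: Page): Promise<void> {
  // Start listening before the click so the response cannot be missed.
  const responsePromise = page.waitForResponse(
    (res) =>
      res.url().includes('/api/sign-up') &&
      res.request().method() === 'POST',
  );
  await page.getByRole('button', { name: 'Create account' }).click();

  // 1. The network response has landed and was successful.
  const response = await responsePromise;
  if (!response.ok()) {
    throw new Error(`sign-up request failed with status ${response.status()}`);
  }

  // 2. The loading indicator is gone (hidden or removed from the DOM).
  await page.getByRole('progressbar').waitFor({ state: 'hidden' });

  // 3. The success state has rendered.
  await page
    .getByRole('heading', { name: 'Check your email' })
    .waitFor({ state: 'visible' });
}
```

Inside a spec the last two steps would more commonly be written as `expect(...).toBeHidden()` and `expect(...).toBeVisible()`, which retry until the condition holds or the assertion timeout expires. Either way, nothing in the helper depends on how fast the machine is.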

Step four: CI integration

The test suite ships with CI configuration that runs the tests on every merge to the main branch and on every pull request, with parallelization tuned to the suite size, retries on network blips (with a low retry count to avoid masking real flakes), and artifacts (screenshots, videos, traces) attached to failed runs so debugging is fast. The CI configuration is generated for the platform the project uses (GitHub Actions, GitLab CI, CircleCI, Buildkite). The skill also generates a Playwright config with the right browser matrix and the right timeout settings, calibrated to the journey list so the suite finishes in a reasonable wall-clock time.
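For orientation, a Playwright config along these lines could look like the sketch below. The specific values (worker count, timeouts, the `BASE_URL` variable, the localhost port) are illustrative assumptions, not what the skill emits for any particular project.

```typescript
// playwright.config.ts
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './e2e/journeys',
  fullyParallel: true,
  // Fail the CI build if a stray test.only was committed.
  forbidOnly: !!process.env.CI,
  // Playwright's `retries` re-runs a test after any kind of failure,
  // so this sketch leaves it at 0 and assumes that retries for network
  // blips are handled inside the tests themselves.
  retries: 0,
  // Assumed worker count; tune to the suite size and the CI runner.
  workers: process.env.CI ? 2 : undefined,
  timeout: 30_000,
  expect: { timeout: 5_000 },
  reporter: [['list'], ['html', { open: 'never' }]],
  use: {
    // BASE_URL is a hypothetical environment variable.
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
    // Keep debugging artifacts only for failed runs.
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

The CI job then only has to install the browsers, run `npx playwright test`, and upload the report and test-results directories when the run fails.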

Retries hide flakes. /proof-e2e configures retries for genuine network failures (typically retry once on a 5xx from an external service) but does not retry on selector failures or assertion failures, because retrying those masks the real issue. A test that needs three retries to pass is broken, not flaky.
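Playwright's built-in `retries` setting re-runs a test after any failure, so a policy that retries only on 5xx responses has to live closer to the network call. A minimal sketch, assuming a hypothetical `retryOn5xx` helper (this is not a Playwright API):

```typescript
// Re-issues a call only while the result is a 5xx, up to maxRetries times.
// Any other result is returned immediately so real failures stay visible.
async function retryOn5xx<T>(
  call: () => Promise<T>,
  statusOf: (result: T) => number,
  maxRetries = 1,
): Promise<T> {
  let result = await call();
  for (
    let attempt = 0;
    attempt < maxRetries && statusOf(result) >= 500;
    attempt++
  ) {
    result = await call();
  }
  return result;
}

// Demo with a stand-in for an external service: 503 first, then 200.
const statuses = [503, 200];
let calls = 0;
const flakyCall = async () => ({
  status: statuses[Math.min(calls++, statuses.length - 1)],
});

retryOn5xx(flakyCall, (r) => r.status).then((r) => {
  console.log(r.status, calls); // prints "200 2"
});
```

In a spec, the same helper could wrap a request made through Playwright's `request` fixture, for example `retryOn5xx(() => request.get(url), (res) => res.status())`. Selector and assertion failures never pass through it, so they still fail on the first attempt.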

Tonone's /proof-e2e skill writes E2E tests for the critical user journeys with stable selectors, event-based waits, and CI integration that runs on every merge.

When to use /proof-e2e, and when not to

/proof-e2e is the right call when the critical user journeys have no automated coverage and a regression in any of them would be visible to customers. The signal is when the team is doing manual regression testing before releases, when a recent incident traces back to a flow that should have been tested, or when a launch is approaching and the test bar is low. The skill is also the right call when an existing E2E suite is unreliable and the team is considering deleting it; running /proof-e2e against the suite produces a refactor plan that converts brittle tests to durable ones, often by replacing selectors and waits.

Skip the skill for unit-level testing (the right tool there is the project's existing unit framework) and for performance testing or load testing, which are different disciplines with different tooling. For API contract testing without a UI, /proof-api is the right call. For an audit of an existing test suite to find the broken tests, /proof-audit produces a triage report.

| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
| --- | --- | --- | --- |
| Tests user journeys, not implementation | Yes, journey-first design | Mirrors current DOM structure | Suggests within current line context |
| Stable selectors (role-based or testid) | Yes, role-first with testid fallback | CSS selectors that break on refactor | Defaults to most-specific selector |
| Event-based waits, no waitForTimeout | Yes, waits for verifiable conditions | Often arbitrary timeouts | Often arbitrary timeouts |
| CI integration generated | Yes, GitHub/GitLab/CircleCI configs included | Tests only, no CI | Tests only, no CI |
| Critical journey list calibrated to product | Yes, 5-15 journeys with reasoning | Tests whatever the prompt names | Tests whatever the prompt names |

A worked example: testing a sign-up journey

Suppose the brief is: add E2E coverage for sign-up. Run /proof-e2e and the output starts with the journey definition, then produces the test.

```typescript
// e2e/journeys/sign-up.spec.ts
import { test, expect } from '@playwright/test';

test.describe('sign-up journey', () => {
  test('a new user can create an account and reach onboarding', async ({
    page,
    request,
  }) => {
    // Unique address per run so repeated runs do not collide
    const email = `test+${Date.now()}@example.com`;

    await page.goto('/');
    await page.getByRole('link', { name: 'Sign up' }).click();

    await page.getByLabel('Email').fill(email);
    await page.getByLabel('Password').fill('correct horse battery staple');
    await page.getByRole('button', { name: 'Create account' }).click();

    // Wait for the verification screen, not for an arbitrary duration
    await expect(
      page.getByRole('heading', { name: 'Check your email' })
    ).toBeVisible();

    // Use the test helper to read the verification token from the
    // mailcatcher fixture (set up in the global setup). The address is
    // URL-encoded because a raw "+" in a query string decodes as a space.
    const { token } = await (
      await request.get(
        `/__test__/last-verification?email=${encodeURIComponent(email)}`
      )
    ).json();
    await page.goto(`/auth/verify?token=${encodeURIComponent(token)}`);

    // Verify the user lands on the onboarding screen
    await expect(
      page.getByRole('heading', { name: 'Welcome' })
    ).toBeVisible();
    await expect(
      page.getByText("Let's get you set up")
    ).toBeVisible();
  });
});
```

Every selector is user-facing: a role, a label, or visible text. Every assertion waits for a specific element. The test interacts with the same surface a user does (clicking links, filling forms, navigating) rather than poking at the application's internals. When the marketing team renames "Sign up" to "Get started," the test needs one update. When a redesign moves the form into a modal, the test needs zero updates because it does not care where the form lives. That is what "survives UI refactors" actually looks like in test code.

/proof-e2e covers user journeys. For API contract testing without a UI, /proof-api produces test suites that verify endpoint shapes, error responses, and pagination. For an audit of existing tests to find the broken or low-value ones, /proof-audit produces a triage report with deletion candidates and refactor priorities. For test strategy at the project level, /proof-strategy decides which test types belong where.

Install

/proof-e2e ships with the Proof agent in the Tonone for Claude Code package. Install Tonone, invoke /proof-e2e from any Claude Code session inside the project, and the skill produces a journey-first E2E suite with the CI integration the project uses.

1. Add to marketplace

$ claude plugin marketplace add tonone-ai/tonone

2. Install Proof

$ claude plugin install proof@tonone-ai

Durable E2E coverage is rare not because the discipline is hard to learn, but because the discipline is hard to apply consistently across a growing suite. The skill is built so the discipline is the default, which is the only way the suite stays useful over time.

Frequently asked questions

What does /proof-e2e do?
It writes E2E tests for the critical user journeys in the application, using role-based selectors, event-based waits, and CI integration that runs on every merge. The default framework is Playwright; Cypress and Selenium are also supported.
How is /proof-e2e different from a generalist AI writing tests?
A generalist writes tests that mirror the current DOM and use arbitrary timeouts, which break on the first UI refactor and flake under CI load. /proof-e2e identifies the critical journeys, uses stable selectors, and waits for verifiable conditions.
When should I use /proof-e2e?
When critical user flows have no automated coverage and a regression would be visible to customers. Also when an existing E2E suite is unreliable and the team is considering deleting it.
What frameworks does /proof-e2e support?
Playwright by default, Cypress and Selenium as alternatives. The skill detects which framework the project uses and matches it. For projects without an existing framework, Playwright is the default recommendation.
Does /proof-e2e generate CI configuration?
Yes. The skill generates the CI configuration for the platform the project uses (GitHub Actions, GitLab CI, CircleCI, Buildkite), with the right parallelization, retry policy, and artifact uploads on failure.
How do I install /proof-e2e?
Install Tonone for Claude Code via the get-started guide at tonone.ai/get-started. /proof-e2e ships with the Proof agent and is invoked as a slash command in any Claude Code session. Tonone is free and MIT-licensed.
Is /proof-e2e free?
Yes. The skill is part of Tonone, which is MIT-licensed. The only cost is Claude Code token usage during the work.
How many tests does /proof-e2e generate?
Between five and fifteen journey-level tests for most products, with explicit reasoning for each one. The skill is opinionated about keeping the suite small so it stays maintainable; component-level tests belong in unit or integration suites.
