Meet Proof

The AI QA Engineer for E2E and Integration Testing

Tonone's Proof designs test strategies, builds Playwright and Cypress suites focused on user journeys, creates API test suites, and triages flaky tests.

Proof · QA & Testing · 10 min read · April 4, 2026

Test suites grow the way codebases grow: by accretion. Someone writes a test for a bug fix. Someone else adds a Playwright test for a checkout flow that was breaking in staging. A third engineer adds integration tests for the new API endpoints. Over two years, the suite has eight hundred tests, a CI run that takes forty-seven minutes, and a dozen tests that fail intermittently for reasons nobody has been able to reproduce consistently. The team has responded rationally to each new test failure: retrying flaky tests automatically, skipping tests that are too hard to fix, adding new tests that duplicate coverage that already exists somewhere else. The result is a test suite that provides psychological comfort but not engineering confidence. It takes a long time to run, it fails unpredictably, and when a real bug gets to production, the post-mortem reveals that the test for that exact path was technically present but relied on an implementation detail rather than a user behavior. An AI QA engineer does not just add more tests to a system like this. It audits the system, fixes the strategy, and then builds the coverage that actually catches bugs before users do.

Why the generalist approach breaks down

Ask a generalist chatbot to write a Playwright test and it will write one. Ask it to write a test for a checkout flow and it will write a test that clicks buttons and asserts text content, and that test will be brittle from the moment it is written. It will select elements by text content that changes when the copy is updated, by CSS classes that change when the UI is redesigned, or by DOM position that changes when the component is refactored. The generalist does not know about page objects, test isolation, meaningful assertions at the user-behavior level rather than the implementation level, or the difference between a test that verifies the user completed checkout successfully and a test that verifies the order confirmation page contains the string "Thank you for your order." Those are different tests with dramatically different robustness, and generalist tools produce the second kind because it is what a quick scan of the page HTML suggests.

Cursor and GitHub Copilot have a structural problem with testing: they complete test patterns based on the surrounding code, not based on the user behaviors that matter. They will autocomplete a describe block, suggest a beforeEach setup, and fill in expect assertions that match the function signature visible in context. What they do not do is ask what the test is actually for, what user scenario it represents, what failure it is designed to catch, and whether it duplicates coverage that already exists somewhere in the suite. Autocomplete produces tests that are syntactically correct and that match the existing patterns in the codebase. It does not produce a test strategy: an assessment of which parts of the system have the highest failure risk, which test types provide the best coverage-to-cost ratio at each layer, and where the current test suite has gaps that actual users will find before the CI suite does.

The most expensive testing failure mode is not missing tests; it is wrong tests. Tests that pass consistently but test the wrong thing create false confidence: the team sees a green build and ships, and the user finds the bug the tests were designed to prevent. This is more common than it should be because test quality is not visible in CI output. Every test that passes looks identical to every other test that passes. The only way to distinguish a test that reliably catches real bugs from a test that only verifies its own implementation assumptions is to read it carefully against the user behavior it is supposed to protect. Proof does that work upfront, before the test is written, and produces tests that are designed to stay meaningful as the codebase evolves.

What a QA engineer actually does

On a disciplined engineering team, a QA engineer is the person who makes testing a first-class engineering concern rather than a box to check before shipping. They design the test strategy: for this project, given its risk profile and team size, here is the right distribution across unit, integration, and E2E tests. They write E2E tests that exercise user journeys, not UI states or function calls, using page object models that isolate the test logic from the selector details, so a UI change does not require rewriting every test that touches the affected component. They build API test suites that verify contract behavior and catch the regressions that happen when an API is changed without updating the consuming code. They triage flaky tests by treating them as the engineering debt they are, tracking down the real source of non-determinism rather than wrapping them in retry logic that masks the problem. Their goal is a test suite that is small enough to run quickly, stable enough to trust completely, and comprehensive enough that a green build means the team can ship with confidence.

Meet Proof

Proof is Tonone's QA engineer, the specialist agent for test strategy, Playwright and Cypress E2E suites, API test suites, coverage auditing, and flaky test triage. Proof's working philosophy is that tests should cost less to maintain than the bugs they prevent, and a test that is not catching real bugs is paying negative returns. Proof builds E2E tests around user journeys rather than implementation details, using page object models so selectors are encapsulated and changes do not cascade through the whole suite. It designs API tests that enforce contracts and catch the regressions that matter, and it audits existing suites to find the tests that are costing more than they are protecting.

Tonone's Proof builds E2E test suites around user journeys, not implementation details, using page object models that stay stable when the UI changes and assertions that catch real bugs rather than verifying their own assumptions.

What Proof actually does

Designing the right test strategy

The proof-strategy skill produces a test strategy document tailored to the specific project: not a generic "test pyramid" recommendation, but a concrete assessment of which test types provide the best coverage-to-cost ratio for this codebase's risk profile. Proof reads the architecture, identifies the critical user paths (the flows where a failure would have the most user impact), assesses the current test coverage distribution, and produces a strategy that answers four questions: how many unit tests, how many integration tests, how many E2E tests, and which specific scenarios warrant E2E coverage rather than lower-level coverage. The strategy includes the tools best suited for each layer (Vitest or Jest for unit tests, Supertest or the native Node.js test runner for API integration, Playwright or Cypress for E2E based on the stack and team preferences), the CI configuration that runs the right tests at the right stages (fast unit tests on every push, full E2E suite on pull requests to main), and a prioritized list of the tests that should be written first based on where the highest-risk coverage gaps are. For teams that have "testing" on the roadmap but cannot decide where to start, proof-strategy is the document that makes the starting point unambiguous.

Building Playwright E2E suites for user journeys

The proof-e2e skill writes Playwright (or Cypress) test suites using page object models, proper test isolation, and assertions that test user-observable behavior rather than implementation details. Each page object encapsulates the selectors and interactions for a page or component, so when the UI changes, only the page object needs to update, not every test that touches that page. The test files describe scenarios in terms that a non-engineer can read and understand: user signs up, confirms email, completes onboarding, and sees the dashboard. Assertions verify that the user reached the expected state, not that specific DOM elements contain specific strings. Setup and teardown are handled with fixtures that create and clean up test data properly, so tests run independently and in any order without leaving behind state that causes subsequent tests to fail or flake. Proof also configures the Playwright project file with the right browser targets, retry settings for known flaky CI timing issues, and screenshot/video capture on failure so failed E2E tests produce debugging artifacts rather than just a red status. The output is a test suite designed to stay maintainable as the codebase evolves, not one that requires constant updates every time the UI changes.
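As a rough illustration of the project file described above, a Playwright configuration might look like the following. This is a sketch under stated assumptions rather than Proof's actual output: the test directory, the `BASE_URL` environment variable, the local port, and the retry count are all hypothetical, while `defineConfig`, `devices`, and the `trace`, `screenshot`, and `video` options are standard Playwright configuration.

```typescript
// playwright.config.ts (illustrative sketch; paths and values are assumptions)
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  // Hypothetical location of the E2E specs
  testDir: './tests/e2e',
  fullyParallel: true,

  // Retry only in CI, so local runs surface flakiness immediately
  retries: process.env.CI ? 2 : 0,

  use: {
    // BASE_URL is an assumed environment variable, not a Playwright built-in
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
    // Debugging artifacts are captured only when something goes wrong
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },

  // Browser targets; trim this list to what the product actually supports
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

Keeping retries at zero outside CI is one possible design choice: a test that only passes on the second attempt still shows up as a failure on a developer's machine, where it is cheapest to investigate.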

Building API test suites for contract enforcement

The proof-api skill builds API test suites that verify contract behavior: the inputs the API accepts, the outputs it produces, the error conditions it handles correctly, and the authentication and authorization rules it enforces. These are not tests of the implementation but tests of the contract: if the API says it returns a 400 with a specific error shape for a missing required field, the test verifies that contract, so any future change that breaks it is caught before it reaches a consumer. Proof writes API tests that cover the happy path, the common error paths (missing fields, invalid types, unauthorized access, not-found resources), and the edge cases with outsized user impact (concurrent requests that should produce a conflict, idempotent operations that should return the same result on retry). The tests are structured for speed: they hit the API directly, not through the browser, and are configured to run in CI against a test database that is seeded with known state, so every test starts from a predictable baseline. For teams whose API is consumed by multiple clients (web, mobile, third-party integrations), contract tests become the safety net that lets the API evolve without breaking consumers.
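To make the "400 with a specific error shape" idea concrete, here is a minimal sketch of a contract check. Everything named in it is hypothetical: the error shape `{ error: { code, field, message } }`, the `productId` field, and the in-memory `fakeCreateOrder` handler that stands in for a real endpoint so the example runs without a server. In a real suite the response would come from an HTTP call against the seeded test environment.

```typescript
// Simplified view of an HTTP response: status code plus parsed JSON body.
interface ApiResponse {
  status: number;
  body: unknown;
}

// Hypothetical error shape promised by the contract.
interface ValidationError {
  error: { code: string; field: string; message: string };
}

// Type guard that checks the body against the promised error shape.
function isValidationError(body: unknown): body is ValidationError {
  if (typeof body !== "object" || body === null) return false;
  const err = (body as { error?: unknown }).error;
  if (typeof err !== "object" || err === null) return false;
  const e = err as Record<string, unknown>;
  return (
    typeof e.code === "string" &&
    typeof e.field === "string" &&
    typeof e.message === "string"
  );
}

// Returns the list of contract violations; an empty list means the contract holds.
function checkMissingFieldContract(res: ApiResponse, field: string): string[] {
  const violations: string[] = [];
  if (res.status !== 400) {
    violations.push(`expected status 400, got ${res.status}`);
  }
  if (!isValidationError(res.body)) {
    violations.push("error body does not match the documented shape");
  } else if (res.body.error.field !== field) {
    violations.push(`expected error.field "${field}", got "${res.body.error.field}"`);
  }
  return violations;
}

// In-memory stand-in for the real endpoint (an assumption for this sketch).
function fakeCreateOrder(payload: { productId?: string; quantity?: number }): ApiResponse {
  if (payload.productId === undefined) {
    return {
      status: 400,
      body: {
        error: { code: "missing_field", field: "productId", message: "productId is required" },
      },
    };
  }
  return { status: 201, body: { id: "order-1" } };
}

// The request omits productId, so the contract says: 400 plus the error shape.
const violations = checkMissingFieldContract(fakeCreateOrder({ quantity: 2 }), "productId");
```

The check knows nothing about how the handler validates input. It only knows what a consumer was promised, which is what lets the implementation change freely while the contract stays pinned.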

Tonone's Proof builds API test suites that enforce contracts rather than testing implementations, so API changes that break consumers are caught in CI before they reach production.

Auditing coverage gaps and flaky tests

The proof-audit skill examines an existing test suite and produces a structured findings report covering three dimensions: coverage gaps (the user paths and API contracts that have no test coverage and represent real risk), flaky tests (the tests that fail intermittently, with an assessment of the likely cause and the recommended fix), and test quality (tests that pass consistently but are testing the wrong thing, verifying implementation details rather than user behavior, or using selectors that will break on the next UI update). The audit report is prioritized by impact: fixing a flaky test that runs in every CI pipeline and fails 20% of the time is worth more engineering time than writing a new test for a low-traffic edge case. For teams whose CI suite is a source of uncertainty rather than confidence, the audit is often the more valuable investment than adding new tests. Identifying and fixing five flaky tests and rewriting three brittle E2E tests can do more for CI reliability than adding fifty new tests on top of the existing problems.
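One way to put a number on "fails 20% of the time" is to compute failure rates from CI history. The sketch below is an assumption about how such a ranking could work, not a description of how proof-audit is implemented: the `TestRun` record shape is hypothetical, and a test is treated as flaky only when the same commit has produced both a pass and a failure, which separates non-determinism from a genuine regression.

```typescript
// One CI execution of one test (hypothetical record shape).
interface TestRun {
  testName: string;
  commit: string;
  passed: boolean;
}

interface FlakeReport {
  testName: string;
  runs: number;
  failureRate: number;
}

// Ranks flaky tests by failure rate, highest first.
function rankFlakyTests(history: TestRun[]): FlakeReport[] {
  const byTest = new Map<string, TestRun[]>();
  for (const run of history) {
    const runs = byTest.get(run.testName) ?? [];
    runs.push(run);
    byTest.set(run.testName, runs);
  }

  const reports: FlakeReport[] = [];
  byTest.forEach((runs, testName) => {
    const passedCommits = new Set<string>();
    const failedCommits = new Set<string>();
    for (const run of runs) {
      (run.passed ? passedCommits : failedCommits).add(run.commit);
    }
    // Flaky means: at least one commit both passed and failed.
    const isFlaky = Array.from(failedCommits).some((c) => passedCommits.has(c));
    if (isFlaky) {
      const failures = runs.filter((r) => !r.passed).length;
      reports.push({ testName, runs: runs.length, failureRate: failures / runs.length });
    }
  });

  return reports.sort((a, b) => b.failureRate - a.failureRate);
}

// Made-up history for illustration only.
const history: TestRun[] = [
  { testName: "checkout", commit: "a", passed: true },
  { testName: "checkout", commit: "a", passed: false },
  { testName: "checkout", commit: "b", passed: true },
  { testName: "checkout", commit: "b", passed: true },
  { testName: "checkout", commit: "b", passed: true },
  { testName: "search", commit: "a", passed: true },
  { testName: "search", commit: "a", passed: false },
  // login fails on commit b only and never passes there: a regression, not a flake
  { testName: "login", commit: "a", passed: true },
  { testName: "login", commit: "b", passed: false },
];

const ranked = rankFlakyTests(history);
```

With this made-up history, `search` and `checkout` are reported as flaky and `login` is not, because its only failure is consistent for its commit.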

Reconnaissance before testing

The proof-recon skill performs a thorough assessment of the project's testing landscape before any tests are written or strategy is designed. It reads the existing test files to understand the current coverage distribution, identifies the testing framework and libraries in use, locates the CI configuration to see how tests are currently run and whether they are blocking deployments, and maps the critical user paths based on the application's feature set and routing. Recon also surfaces the structural issues that affect test quality at scale: test files that import from non-test code in ways that make them fragile, shared mutable state between tests that causes order-dependent failures, and database fixtures that are not cleaned up between test runs. These findings inform the test strategy and every subsequent Proof output. Without recon, a test strategy risks recommending a new approach that duplicates the problems already present in the existing suite.
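The shared-mutable-state problem is easy to reproduce in miniature. The following sketch uses a hypothetical two-test suite and a tiny hand-rolled runner (not any real test framework) to show why the same tests can pass in one order and fail in another, and how per-test fixtures remove the dependency.

```typescript
type Suite = Record<string, () => void>;

function expectLength(cart: string[], expected: number): void {
  if (cart.length !== expected) {
    throw new Error(`expected ${expected} items, found ${cart.length}`);
  }
}

// Anti-pattern: one cart shared by every test in the suite.
function makeSharedStateSuite(): Suite {
  const cart: string[] = [];
  return {
    addsItem: () => {
      cart.push("sku-1");
      expectLength(cart, 1);
    },
    startsEmpty: () => {
      expectLength(cart, 0);
    },
  };
}

// Fix: each test builds its own state through a fixture function.
function makeIsolatedSuite(): Suite {
  const freshCart = (): string[] => [];
  return {
    addsItem: () => {
      const cart = freshCart();
      cart.push("sku-1");
      expectLength(cart, 1);
    },
    startsEmpty: () => {
      expectLength(freshCart(), 0);
    },
  };
}

// Runs the named tests in the given order and returns the names that failed.
function failuresInOrder(suite: Suite, order: string[]): string[] {
  const failed: string[] = [];
  for (const name of order) {
    const testFn = suite[name];
    if (!testFn) continue;
    try {
      testFn();
    } catch {
      failed.push(name);
    }
  }
  return failed;
}

// Shared state: passes in one order, fails in the other.
const sharedForward = failuresInOrder(makeSharedStateSuite(), ["startsEmpty", "addsItem"]);
const sharedReversed = failuresInOrder(makeSharedStateSuite(), ["addsItem", "startsEmpty"]);

// Isolated fixtures: order no longer matters.
const isolatedReversed = failuresInOrder(makeIsolatedSuite(), ["addsItem", "startsEmpty"]);
```

In the shared version, `startsEmpty` fails only when it runs after `addsItem`, which is exactly the kind of failure that appears when a runner parallelizes or reorders tests.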

A worked example

A team needs a Playwright E2E test for their checkout flow, from adding an item to the cart through completing payment and seeing the order confirmation. They ask Proof to write it. Rather than writing a flat test that clicks through the UI, Proof produces a page object model structure with proper setup, teardown, and user-journey assertions:

```typescript
// tests/e2e/checkout.spec.ts, generated by Proof
import { test, expect } from '@playwright/test';
import { CartPage } from './pages/CartPage';
import { CheckoutPage } from './pages/CheckoutPage';
import { OrderConfirmationPage } from './pages/OrderConfirmationPage';
import { seedTestUser, cleanupTestUser } from './fixtures/users';

let testUserId: string;

test.beforeEach(async ({ page }) => {
  // Create a fresh test user with a funded test card for each test run
  const user = await seedTestUser({ withPaymentMethod: true });
  testUserId = user.id;
  await page.goto('/login');
  await page.fill('[data-testid="email"]', user.email);
  await page.fill('[data-testid="password"]', user.password);
  await page.click('[data-testid="login-submit"]');
  await expect(page).toHaveURL('/dashboard');
});

test.afterEach(async () => {
  await cleanupTestUser(testUserId);
});

test('user completes checkout and receives order confirmation', async ({ page }) => {
  const cart = new CartPage(page);
  const checkout = new CheckoutPage(page);
  const confirmation = new OrderConfirmationPage(page);

  // Add item to cart
  await cart.navigateTo();
  await cart.addItem({ productId: 'test-product-001', quantity: 2 });
  await expect(cart.itemCount).toHaveText('2');

  // Proceed through checkout
  await cart.proceedToCheckout();
  await checkout.fillShippingAddress({
    name: 'Test User',
    line1: '123 Test St',
    city: 'San Francisco',
    state: 'CA',
    zip: '94102',
  });
  await checkout.selectShippingMethod('standard');
  await checkout.confirmPayment(); // uses pre-seeded test card

  // Assert user-observable outcome: order was placed and confirmed
  await expect(page).toHaveURL(/\/orders\/[a-z0-9-]+\/confirmation/);
  await expect(confirmation.heading).toContainText('Order confirmed');
  await expect(confirmation.orderNumber).toBeVisible();
  await expect(confirmation.estimatedDelivery).toBeVisible();
});
```

The test uses page objects that encapsulate selectors, so a UI change requires updating only the page object file, not every test that uses the page. The beforeEach creates a fresh test user with a payment method seeded through the test fixture system rather than by clicking through the UI setup flow, which would be slow and a source of flakiness. The assertions verify user-observable outcomes (the browser lands on an order confirmation URL, the order number is visible, the estimated delivery is visible) rather than implementation details. The only copy-dependent check is the confirmation heading, and it is confined to a single assertion. The afterEach cleans up the test user so the test database stays clean and tests remain independent. This is the structure Proof applies to every E2E test it produces: the structure that stays maintainable two years after it was written.
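The spec imports `CartPage` without showing it, so here is a minimal sketch of what such a page object could look like. It is an assumption, not the file Proof generates: the `/cart` route and every `data-testid` value are hypothetical, and `PageLike` and `LocatorLike` are small structural stand-ins for the parts of Playwright's `Page` and `Locator` used here, included so the sketch type-checks on its own. A real page object would import `Page` and `Locator` from `@playwright/test` so that `itemCount` works with `expect(...).toHaveText(...)`.

```typescript
// Structural stand-ins for the subset of Playwright's Locator and Page used below.
interface LocatorLike {
  click(): Promise<void>;
  fill(value: string): Promise<void>;
}

interface PageLike {
  goto(url: string): Promise<unknown>;
  getByTestId(testId: string): LocatorLike;
}

class CartPage {
  private readonly page: PageLike;

  // Exposed so tests can assert on it without knowing the selector.
  readonly itemCount: LocatorLike;

  constructor(page: PageLike) {
    this.page = page;
    this.itemCount = page.getByTestId("cart-item-count"); // hypothetical test id
  }

  async navigateTo(): Promise<void> {
    await this.page.goto("/cart"); // hypothetical route
  }

  async addItem(item: { productId: string; quantity: number }): Promise<void> {
    // All selector knowledge for the cart lives in this class.
    await this.page.getByTestId(`quantity-${item.productId}`).fill(String(item.quantity));
    await this.page.getByTestId(`add-to-cart-${item.productId}`).click();
  }

  async proceedToCheckout(): Promise<void> {
    await this.page.getByTestId("proceed-to-checkout").click();
  }
}
```

If the cart markup changes, for example the quantity input gets a new test id, the edit happens in `addItem` and the spec file above stays untouched.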

If your CI suite has flaky tests, run proof-audit before adding any new tests. Flaky tests are engineering debt that compounds: each new test added on top of a flaky foundation is one more test that may fail for non-deterministic reasons, and each retry added to mask the flakiness is one more minute added to the CI run. Fix the existing suite first, then build new coverage on a stable foundation.
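The compounding effect can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch under a strong assumption: each attempt of each flaky test fails independently with a fixed probability, which real suites only approximate. The numbers are hypothetical: twelve flaky tests at a 5% failure rate each.

```typescript
// Probability that one test passes within `attempts` tries,
// assuming independent attempts that each fail with `failureRate`.
function passProbability(failureRate: number, attempts: number): number {
  return 1 - Math.pow(failureRate, attempts);
}

// Probability that every flaky test in the suite ends up green.
function greenBuildProbability(failureRates: number[], attempts: number): number {
  return failureRates.reduce((p, rate) => p * passProbability(rate, attempts), 1);
}

// Expected executions of one test when it is retried until it passes
// or `attempts` is exhausted: 1 + p + p^2 + ... + p^(attempts - 1).
function expectedExecutions(failureRate: number, attempts: number): number {
  let total = 0;
  for (let k = 0; k < attempts; k++) {
    total += Math.pow(failureRate, k);
  }
  return total;
}

// Hypothetical suite: twelve flaky tests, each failing 5% of the time.
const rates = Array.from({ length: 12 }, () => 0.05);

const withoutRetries = greenBuildProbability(rates, 1); // roughly 0.54
const withTwoRetries = greenBuildProbability(rates, 3); // above 0.99
const costPerTest = expectedExecutions(0.05, 3);
```

Under these assumptions, a build with no real bugs goes green only about 54% of the time without retries. Two retries push that above 99%, which is why retries feel like a fix, but the non-determinism is still there and every retry is extra CI time spent hiding it.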

Proof vs the alternatives

Proof is not competing with testing frameworks; it generates the tests that run in them. The comparison below focuses on the approaches most commonly used to produce test coverage in practice: generalist chatbots that write tests on demand, code completion tools that autocomplete test patterns, and hand-written Playwright tests produced by engineers who are not specialists in testing.

Tonone's Proof designs test strategies tailored to each project's risk profile, not a generic test pyramid, but a concrete assessment of which test types provide the best coverage-to-cost ratio for this codebase.

| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
| --- | --- | --- | --- |
| Test strategy for the project's risk profile | Yes, concrete strategy with layer distribution, tool selection, and CI configuration | No, produces individual tests without strategy thinking | No, completes test patterns, does not assess strategy |
| E2E tests with page object models | Yes, full page object structure, fixtures, setup/teardown, stable selectors | Partial, produces flat tests with brittle selectors | Partial, autocompletes patterns but not the full page object structure |
| API contract test suites | Yes, happy path, error paths, auth rules, idempotency, edge cases | Partial, produces basic tests without contract framing | Partial, completes test assertions without contract coverage assessment |
| Flaky test triage and root cause | Yes, identifies source of non-determinism, recommends structural fix | No, no debugging capability for test non-determinism | No, no test audit capability |
| Coverage gap identification | Yes, maps critical user paths to existing coverage, finds real gaps | No, no coverage analysis capability | No, completes tests in context but does not assess overall gaps |
| Tests that stay maintainable over time | Yes, page objects, user-behavior assertions, isolated fixtures | No, produces tests that break when UI or implementation changes | Limited, depends on the patterns visible in the surrounding code |

Install and try

Tonone is free and MIT-licensed. Install it once and all 23 agents, including Proof, are available in your Claude Code session. Run proof-recon on any project to get a clear picture of the current test landscape before deciding where to invest in coverage.

1. Add to marketplace

$ claude plugin marketplace add tonone-ai/tonone

2. Install Proof

$ claude plugin install proof@tonone-ai

Frequently asked questions

What does Tonone's Proof do?
Proof is the AI QA engineer in the Tonone team for Claude Code. It designs test strategies for the project's risk profile, builds Playwright and Cypress E2E suites with page object models and proper isolation, creates API test suites for contract enforcement, audits existing suites for flaky tests and coverage gaps, and produces tests that stay maintainable as the codebase evolves.
How does Proof make Playwright tests less brittle?
Proof uses page object models that encapsulate all selectors for a page or component in a single file. When the UI changes, only the page object needs updating, not every test that uses the page. Proof also writes assertions that test user-observable outcomes rather than specific DOM content, so copy changes and layout changes do not break the test.
What is a test strategy and why do I need one?
A test strategy is a concrete plan for how a project's testing coverage is structured: which test types (unit, integration, E2E) cover which parts of the system, which tools are used at each layer, how tests are run in CI, and where the highest-priority coverage gaps are. Without a strategy, teams accumulate tests without a coherent structure, producing coverage that is uneven, slow to run, and unreliable to trust.
How does Proof handle flaky tests?
Proof treats flaky tests as engineering debt with a root cause that needs to be found and fixed, not masked with retries. The proof-audit skill analyzes flaky test patterns to identify the source (shared mutable state, timing dependencies, missing cleanup, or order-dependent fixtures) and recommends structural fixes that eliminate the non-determinism rather than working around it.
Can Proof write API tests for GraphQL or gRPC?
Yes. Proof writes API test suites for REST, GraphQL, and gRPC APIs, covering contract enforcement at the protocol level relevant to each. For GraphQL, this includes query shape, error responses, and resolver behavior. For REST, it covers HTTP semantics, status codes, error shapes, and auth enforcement.
Is Tonone's Proof free to use?
Yes. Tonone is MIT-licensed and free to use. Proof is one of 23 agents included in the Tonone package for Claude Code. You pay only for Claude Code token usage during the testing work itself.
How does Proof handle test data and database state?
Proof uses fixture functions that create test data programmatically before each test and clean it up after, so tests run independently without leaving state that affects subsequent tests. This is the primary structural fix for order-dependent test failures: tests that pass in isolation but fail when run after another test that left conflicting database state.
What is the difference between proof-e2e and proof-api?
proof-e2e builds browser-level end-to-end tests using Playwright or Cypress, exercising user journeys through the full application stack. proof-api builds API-level test suites that test HTTP endpoints directly, covering contract behavior, error handling, and auth rules without a browser. Both use proper isolation and fixture management, but at different layers of the test pyramid.

Pairs well with