Accepting AI output: acceptance criteria and tests from the spec


AI produces code that looks right, and that is exactly the danger. It compiles, the tests are green, you clicked through it and it works. So you close the task. Two weeks later it turns out that an empty cart still creates an order, a double-click charges twice, and a "payment error" quietly turns into a success response. The code wasn't broken; it was plausible. It did the wrong thing, but did it confidently.

Acceptance is the phase where you separate "done" from "looks like done". And it is a separate piece of work, not a side effect of "well, I read the diff".

Acceptance is not code review

They're easy to confuse, because both happen over the same PR. But they look in different directions.

Code review asks "is this written well". Contract at the seams, edge cases, behavioural tests, methodology fit — that conversation lives in «Reviewing code written by AI». Review looks inward: are the layers clean, is there any hallucinated API, is the transaction in the right place.

Acceptance asks "was the right thing built at all". It doesn't care whether the Handler is elegant if it implements the wrong operation. Code can be flawless by every rule and still solve an adjacent problem. Code review will pass it, because everything is clean. Acceptance is obliged to reject it.

The difference is practical. Review can be partly delegated to linters and AI skills — code quality is machine-checkable. Acceptance cannot be fully delegated: the reference you compare the result against doesn't live in the code. It lives in the contract.

The reference for acceptance is the contract, not the code

In the previous phase — «From a product slice to a UCP contract» — you turned the smallest valuable slice into a contract: boundaries (what's in, what's explicitly not), acceptance criteria (observable conditions for "done"), interfaces at the seams. That is the reference. Acceptance is checking the AI's result against that document, item by item.

Three axes to check:

Are the acceptance criteria met? Each criterion is a checkable statement about behaviour. "Paying for an order with insufficient balance leaves the order in pending, the payment does not go through, the user sees a clear error." Not "payment works", but a concrete observable condition. Walk the list: each item is either confirmed by a test or checked by hand.
Are the boundaries respected? AI loves to do things "while it's here". You asked for payment — it added cancellation because it's "logical nearby". Anything past the slice boundary is not a bonus; it's unrequested behaviour that you now have to support and that nobody designed. Reject it.
Do the interfaces at the seams match? Signatures, field names, error codes, event shape — exactly what the contract records. A silent field rename "for convenience" breaks everything outside.

The key shift: you're comparing not "the code against how I pictured it", but the code against a document written before generation. No contract, no acceptance — only "feels about right", which is precisely the state in which plausible code passes.
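
The item-by-item walk can be sketched as a small checklist structure. This is only an illustration: the `Criterion` shape, the `AC-…` identifiers and the test names are assumptions, not part of any real contract format. The point is that every criterion carries recorded evidence (a test name or a note about a manual check), and anything without evidence is visibly unconfirmed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Criterion:
    """One acceptance criterion from the contract (hypothetical shape)."""
    ident: str
    statement: str
    evidence: Optional[str] = None  # test name or a note on the manual check


def unconfirmed(criteria: list[Criterion]) -> list[str]:
    """Return the ids of criteria that have no recorded evidence yet."""
    return [c.ident for c in criteria if not c.evidence]


# The example criterion from the text, split into its three observable parts.
criteria = [
    Criterion("AC-1", "Insufficient balance leaves the order in pending",
              evidence="test_insufficient_balance_leaves_order_pending"),
    Criterion("AC-2", "The payment does not go through",
              evidence="test_insufficient_balance_charges_nothing"),
    Criterion("AC-3", "The user sees a clear error"),
]

print(unconfirmed(criteria))
```

Here `AC-3` is reported as unconfirmed, so the task is not accepted until someone writes a test for it or checks it by hand and records that.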

Where AI is plausibly wrong

A plausible error is one you can't see on the happy path. AI is strong on the success scenario, which makes up the overwhelming majority of its training data. It errs where the data was thinner, and exactly where the error doesn't catch the eye.

Edge cases. Empty collection, missing data, zero, range boundary. An empty order is created "successfully", a division by zero hides behind the average of an empty list, findById returned "not found" and the code carried on with null. The happy path, meanwhile, is flawless.
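
Two of these edge cases can be sketched in a few lines. The names `average`, `require_order` and `OrderNotFound` are hypothetical, and raising an exception is only one possible decision; what the empty case should actually do is for the contract to say. What matters is that the edge case is decided explicitly instead of being discovered later.

```python
from typing import Optional


class OrderNotFound(Exception):
    """Raised instead of letting a missing order flow on as None."""


def average(values: list[float]) -> float:
    # The plausible version, sum(values) / len(values), fails with
    # ZeroDivisionError on an empty list. Here the empty case is explicit.
    if not values:
        raise ValueError("average of an empty list is undefined")
    return sum(values) / len(values)


def require_order(orders: dict[str, dict], order_id: str) -> dict:
    # dict.get returns None for a missing key; stop here rather than
    # carrying None into the rest of the operation.
    order: Optional[dict] = orders.get(order_id)
    if order is None:
        raise OrderNotFound(order_id)
    return order
```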

Error handling. The most typical plausible lie. An external call failed — and the catch block returned an empty result or a default. Formally "handled", in fact the error was swallowed. Response 200, plausible body, and a silent failure. Check not "is there a try-catch", but "what exactly happens in the failure branch": a typed error is thrown, the transaction rolls back, the retry is idempotent — or a stub is quietly returned.
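
The difference between "handled" and "swallowed" is easiest to see side by side. The sketch below is an assumption-laden illustration: `FailingGateway`, `PaymentGatewayError` and both `pay_*` functions are made-up names, not a real payment API. Both functions contain a try/except; only one of them tells the caller the truth.

```python
class PaymentGatewayError(Exception):
    """Typed error for a failed call to the (hypothetical) payment gateway."""


class FailingGateway:
    """Test double that simulates an external service being down."""

    def charge(self, amount_cents: int) -> str:
        raise ConnectionError("gateway timeout")


def pay_swallowing(gateway, amount_cents: int) -> dict:
    # The plausible lie: the failure is "handled" by returning
    # a success-shaped default.
    try:
        return {"status": "ok", "payment_id": gateway.charge(amount_cents)}
    except Exception:
        return {"status": "ok", "payment_id": None}


def pay_strict(gateway, amount_cents: int) -> dict:
    # The failure branch surfaces as a typed error that the caller
    # cannot mistake for success.
    try:
        return {"status": "ok", "payment_id": gateway.charge(amount_cents)}
    except ConnectionError as exc:
        raise PaymentGatewayError("charge failed") from exc
```

With `FailingGateway`, `pay_swallowing` still reports `"status": "ok"`, while `pay_strict` raises `PaymentGatewayError`. An acceptance test for the failure branch should drive the code with a failing dependency like this one and assert on what comes out.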

Hidden requirements. What's in the contract but wasn't repeated in the prompt because "it's obvious". Idempotency of a repeated payment. A permission check: can this user even cancel this order? Consistency of the event and the record in one transaction. AI doesn't infer the implicit; it fills gaps with the average of its sample. If a requirement isn't stated in the contract and isn't checked, treat it as not implemented.
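
Idempotency of a repeated payment is a good example of a requirement that is easy to state as a check. A minimal sketch, assuming the caller sends an idempotency key with each request; `PaymentService` and its fields are hypothetical, and the in-memory dictionary stands in for what would need to be a durable store with an atomic check in a real system.

```python
class PaymentService:
    """In-memory illustration only; not safe for concurrent use."""

    def __init__(self) -> None:
        self.charges: list[tuple[str, int]] = []  # every real charge made
        self._results: dict[str, dict] = {}       # idempotency key -> first result

    def pay(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        # A repeated request with the same key returns the stored result
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((order_id, amount_cents))
        result = {"order_id": order_id, "status": "paid",
                  "charge_no": len(self.charges)}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.pay("key-1", "order-42", 1500)
second = service.pay("key-1", "order-42", 1500)  # the double-click
```

The check that belongs in acceptance is the pair of observations: both calls return the same result, and exactly one charge was made.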

What all three share: they pass the demo. That's why "I clicked through, it works" is not acceptance but its first and weakest layer.

Why tests come from the contract, not from the agent's code

If AI wrote both the code and the tests for it, the test verifies that the code does what the code does. A tautology in green. It pins down current behaviour, bugs included, and defends them at every refactor. Such a test fails when someone fixes the bug, not when the bug is present.

A test has value only when its reference is independent of the implementation. The source of that reference is the contract: acceptance criteria turn directly into test cases. "An order with insufficient balance stays in pending" is a ready-made test, written before the agent touched the code. It checks behaviour from the spec, not the shape of the agent's code.
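
As an illustration, the criterion about insufficient balance can be turned into a test almost word for word. Everything named here (`Order`, `Wallet`, `pay_order`, `InsufficientBalance`) is a hypothetical stand-in; `pay_order` plays the role of the agent's implementation, and the test touches only its observable behaviour, never its internals.

```python
from dataclasses import dataclass


class InsufficientBalance(Exception):
    pass


@dataclass
class Order:
    total_cents: int
    status: str = "pending"


@dataclass
class Wallet:
    balance_cents: int


def pay_order(order: Order, wallet: Wallet) -> None:
    """Stand-in for the agent's code; only its observable behaviour matters."""
    if wallet.balance_cents < order.total_cents:
        raise InsufficientBalance("balance is lower than the order total")
    wallet.balance_cents -= order.total_cents
    order.status = "paid"


def test_insufficient_balance_leaves_order_pending() -> None:
    order, wallet = Order(total_cents=1000), Wallet(balance_cents=300)
    try:
        pay_order(order, wallet)
    except InsufficientBalance as exc:
        error = str(exc)
    else:
        raise AssertionError("payment must not succeed")
    assert order.status == "pending"    # the order stays in pending
    assert wallet.balance_cents == 300  # the payment did not go through
    assert error                        # there is a non-empty error message


test_insufficient_balance_leaves_order_pending()
```

Each assertion maps to one clause of the criterion, so the test could have been written before the implementation existed and stays valid if the implementation is rewritten.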

Hence the practice: acceptance criteria are formulated before generation and turned into tests before acceptance. Then "green tests" means "the contract's behaviour holds", not "the code doesn't crash". The gap between those two meanings of green is the gap between acceptance and its imitation. The formalized, versioned reference that makes this checkable is the theme of «Executable engineering standard»: a rule a machine can check is worth ten rules in someone's head.

What by hand, what by autotest

The split is simple: what can be expressed as a checkable condition goes to an autotest; what requires judgement goes by hand.

Cover with autotests:

Every acceptance criterion phrased as an observable condition.
Edge cases from the contract: null, empty collection, boundary, concurrent access.
Failure branches: an external service is down — check that exactly what was intended happens.
Seams: response shape, error codes, event fields — with concrete values, not "not null".
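
For the seams, "concrete values" might look like the sketch below. The expected response is an invented example of what a contract could record (the status `402`, the code `INSUFFICIENT_BALANCE` and the field names are all assumptions). The helper reports every field that is missing, unexpected, or different, so a silent rename shows up as two findings.

```python
# Hypothetical error response as the contract might record it.
EXPECTED_ERROR = {
    "status": 402,
    "code": "INSUFFICIENT_BALANCE",
    "order_status": "pending",
}


def seam_mismatches(response: dict, expected: dict) -> list[str]:
    """List every field that is missing, unexpected, or differs from the contract."""
    problems = []
    for key, value in expected.items():
        if key not in response:
            problems.append(f"missing field: {key}")
        elif response[key] != value:
            problems.append(f"{key}: expected {value!r}, got {response[key]!r}")
    for key in response.keys() - expected.keys():
        problems.append(f"unexpected field: {key}")
    return sorted(problems)


# A response where "code" was renamed to "error_code" for convenience.
renamed = {"status": 402, "error_code": "INSUFFICIENT_BALANCE",
           "order_status": "pending"}
print(seam_mismatches(renamed, EXPECTED_ERROR))
```

A check such as `renamed["status"] is not None` would pass on this response; the comparison against concrete values does not.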

Look by hand where there's no objective reference:

Slice boundaries: did the AI do more than was asked? A machine doesn't know what's past the boundary; you do.
Meaning, not shape: does this solve the original product problem, not the adjacent one?
Hidden requirements you might not have written into the contract. Acceptance is the last moment to catch and add them.
Plausibility of the failure response: does a swallowed error come back looking like "success"?

This only works with an honest reference. No contract, and acceptance collapses into "looks about right", and plausible code passes precisely because it's plausible. The contract is what turns acceptance from a feeling into a check.

In short

The danger of AI code isn't that it crashes, but that it looks right while doing the wrong thing.
Acceptance ≠ code review. Review is about code quality (inward). Acceptance is about "is it the right thing" (checking against the contract).
The reference for acceptance is the contract, written before generation: acceptance criteria, boundaries, interfaces. No contract, no acceptance.
AI is plausibly wrong in edge cases, failure branches, and hidden requirements — where the error is invisible on the happy path.
Tests come from the contract, not the agent's code — otherwise they defend bugs instead of catching them.
Autotest for everything checkable; by hand for boundaries, meaning, and hidden requirements.