How to Review Code Written by AI

Not "AI is bad/good," but what specifically to look for in a PR written with Claude/Copilot/Cursor, what to hand off to the machines, and where the review process breaks down under the flood of generated code.

What's inside (30-second version):

Why the old model — "an experienced reviewer reads a 200–400 line PR" — falls apart when AI dumps 5–10x more code in the same day

The 5 spots to look at first: the contract at the seams, methodology conformance, edge cases, behavior tests, imports

What to hand to the machine: style, imports, ArchUnit, spec-as-code

A layered process: pre-commit → AI skills in CI → comparison against the spec → the human last, for 10 minutes of meaningful discussion

5 anti-patterns and a printable 17-item checklist

This process has been running in a backend cluster team of 20+ engineers for a year and a half. In the comments on a related publication about methodology + AI, Alexey Tolmachev (Senior BSA, 14+ years) wrote: "6× fewer defects after formalizing context boundaries before code generation."

The article is a follow-up to "AI writes the code. So why bother with methodology?". That one discussed the shared context without which AI hands you a different version of the truth every session. This one is about the concrete process of reviewing code that — with that context or without it — has already been written and landed in a PR.

Why ordinary code review doesn't work here

The old model: one of the experienced developers reads a 200–400 line PR, leaves 5–10 comments, the author fixes them, we merge. Daily volume — 2–4 PRs per reviewer.

With AI tools the same team dumps 5–10x more code into PRs in the same day. Not because the developers are bad — because Claude or Cursor genuinely speed things up. And that's where the problems begin, the ones the old model was never designed for:

Volume. The reviewer physically can't keep up with the reading. If you spend the same amount of time on every line, review becomes the bottleneck, and the team starts merging without review before anyone notices.
A confident tone ≠ correctness. AI writes code in the calm voice of "this is how it's done." The more powerful the model, the more convincing the hallucination sounds. In code review that means your eyes slide easily over correct-looking code that doesn't work, or doesn't work the way it should.
Context isn't preserved between sessions. Every PR is written from a blank slate — AI doesn't remember that in the last PR the team agreed to use a particular approach to nullable values. Without explicit rules on top of AI you get code that is locally correct but incompatible with the rest of the codebase.
Coverage skews toward the happy path. AI writes the happy path beautifully. Edge cases (null, empty collection, concurrent access, an error further down the stack) are handled worse, because they're underrepresented in the training data too.

The conclusion isn't "cancel AI" but "change the review model to fit the new flow."

What to look for first

In code written with AI, I look at the following spots in a fixed order. If the first one is broken — there's no point going further, rewrite the whole thing.

The contract at the seams

The API signature, the public methods of domain services, the events on the bus — the things visible to neighboring services and the team. AI loves to tweak the contract just slightly "for convenience": rename a field in a DTO, change the return type from an optional wrapper to null, replace void with bool for a "convenient check." Such changes silently break everything on the outside.

What I check:

The method signature matches what's described in the OpenAPI/AsyncAPI/specification
Field names in the JSON match the spec (kebab-case URL, camelCase JSON in the typical stack)
Returned errors are declared and handled by exactly those who should handle them
If there's a spec-as-code (see the Use Case specification) — the behavior in the code matches the "Commands" / "Queries" section
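
The field-name check from this list is easy to automate. A minimal sketch, assuming the spec is available as a JSON-Schema-like dict and the DTO is a dataclass; `ORDER_SCHEMA`, `OrderResponse`, and `contract_drift` are hypothetical names made up for the illustration, not part of any real tool:

```python
from dataclasses import dataclass, fields

# Hypothetical excerpt of a response schema taken from the OpenAPI document.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "orderId": {"type": "string"},
        "customerId": {"type": "string"},
        "totalAmount": {"type": "number"},
    },
}


@dataclass
class OrderResponse:
    """Hypothetical DTO. Field names mirror the JSON (camelCase) on purpose.

    The generated code renamed totalAmount to total "for convenience".
    """
    orderId: str
    customerId: str
    total: float


def contract_drift(dto_cls, schema):
    """Return (missing, extra): spec fields absent from the DTO, DTO fields absent from the spec."""
    dto_fields = {f.name for f in fields(dto_cls)}
    spec_fields = set(schema["properties"])
    return sorted(spec_fields - dto_fields), sorted(dto_fields - spec_fields)


missing, extra = contract_drift(OrderResponse, ORDER_SCHEMA)
print("missing from DTO:", missing)
print("not in spec:", extra)
```

A non-empty result in either list is the kind of silent contract change that is worth stopping the review for.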

Methodology conformance

In our case — Use Case Pattern and its maturity levels. Without explicit skills, AI often slides back to "the way it's usually done in the training data," which on average is a fat service layer with business logic in one monolithic class.

What I check:

The Controller contains no business logic, only mapping and dispatch (see Level 2)
Business rules are concentrated in the Handler, not smeared across several layers
At Level 3 — invariants live in the aggregate, not in the Handler (see Tactical Patterns)
Events are published in the same transaction as the persist (Outbox, not "after save we called publish")
Persistence goes through typed, generated queries; the controller never reaches the store directly through some ad hoc object from the domain layer
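
The Outbox item is the one that is hardest to judge from a diff, so here is what "same transaction" means in practice. A minimal sketch using the standard-library `sqlite3` module; the table layout and the `create_order` and `count` helpers are assumptions made for the illustration, not the team's actual schema:

```python
import json
import sqlite3


def open_db():
    """Create an in-memory database with an orders table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def create_order(conn, order_id, customer_id, event_type="OrderCreated"):
    """Write the order row and its event row atomically: both commit or neither does."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "INSERT INTO orders (id, customer_id) VALUES (?, ?)",
            (order_id, customer_id),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            (event_type, json.dumps({"orderId": order_id})),
        )


def count(conn, table):
    """Row count helper for the two fixed tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

If the outbox insert fails, the order insert is rolled back with it. The "after save we called publish" variant has no such guarantee: the order can be saved while the event is lost.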

Edge cases

The most typical place where AI stumbles. I mentally walk through 4 scenarios:

Null / missing data. What if the store returned "not found"? What if an incoming parameter is missing?
Empty collection. What if items.isEmpty()? Iterating over an empty list is usually harmless, but a count of 0 can lead to a division by zero or to "successfully" creating an empty order.
Concurrent access. What if two requests change the same aggregate at once? Is there protection against a lost update?
An error further down the stack. The payment gateway returned 500. What happens — Retry? Circuit Breaker? Transaction rollback?

If the code has no explicit checks and tests for these 4 spots — the PR needs more work, even if the happy path works.
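
To make "explicit checks" concrete, here is a sketch of three of the four scenarios: not found, empty collection, and lost update. It assumes an in-memory store with optimistic versioning; `InMemoryOrderStore`, `StaleVersionError`, `EmptyOrderError`, and `average_item_price` are hypothetical names invented for the example:

```python
from dataclasses import dataclass


class EmptyOrderError(ValueError):
    """Raised instead of silently dividing by zero or accepting an empty order."""


class StaleVersionError(RuntimeError):
    """Raised when someone else saved the aggregate after we read it."""


@dataclass
class Order:
    order_id: str
    items: list
    version: int = 0


class InMemoryOrderStore:
    def __init__(self):
        self._rows = {}

    def get(self, order_id):
        """Return a copy of the stored order, or None when it does not exist."""
        row = self._rows.get(order_id)
        if row is None:
            return None
        return Order(row.order_id, list(row.items), row.version)

    def save(self, order, expected_version):
        """Save only if the stored version is still the one the caller read."""
        current = self._rows.get(order.order_id)
        current_version = current.version if current else 0
        if current_version != expected_version:
            raise StaleVersionError(
                f"expected version {expected_version}, found {current_version}"
            )
        self._rows[order.order_id] = Order(
            order.order_id, list(order.items), expected_version + 1
        )


def average_item_price(prices):
    """Reject the empty case explicitly rather than hitting ZeroDivisionError."""
    if not prices:
        raise EmptyOrderError("order has no items")
    return sum(prices) / len(prices)
```

Two callers that read the same version cannot both save: the second one gets `StaleVersionError` and has to reload. The reviewer's question is whether the PR has an equivalent guard and a test that triggers it.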

Tests

The trickiest category. AI generates tests that look like tests but don't actually verify behavior. A typical example, shown in Java, Go, JavaScript/TypeScript, and Python:

@Test
void shouldCreateOrder() {
    var order = service.create(uc);
    assertNotNull(order);            // checked that the object isn't null
    verify(repository).save(any()); // checked that the method was called
}

import (
    "context"
    "testing"
)

func TestCreateOrder(t *testing.T) {
    svc, repo := newTestService(t)
    order, err := svc.CreateOrder(context.Background(), CreateOrderInput{
        CustomerID: "c-1",
        Items:      nil,
    })
    if err != nil {
        t.Fatal(err)
    }
    if order == nil { // checked that the object isn't nil
        t.Fatal("expected order, got nil")
    }
    if !repo.saveCalled { // checked that the method was called
        t.Fatal("expected Save to be called")
    }
}

it('should create order', async () => {
    const order = await service.createOrder(input);
    expect(order).not.toBeNull();        // checked that the object isn't null
    expect(repository.save).toHaveBeenCalled(); // checked that the method was called
});

def test_create_order():
    service, repository = build_test_service()
    order = service.create_order(CreateOrderInput(customer_id="c-1", items=[]))
    assert order is not None            # checked that the object isn't None
    repository.save.assert_called()     # checked that the method was called

This is not a test of business logic, it's a "test that the code didn't crash." A real test should verify:

That order has the right field values (matching the input)
That save was called with specific arguments (not any()), and at the right moment (for example, after the balance check)
That the event was published — which one exactly, with which fields
That on invalid input the expected exception is thrown, not just any exception
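
For contrast, here is what a test meeting these four points might look like. This is a sketch against a made-up `OrderService`: the service, its `OrderCreated` event, and the `EmptyOrderError` exception are assumptions for the illustration, and only `unittest.mock` is a real library:

```python
from dataclasses import dataclass
from unittest.mock import Mock, call


@dataclass(frozen=True)
class CreateOrderInput:
    customer_id: str
    items: tuple  # (name, price_in_cents) pairs


@dataclass(frozen=True)
class Order:
    customer_id: str
    items: tuple
    total: int


class EmptyOrderError(ValueError):
    pass


class OrderService:
    """Hypothetical service under test: validate, save, then publish."""

    def __init__(self, repository, events):
        self._repository = repository
        self._events = events

    def create_order(self, data):
        if not data.items:
            raise EmptyOrderError("order must contain at least one item")
        total = sum(price for _, price in data.items)
        order = Order(data.customer_id, data.items, total)
        self._repository.save(order)
        self._events.publish(
            "OrderCreated", {"customerId": order.customer_id, "total": order.total}
        )
        return order


def test_create_order_saves_then_publishes():
    parent = Mock()  # one parent mock records the call order across collaborators
    service = OrderService(parent.repository, parent.events)
    items = (("book", 1200), ("pen", 300))

    order = service.create_order(CreateOrderInput("c-1", items))

    # Field values match the input, not just "is not None".
    assert order == Order("c-1", items, total=1500)
    # Specific arguments, in a specific order: save first, publish second.
    assert parent.mock_calls == [
        call.repository.save(order),
        call.events.publish("OrderCreated", {"customerId": "c-1", "total": 1500}),
    ]


def test_create_order_rejects_empty_items():
    parent = Mock()
    service = OrderService(parent.repository, parent.events)
    try:
        service.create_order(CreateOrderInput("c-1", ()))
    except EmptyOrderError:
        pass  # the expected exception, not just any exception
    else:
        raise AssertionError("expected EmptyOrderError")
    # Nothing was saved or published on the failure path.
    assert parent.mock_calls == []
```

Sharing one parent `Mock` between the repository and the event publisher is what lets the test assert ordering; two independent mocks would only prove that each call happened.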

Imports and dependencies

The most frequent site of AI hallucinations — invented methods and packages. Popular frameworks have "plausible" method names that don't actually exist.

What I check:

All dependencies resolve (an IDE would show this, but AI-generated code often comes straight from a prompt and never passes through an IDE)
The store query methods exist in the version of the library in use
Library versions match those wired into the dependency manifest
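
For Python code these three checks can be scripted with the standard library alone. A minimal sketch, assuming the list of `(module, attribute)` pairs and the pinned versions have already been extracted from the diff and the manifest; `missing_symbols` and `version_mismatch` are hypothetical helper names:

```python
import importlib
import importlib.util
from importlib import metadata


def module_exists(name):
    """True if the module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:  # raised when a parent package is missing
        return False


def missing_symbols(references):
    """Check (module_name, attribute_name) pairs; return a list of problems."""
    problems = []
    for module_name, attr in references:
        if not module_exists(module_name):
            problems.append(f"{module_name}: module not installed")
            continue
        module = importlib.import_module(module_name)
        if not hasattr(module, attr):
            problems.append(f"{module_name}.{attr}: no such attribute")
    return problems


def version_mismatch(package, pinned):
    """Compare the installed distribution version with the pinned one."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    if installed != pinned:
        return f"{package}: installed {installed}, pinned {pinned}"
    return None
```

For example, `missing_symbols([("json", "parse")])` reports a problem, because `json.parse` is a plausible name borrowed from JavaScript that does not exist in Python's `json` module.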

What NOT to review by hand

The main rule: if a machine can check it, a machine checks it. Human attention is expensive; spend it on what the machine can't do.

I hand off to the machines:

Style and formatting — checkstyle, golangci-lint, prettier, ruff
Imports — goimports, organize-imports, isort
Boilerplate — code generation or language features (value objects, equals, toString)
Local variable names — if the scope is ≤ 10 lines, the reviewer's taste isn't critical here; AI writes them fine
Trailing whitespace, line endings — git hooks
Architectural invariants — ArchUnit, dependency-cruiser or equivalent (the controller doesn't call the repository, the core doesn't import infrastructure, etc.)

If the machines already catch this — none of these remarks should appear in the PR review. If they do, configure pre-commit or CI properly; don't burden the human.
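
Where no ArchUnit-style tool is available, a rule like "the controller doesn't call the repository" can be approximated with a few lines of static analysis. A rough sketch using Python's `ast` module; the layer names and the `FORBIDDEN` table are assumptions about project layout, and a real setup would more likely use an existing tool:

```python
import ast

# Hypothetical layering rules: layer name -> package names it must not import.
FORBIDDEN = {
    "controller": ("repository",),
    "core": ("infrastructure",),
}


def forbidden_imports(source, layer, rules=FORBIDDEN):
    """Return imported module names in `source` that break the rule for `layer`."""
    banned = rules.get(layer, ())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            # Match on any dotted segment, e.g. "app.repository.orders".
            if any(part in banned for part in name.split(".")):
                violations.append(name)
    return violations
```

Run over every file in a layer as part of CI, a non-empty result fails the build, and the human reviewer never has to write that comment.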

How to scale review

Volume isn't beaten by "a more attentive reviewer." You need a layered process in which the human joins last, not first.

Layer 1 — pre-commit hooks locally

The developer commits → a local run of the AI skills fires on the specific changed files. If something is violated, the commit is blocked until it's fixed.

Upside: the developer sees the remarks before opening the PR. No round of "reviewer wrote — developer fixes — review again."

Downside: it has to be fast (< 5 seconds). Don't run a full analysis on pre-commit, only the diff.
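
"Only the diff" can be as simple as asking git for the staged files and running checks on nothing else. A minimal sketch of such a hook helper; the `git diff --cached --name-only --diff-filter=ACM` invocation is standard git, while `select_files`, `staged_files`, and `run_checks` are hypothetical names, and each check is assumed to be a callable that takes a path and returns a message or `None`:

```python
import subprocess


def select_files(git_output, suffixes):
    """Keep only the paths that end with one of the given suffixes."""
    return [
        line for line in git_output.splitlines()
        if line.endswith(tuple(suffixes))
    ]


def staged_files(suffixes=(".py",)):
    """List staged files that were added, copied, or modified."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        check=True,
        capture_output=True,
        text=True,
    )
    return select_files(result.stdout, suffixes)


def run_checks(paths, checks):
    """Run every check on every path and collect the failure messages."""
    failures = []
    for path in paths:
        for check in checks:
            message = check(path)
            if message:
                failures.append(f"{path}: {message}")
    return failures
```

A hook script would call `run_checks(staged_files(), checks)` and exit with a non-zero status when the list is non-empty, which is what blocks the commit.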

Layer 2 — AI skills in CI on every PR

When a PR is created — Claude Code skills (ucp-pattern-review, ucp-api-review, ucp-ddd-tactical-review, etc.) run over the diff and comment automatically. Each remark cites the rule code (for example, JS-2.5, BR-C5, R-7) and a link to the methodology article.

This doesn't block the merge — these are recommendations. The team decides what to accept. But the human reviewer no longer writes 80% of the "trivial" remarks — they've either been accepted or reasonably rejected by the developer.

A ready-made set of skills for Use Case Pattern — github.com/remodov/usecase-pattern-skills.

Layer 3 — comparison against the specification

If you keep the Use Case specification in git next to the code — you can automatically check that the code implements what's in the spec. Commands from the "Commands" section map to controllers, business rules from "Business Rules" have tests, events from "Domain Events" are published.

A divergence is a bug caught in the PR review, not a "we'll get to it later."
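
The comparison itself does not need to be sophisticated. A sketch under strong assumptions: the spec is a markdown file whose "Commands" section is a bullet list of command names, and the set of implemented commands has been collected separately (for example, from handler class names). `spec_commands` and `spec_divergence` are hypothetical helpers, and the real Use Case specification format may differ:

```python
import re


def spec_commands(spec_text):
    """Extract command names from the bullet list under a 'Commands' heading."""
    commands = []
    in_section = False
    for line in spec_text.splitlines():
        heading = re.match(r"^#+\s+(.*)$", line)
        if heading:
            in_section = heading.group(1).strip() == "Commands"
            continue
        item = re.match(r"^\s*[-*]\s+(\w+)", line)
        if in_section and item:
            commands.append(item.group(1))
    return commands


def spec_divergence(spec_text, implemented):
    """Compare declared commands with implemented ones, in both directions."""
    declared = set(spec_commands(spec_text))
    implemented = set(implemented)
    return {
        "unimplemented": sorted(declared - implemented),
        "undeclared": sorted(implemented - declared),
    }
```

Both directions matter: a command in the spec with no handler is missing work, and a handler with no command in the spec is behavior nobody agreed on.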

Layer 4 — human review

By this point, what reaches the human is already:

Code cleaned of style and formatting issues
With methodology rules checked automatically
With tests that passed a first AI review
With conformance to the spec

The human focuses on what the machine can't do yet: global consistency (does the solution fit the architecture as a whole), business meaning (is this actually what the business needs?), trade-offs (was the best of the options chosen?).

10 minutes on a serious PR in this mode is not "read 800 lines and said OK," but a focused discussion of three or four architectural points.

Anti-patterns of reviewing AI code

They arise naturally — nobody teaches the opposite, and under load the team slides toward convenient shortcuts.

"Everything works — let's merge." Tests are green, it runs locally — so it's OK. In reality only the happy path was checked. Edge cases, invariants, concurrent scenarios — nobody looked. A time bomb.

"I don't know this chunk, I trust the AI." AI wrote an integration with a new library, the reviewer doesn't know the library, merges on trust. A month later it turns out AI invented half the API, and it only works in the happy-path test — which AI also wrote. The most dangerous pattern — it grows in proportion to the model's power.

"Test later." AI generates the code, the developer is rushing to ship and defers the test. By the time they come back, the context is forgotten, so they write the test against what the code currently does, not against what it should do. The test becomes a regression test that locks the bugs in instead of guarding against them.

"AI also refactored along the way." AI got the task "add a field to the DTO" and while it was at it rewrote three neighboring classes "to make them prettier." The PR is a 600-line diff, 20 lines of which relate to the task. The reviewer either spends 2 hours or merges as is. Fix it at the prompt stage — teach the team to give AI narrow tasks.

"One big PR per sprint." AI writes fast, so the temptation is to roll a whole feature into one PR. The sheer volume breaks the review cycle. Split it up the way you would with hand-written code: one logical unit = one PR.

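The last two anti-patterns can be flagged automatically before a human opens the PR. A sketch that parses the output of `git diff --numstat` (tab-separated added lines, deleted lines, and path, with `-` for binary files); the 400-line limit and the idea of declaring scope as path prefixes are assumptions for the example, not rules from the methodology:

```python
def parse_numstat(numstat_output):
    """Parse `git diff --numstat` output into (path, changed_lines) pairs."""
    changes = []
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        added, deleted, path = parts
        if added == "-" or deleted == "-":  # binary files have no line counts
            continue
        changes.append((path, int(added) + int(deleted)))
    return changes


def review_flags(numstat_output, scope_prefixes, max_lines=400):
    """Flag oversized diffs and changes outside the declared task scope."""
    changes = parse_numstat(numstat_output)
    total = sum(lines for _, lines in changes)
    flags = []
    if total > max_lines:
        flags.append(f"diff is {total} lines, limit is {max_lines}")
    for path, lines in changes:
        if not path.startswith(tuple(scope_prefixes)):
            flags.append(f"{path}: {lines} changed lines outside the declared task scope")
    return flags
```

On the "add a field to the DTO" example, a diff with 20 lines in the DTO and 580 lines in neighboring classes produces both flags, and the conversation about narrowing the prompt happens before the reviewer spends two hours.
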
Printable checklist

Drop this into the team's docs/code-review-ai.md:

## Before opening a PR (developer)
- [ ] Pre-commit run of AI skills passed with no blocking remarks
- [ ] All edge cases (null, empty collection, concurrent access, error) have tests
- [ ] The test verifies BEHAVIOR, not "didn't crash"
- [ ] The PR contains exactly one logical unit — no "refactored along the way"

## During review (reviewer)
- [ ] The contract at the seams matches the spec
- [ ] The Controller contains no business logic
- [ ] Transactions are placed where they belong, and not where they don't
- [ ] Persistence goes through typed queries, not direct access from the controller
- [ ] Events are published in the same transaction as the save
- [ ] At Level 3 — invariants in the aggregate, not in the Handler
- [ ] Dependencies resolve, versions match (no invented API)
- [ ] Tests verify specific values, not any() / notNull
- [ ] AI skills passed in CI
- [ ] Comparison against the specification matched (if applicable)

## Before merge
- [ ] All blocking remarks are closed
- [ ] Non-blocking ones — accepted or explicitly justified
- [ ] Architecture tests are green

17 items, split into three phases: before the PR, during review, before merge. The team sees the checklist and knows what's expected of them.

What's next

AI tools are a throughput boost. Without a process that can handle that volume, a growing team runs into accumulating code-quality problems. The combination of methodology + spec-as-code + AI skills for review holds the quality bar as volume grows.