Planner — code got cheap to write, not to trust

The problem

The bottleneck is no longer “writing”. It's “believing”.

Generation dropped to near-zero cost. Verification didn't. 61% of developers say AI often produces code that looks right but isn't reliable; 38% say reviewing AI code takes more effort than reviewing a human colleague's. Amazon CTO Werner Vogels called this verification debt: the checking debt you pay by hand after every “done”.

Sources: Sonar — State of Code Developer Survey (96%, 61%, 38%) · IT Pro — “verification debt”, Werner Vogels (AWS re:Invent)

The agent reports a “done” it can't back up.

Tests are green, but they check a stub, not behavior. The function you asked for is still a TODO under a tidy progress report. The agent generates completion language with the same engine as the code, regardless of what's actually on disk. And the same party both proves the work and accepts it. There's no independent check in the loop.

The solution

A contract of criteria. And an independent check.

Planner puts a verifiable loop of four steps between “the agent said so” and “the task is closed”:

You frame a goal with acceptance criteria

Verifiable against the result, not taken on the agent's word: exactly what must become true for the task to count as closed.

You hand that goal to Claude as input

The agent gets not a vague request but a goal with explicit criteria — and knows what its closure will be checked against.

The agent attaches evidence to every criterion

Artifacts of the result — not a report about the work done.

→ the agent proves

The judge in Planner verifies and rules

A separate session with no access to the agent's reasoning: it judges by the attached artifacts, not by the agent's report. It matches the evidence against each criterion on substance and accepts the closure or rejects it with a reason.

→ the judge checks

The agent proves. The judge checks. “Done” stops being the agent's word and becomes a verified fact — without the verification debt you used to pay by hand.
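The four steps above can be sketched as a small data model. This is a minimal illustration under assumed names, not Planner's actual API: `Criterion`, `Evidence`, `Verdict`, `close_task`, and `toy_judge` are all hypothetical, and the toy judge is a plain string match standing in for the real judging session. What the sketch shows is the separation of roles: the judge receives only the criterion and the attached artifacts, never the agent's report.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion (hypothetical shape)."""
    id: str
    statement: str   # what must become true
    expected: str    # text the toy judge looks for in the evidence


@dataclass(frozen=True)
class Evidence:
    """An artifact of the result, e.g. a screenshot (hypothetical shape)."""
    criterion_id: str
    artifact: str    # file name of the artifact
    observed: str    # what the artifact actually shows


@dataclass(frozen=True)
class Verdict:
    criterion_id: str
    accepted: bool
    reason: str      # quotes the evidence that was read


# The judge signature has no parameter for the agent's report or reasoning.
Judge = Callable[[Criterion, List[Evidence]], Verdict]


def toy_judge(criterion: Criterion, attached: List[Evidence]) -> Verdict:
    """Stand-in judge: accepts only if some artifact shows the expected text."""
    if not attached:
        return Verdict(criterion.id, False, "no evidence attached")
    for e in attached:
        if criterion.expected in e.observed:
            return Verdict(criterion.id, True, f"{e.artifact}: shows '{criterion.expected}'")
    seen = "; ".join(f"{e.artifact}: {e.observed}" for e in attached)
    return Verdict(criterion.id, False, f"'{criterion.expected}' not found ({seen})")


def close_task(criteria: List[Criterion], evidence: List[Evidence],
               judge: Judge) -> Tuple[bool, List[Verdict]]:
    """Close the task only if the judge accepts every criterion."""
    verdicts = [
        judge(c, [e for e in evidence if e.criterion_id == c.id])
        for c in criteria
    ]
    # A task with no criteria is never closed: there is nothing to verify.
    return bool(verdicts) and all(v.accepted for v in verdicts), verdicts


criterion = Criterion(
    id="onboarding-progress",
    statement="After the first action the user is shown step 1 of 3 progress",
    expected="step 1 of 3",
)
evidence = [Evidence("onboarding-progress", "after-first-action.png", "no progress indicator")]
closed, verdicts = close_task([criterion], evidence, toy_judge)
print(closed, verdicts[0].reason)
```

In this sketch the agent's “done” never enters `close_task` at all; the only way to flip `closed` to true is to attach an artifact that satisfies the criterion.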

When “done” isn't done

The agent said “done.” The judge read the screenshot.

From a live demo run. The agent reported the onboarding step finished and attached a screenshot as its proof. The judge ruled on the screenshot itself — not on the report.

onboarding · New-user onboarding

What the agent claimed

✓ Done — after the first action the user is shown "step 1 of 3" progress.

What the judge found in the evidence

after-first-action.png

screenshot · the promised progress bar is absent

✗ judge: no progress indicator in the screenshot. Closure not accepted.

Every verdict quotes the evidence it read. Check it yourself.

How “done” gets earned

One criterion, judged until it's earned.

That rejected card is iteration 1. Here's the same onboarding criterion run through the loop again and again — every attempt judged on the visible result, and the count of satisfied conditions only moves one way.

Iteration 1 · 0 / 4 conditions

After first action

no progress bar at all

✗ rejected. No progress indicator in the screenshot — nothing to verify against the criterion.

Iteration 2 · 1 / 4 conditions

After first action

step 0 of 3

✗ rejected. A bar renders — but it's stuck at “step 0 of 3” after the first action. It doesn't advance.

Iteration 3 · 2 / 4 conditions

After first action

step 1 of 3

✗ rejected. It advances now — but the bar is 100% full at step 1 of 3. The fill isn't proportional.

Iteration 4 · 3 / 4 conditions

After first action

Step 1 of 3

↻ after reload: empty

✗ rejected. A correct 1/3 bar — but the reload screenshot shows it reset to empty. Progress doesn't persist.

Iteration 5 · 4 / 4 conditions

After first action

Step 1 of 3

↻ after reload: kept · aria-valuenow=1

✓ accepted. 1/3 proportional, labelled “Step 1 of 3”, persists across reload, exposes aria-valuenow=1 — all four conditions met.

Five attempts, one criterion — the satisfied-condition count never goes backward. That ratchet is the point: every accepted “done” is one the evidence had to earn.
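The ratchet can be expressed as a simple check over the per-iteration counts. A minimal sketch, assuming the satisfied-condition counts from the five attempts above are recorded as integers out of four; `is_ratchet` and `first_accepted` are hypothetical helper names, not part of Planner.

```python
TOTAL_CONDITIONS = 4

# Satisfied-condition counts for iterations 1 through 5 of the run above.
counts = [0, 1, 2, 3, 4]


def is_ratchet(counts):
    """True if the satisfied-condition count never goes backward."""
    return all(later >= earlier for earlier, later in zip(counts, counts[1:]))


def first_accepted(counts, total):
    """1-based iteration at which every condition is met, or None if never."""
    for iteration, satisfied in enumerate(counts, start=1):
        if satisfied == total:
            return iteration
    return None


print(is_ratchet(counts), first_accepted(counts, TOTAL_CONDITIONS))
```

A run such as `[0, 2, 1]` would fail `is_ratchet`: a previously satisfied condition regressed, which is exactly what the loop is meant to rule out.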

Everyone working with agents sees this

Ask one to add authentication to your project and it'll tell you it's done. Commits made, tests passing, middleware wired up. Check the branch and you'll find a half-written JWT helper, no tests, and a build that doesn't compile.
— Brad Kinnard, “AI coding agents lie about their work”

Stop paying verification debt

Set the criteria once, then let the agent prove while the judge checks. See, in a finished example, how “done” becomes proven.

Early access. Planner is used in its own development and runs every one of its own tasks through the same loop, so the demo example is real, not mocked up.

See how it works →

Interactive demo — no signup