Notes

This demo script explains how an agentic development product should present its value.

The point is not to perform "AI writes code quickly." The point is engineering flow: task input, context reading, change execution, verification feedback, and a clear handoff back to the human.

Demo rhythm.

The first half shows how a small change is decomposed. The second half shows how browser checks, tests, and the final report become a trustworthy delivery.

Make the workflow clear with numbers and contrasts.

Talking about "engineering rigor" in the abstract isn't convincing, so for this part I use one real, small change as a yardstick and lay out the scale of all four stages.

6 files changed (incl. 2 tests)
14 context reads (files + terminal + browser)
9 checks run (roughly half lint, half unit tests)

Profiling the scale of one mid-sized change

Beyond scale, what's more worth looking at is where the time goes. Line up the four stages side by side and you'll see that the real time sink is not "writing code" but reading context and verifying.

Task input              6 s   human-written
Context reading        22 s   heaviest
Change execution       11 s
Verification feedback  18 s   running tests

Time per stage for one task (median, in seconds)
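To make the "where the time goes" point concrete, here is a minimal sketch that turns the stage timings above into shares of the total. The function names are my own, and adding up per-stage medians only approximates the median of a whole task, so treat the percentages as a rough illustration.

```python
# Median seconds per stage, copied from the chart above.
STAGE_SECONDS = {
    "Task input": 6,
    "Context reading": 22,
    "Change execution": 11,
    "Verification feedback": 18,
}


def stage_shares(seconds):
    """Return each stage's share of the summed time, in percent (one decimal)."""
    total = sum(seconds.values())
    return {stage: round(100 * s / total, 1) for stage, s in seconds.items()}


def combined_share(seconds, stages):
    """Return the share of the summed time taken by the given stages, in percent."""
    total = sum(seconds.values())
    return round(100 * sum(seconds[s] for s in stages) / total, 1)


if __name__ == "__main__":
    for stage, pct in stage_shares(STAGE_SECONDS).items():
        print(f"{stage:<22}{pct:>6}%")
    reading_and_verifying = combined_share(
        STAGE_SECONDS, ["Context reading", "Verification feedback"]
    )
    print(f"Reading + verifying: {reading_and_verifying}%")
```

With these numbers, reading context and verifying together take roughly 70% of the summed time, while change execution takes about 19%.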

Not every task should run the same flow. Sort them by risk and verifiability and you'll know which ones you can let go of and which ones need a human watching.

High risk · high verifiability: Tests have your back. Change freely, verify to close out.
High risk · low verifiability: Add observability first; don't rush to touch it.
Low risk · high verifiability: Automate first; hand it off to the Agent in bulk.
Low risk · low verifiability: Docs and renames. A quick human glance is enough.
Task classification: risk × verifiability
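The grid above can be sketched as a simple lookup, which is handy if you want to show the routing decision on screen during the demo. This is only an illustration: the `Level` enum, the `POLICY` table, and the `classify` function are hypothetical names, and the policy strings paraphrase the four quadrants.

```python
from enum import Enum


class Level(Enum):
    LOW = "low"
    HIGH = "high"


# Handling policy per (risk, verifiability) quadrant, paraphrasing the grid above.
POLICY = {
    (Level.HIGH, Level.HIGH): "change freely; tests verify and close out",
    (Level.HIGH, Level.LOW): "add observability first; hold the change",
    (Level.LOW, Level.HIGH): "automate; hand off to the agent in bulk",
    (Level.LOW, Level.LOW): "quick human glance",
}


def classify(risk: Level, verifiability: Level) -> str:
    """Look up how a task in the given quadrant should be handled."""
    return POLICY[(risk, verifiability)]


if __name__ == "__main__":
    for (risk, verifiability), policy in POLICY.items():
        print(f"{risk.value} risk, {verifiability.value} verifiability: {policy}")
```

How a task gets its risk and verifiability levels is left out on purpose; in the demo that judgment still comes from a person.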

After classification comes choosing the model: the task's difficulty decides which tier to use, rather than reaching for the most expensive one every time.

Model pricing (as of 2026-06)

Model       Use case               Input   Cached   Output   Context
codex-mini  low-risk batch           2.0      0.5      8.0      256K
codex-std   everyday changes         9.0      2.2     36.0      512K
codex-pro   cross-file refactors    22.0      5.5     88.0        1M

Prices in ¥ per million tokens. Match the model to task difficulty.
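To show why matching the tier matters, here is a rough cost sketch built from the price table above. Two things are assumptions of mine and not part of the pricing: that cached tokens are the portion of the input billed at the cached rate, and the token counts in the example, which are invented for illustration. The `Tier`, `TIERS`, and `estimate_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    input_price: float   # ¥ per million uncached input tokens
    cached_price: float  # ¥ per million cached input tokens
    output_price: float  # ¥ per million output tokens


# Prices copied from the table above, keyed by the use case listed for each model.
TIERS = {
    "low-risk batch": Tier("codex-mini", 2.0, 0.5, 8.0),
    "everyday changes": Tier("codex-std", 9.0, 2.2, 36.0),
    "cross-file refactors": Tier("codex-pro", 22.0, 5.5, 88.0),
}


def estimate_cost(tier, input_tokens, cached_tokens, output_tokens):
    """Estimate cost in ¥, assuming cached_tokens is the cached part of input_tokens."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached = input_tokens - cached_tokens
    total = (
        uncached * tier.input_price
        + cached_tokens * tier.cached_price
        + output_tokens * tier.output_price
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical task: 200K input tokens (150K of them cached), 10K output tokens.
    for use_case, tier in TIERS.items():
        cost = estimate_cost(tier, 200_000, 150_000, 10_000)
        print(f"{tier.name:<11}{use_case:<22}¥{cost:.3f}")
```

For that made-up task, the same work costs about ¥0.26 on codex-mini, ¥1.14 on codex-std, and ¥2.81 on codex-pro, which is the contrast worth saying out loud.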

Models keep iterating, and so does this flow. Here are a few of the milestones that brought it to where it is today.

  1. 2026-02: read-only context (understand first, then speak)
  2. 2026-04: replayable evidence chain (terminal + browser join the flow)
  3. 2026-06: verification-driven delivery (current)

How the workflow's capabilities evolved

One sentence to wrap up this section.

Stitch the whole flow together and it's about pushing one task all the way to a verifiable change.

Task → Verifiable change
Task → change: a verifiable development flow