After the agent, before the review: where AI code tooling misses the moment

A category argument for pre-merge integrity checks on agent-written code, and the eight-criterion proof-of-usefulness methodology we used to test our rules against three production-shape open-source projects.

Here is a small reading list of things real AI coding agents have produced for real engineers in the last few weeks:

from incidentlib.ai import summarize
panic("TODO: implement privileged container enforcement")
// @ts-ignore
return schema.parse(input);
try:
    session.commit()
except:
    pass

None of these look wrong. Each one passes the visual review a tired engineer does on a Tuesday afternoon. Each one survives the existing toolchain. And each one is the kind of mistake an AI coding agent is more likely to produce than a human.

incidentlib does not exist on PyPI. The privileged container check ships with a panic in its hot path. The type error the agent could not solve was suppressed instead of fixed. The bare except swallows the database failure that costs you customers next quarter.

We built Shipmoor Community CLI for the moment between “the agent finished” and “I’m asking a human to review this.” This post is the longer argument for why that moment is a category, not just a feature, and how we tested whether our solution is actually useful before we called it shipped.

The shape of the gap

Modern development workflows have a lot of tools at each phase. Walk through a day:

You write code or accept an agent diff. Your editor formats it, your linter underlines unused imports, your type checker tells you about a mismatched return. You commit. A pre-commit hook runs more linters and maybe a quick test. You push. CI runs your full test suite, a vulnerability scan against your dependency manifest, maybe a SAST job that takes 15 minutes. You open a PR. A code-owner bot pings reviewers. A reviewer reads the diff. You merge.

Every tool above is solving a real problem at a real moment. Together they cover style, syntax, types, tests, vulnerabilities, exploits, ownership, taste. What none of them are built to ask is: did the agent that wrote this make the kind of mistake an agent makes?

That moment, right after the agent generates and right before a human reviews, has different characteristics than any of the moments around it:

  • The code compiles. The agent doesn’t ship broken syntax. Linters and type checkers have less to say.
  • The dependencies are not CVEs. The agent invented a package; vulnerability scanners check whether real packages are dangerous, not whether the package even exists.
  • The defect is structural. Bare except. Empty function body. as any over a real type error. None of these are security holes you can pattern-match with semgrep rules built for SQL injection.
  • The author is fast. The agent is willing to write ten thousand lines without slowing down. A reviewer who treats every diff as a generated diff is going to burn out in a week.
  • The reviewer is a finite resource. If the agent’s mistakes are predictable, the cost of catching them before the reviewer touches the diff is low.

The tools we already have were built for code humans write. They are excellent at that job. The defects we are talking about are different because the author is different.

What the existing toolchain is actually for

It is worth being specific. Each tool below is excellent at what it does. Each one is the wrong shape for this moment.

Linters. ruff, ESLint, gofmt, Prettier, golangci-lint. They fire on every save and they catch style, formatting, simple anti-patterns. They will tell you import sys is unused. They will not tell you from incidentlib.ai import summarize does not resolve because incidentlib does not exist on PyPI. That is not the question a linter is built to answer.

Type checkers. mypy, pyright, tsc, the Go compiler. They check whether the types the code claims are internally consistent. When the agent writes // @ts-ignore over a real type error, the type checker dutifully ignores the error. The type checker is being honest; the suppression is the defect.

Vulnerability scanners. pip-audit, npm audit, govulncheck, Dependabot, Snyk, Renovate. They cross-reference your declared dependencies against CVE databases and reach out when something needs patching. They have nothing to say about an undeclared dependency. If the agent’s import does not appear in your requirements.txt or package.json, the vulnerability scanner has no input.

SAST and code-scanning. semgrep, CodeQL, Snyk Code. These find security defects: injection, traversal, deserialization, broken auth. They are excellent at that. They are not built to ask “is this function body empty” or “does this Go module path resolve.”

PR-comment bots. GitHub Copilot Code Review, Sourcery, Reviewpad, the hundred internal “AI reviewer” bots. They fire during review, on the PR. They are competing for the same attention as the human reviewer. The whole point of pre-merge integrity checks is to keep agent-shaped defects from getting that far.

Code formatters. gofmt, black, Prettier. Cosmetic only. Out of scope.

There is genuine overlap at the edges. Type checkers and linters will catch some of the obvious agent failures. Vulnerability scanners will eventually catch a hallucinated package if you somehow added it to your manifest. But the moment they fire is not the moment the agent finished. The defect class they target is not the defect class agents specialize in. The cost of running the heavy ones is not compatible with “I want to know in under a second whether this diff is worth sending to a human.”

The gap is real. It has a shape. The shape is: a fast, local, diff-scoped check whose rule set is built around the failure modes of generated code. Nothing in the current toolchain is built to occupy that shape.

Our angle

We wrote a free local command-line tool. We made three deliberate constraints.

Scope to the diff, not the tree. Full-tree scans on mature codebases are noisy: every old as any in a generic library, every pass in an alembic downgrade(), every fmt.Println in a CLI tool that legitimately prints. The first time an engineer runs Shipmoor on their own production repo, they should not see hundreds of pre-existing findings against code the agent didn’t touch. We made --changed the default workflow. We added --diff <range> for CI and --patch <file> for “an agent handed me a diff I haven’t applied yet.”
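To make the idea of diff scoping concrete, here is a minimal sketch that parses a unified diff and keeps only the findings that land on added lines. It is an illustration, not Shipmoor's implementation: `added_lines`, `in_scope`, and the finding dictionaries with `path` and `line` keys are hypothetical names chosen for this example.

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,3 +12,5 @@".
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(patch_text: str) -> dict[str, set[int]]:
    """Map each changed file to the new-side line numbers the patch adds."""
    changed: dict[str, set[int]] = {}
    current_file = None
    new_lineno = 0
    in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section begins
            continue
        if line.startswith("+++ "):
            # Simplification: an added line whose content starts with "++ "
            # would be misread as a file header here.
            path = line[4:].strip()
            # git prefixes new-side paths with "b/"; deletions use /dev/null.
            current_file = None if path == "/dev/null" else path.removeprefix("b/")
            in_hunk = False
            continue
        match = HUNK_RE.match(line)
        if match:
            new_lineno = int(match.group(1))
            in_hunk = True
            continue
        if not in_hunk or current_file is None:
            continue
        if line.startswith("+"):
            changed.setdefault(current_file, set()).add(new_lineno)
            new_lineno += 1
        elif line.startswith("-") or line.startswith("\\"):
            continue  # removals and "\ No newline" markers do not advance
        else:
            new_lineno += 1  # context line
    return changed


def in_scope(findings: list[dict], changed: dict[str, set[int]]) -> list[dict]:
    """Keep only findings reported on lines the patch actually added."""
    return [f for f in findings if f["line"] in changed.get(f["path"], set())]
```

Filtering after analysis is only one possible design; a scanner could equally restrict which files it parses in the first place. The sketch just shows why pre-existing findings on untouched lines drop out.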

Catalog the defects agents make. We do not try to find every bug. We try to find these bugs, repeatably, across Python, TypeScript, JavaScript, and Go:

  • Phantom imports. Hallucinated packages. Real packages used without being declared. Local files that do not exist. Local modules that do not match any source root.
  • Placeholder logic. pass bodies. Ellipsis bodies. throw new Error("not implemented"). panic("TODO"). Constant returns where the agent gave up.
  • Trust suppression. any at exported boundaries. as any casts. @ts-ignore / @ts-expect-error over real type errors.
  • Quality signals. Bare except. Mutable defaults. fmt.Print and console.log left in production code. Overlong “god function” bodies.
  • Control flow. Unreachable statements after return or throw.

Thirty rules across four languages. Each one targets a failure mode we have personally watched coding agents produce in real PRs.

Be honest about what we cannot do, by design. No account. No telemetry. No source upload. No shell-rc edit during install. No background daemon. One binary you run on demand. The only outbound network call is an optional PyPI/npm registry lookup to tell a hallucinated package apart from a real-but-undeclared package, and SHIPMOOR_OFFLINE=1 turns it off.

We also deliberately do not do a lot of things other tools do better: we are not a style linter, not a vulnerability scanner, not a SAST tool. The conviction is that a focused tool with a small, specific catalog of defects is more useful in this particular moment than a general-purpose analyzer.

How we measured whether it actually works

Building a tool that emits findings is easy. Building one a working developer would choose to run again tomorrow without being told to is harder. We put the question in writing before we shipped:

Is the Shipmoor Community CLI useful enough that a working developer would run it again tomorrow without being asked?

We then turned that question into a measurable protocol. Eight pass criteria, six stages, three real production-shape open-source projects. The full PRD is one document. The two run summaries (original at v0.1.1, re-run at v0.2.0 after fixes landed) are two more. They live in the repo. Anyone can re-run the protocol against the same projects with the same script and get the same numbers.

The corpora

We did not test against fixtures we wrote ourselves. We cloned three real projects at --depth 1:

| Language | Project | Why this one |
| --- | --- | --- |
| Go | Sonlis/kube-admission-controller | Real Kubernetes admission controller. Small focused codebase, real k8s.io/* dependencies, real go.mod, real deployment YAML. |
| Python | echoboomer/incident-bot | Real multi-service incident-management bot. FastAPI-shaped backend, React frontend, alembic migrations, dockerfile, the works. ~68 Python files. |
| TypeScript | konstructio/metaphor | Real Next.js reference app from the kubefirst team. Production-shape pages/, components/, redux/, with tsconfig.json, GitLab CI, ArgoCD workflows. |

The methodology made one further commitment: for each project, plant exactly one agent-style file that an actual agent might plausibly produce. Then run the protocol against that staged change. The point was to test the changed-mode workflow we pitch on, against real code, with a defect every reviewer would recognize as agent-style.

The eight criteria

The PRD listed eight pass criteria. We committed to all eight before running the protocol. We did not move the goalposts after the numbers came back.

  1. Time to first scan under 60 seconds, no docs required.
  2. No crashes, no tracebacks, no partial output.
  3. Determinism: identical fingerprints across consecutive runs.
  4. Planted defects caught: at least 80% across the three projects.
  5. Stage 2 (changed-mode) actionable rate at least 60%.
  6. Stage 2 noise rate at most 15%.
  7. Patch ↔ changed parity on the same logical change.
  8. No P0 friction items.

The interesting number was always going to be the noise rate. A linter that flags everything is not useful. A linter that flags nothing is not useful. The hard problem was finding the rule shape that catches the planted defects while staying quiet on the legitimately-different patterns mature codebases use.
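The arithmetic behind criteria 5 and 6 is simple enough to write down. The sketch below assumes each Stage 2 finding has been hand-triaged with a label; the label strings and function names are hypothetical, and the counts in the usage note are made up for illustration rather than taken from the run artifacts.

```python
from collections import Counter


def stage2_rates(labels: list[str]) -> tuple[float, float]:
    """Return (actionable_rate, noise_rate) for a list of triage labels."""
    total = len(labels)
    if total == 0:
        return 0.0, 0.0
    counts = Counter(labels)
    return counts["actionable"] / total, counts["noise"] / total


def passes_stage2(actionable_rate: float, noise_rate: float) -> bool:
    """Criterion 5: actionable >= 60%. Criterion 6: noise <= 15%."""
    return actionable_rate >= 0.60 and noise_rate <= 0.15


# Hypothetical triage: 11 actionable findings and 2 noisy ones.
labels = ["actionable"] * 11 + ["noise"] * 2
actionable_rate, noise_rate = stage2_rates(labels)
print(f"{actionable_rate:.1%} actionable, {noise_rate:.1%} noise")
print("pass" if passes_stage2(actionable_rate, noise_rate) else "fail")
```

With these made-up counts the noise rate is 2/13, just over the 15% ceiling, so a strict comparison fails. That shows how few noisy findings it takes to miss criterion 6 when the total is small.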

The original run was not pretty

We ran the protocol against v0.1.1 of the CLI. The verdict, copy-pasted from proof-of-usefulness/summary.md:

Answer: Not yet. Conditionally yes after four P0 fixes.

Stage 2 (changed-mode) was acceptable: 85% actionable rate, 15.4% noise rate, all planted defects caught except one (the Go phantom-module rule did not exist yet). Stage 1 (full-tree) was unusable: 863 findings on incident-bot (~97% noise), 18 findings on metaphor (~94% noise). Patch mode failed the parity criterion on metaphor: scanning a unified diff produced two extra high-severity blocks against import React from 'react' and ../typography that changed-mode correctly resolved.

The friction inventory recorded four P0 items:

  1. Python phantom_import did not consult requirements.txt or pyproject.toml. Every declared dep showed up as a high-severity phantom block.
  2. TypeScript / JavaScript phantom_dependency did not try .ts / .tsx / .js / .jsx / /index.* extensions when resolving relative imports. Every import './foo' whose target is foo.tsx looked phantom.
  3. The control_flow.unreachable_code analyzer treated a return inside an inner arrow function as terminating the module. Every export default Component; after a React component flagged as unreachable.
  4. Patch mode did not consult the working directory’s package.json when resolving imports inside the patch, so a normal Next.js patch blocked on import React from 'react'.
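The second item in that list is a candidate-expansion problem. A minimal sketch of the fix, assuming the resolver is handed the set of files in the repository (the function name, the `files` parameter, and the example paths are hypothetical):

```python
import posixpath
from typing import Optional

# Extensions tried for a bare specifier, as file suffixes and as index files.
EXTENSIONS = (".ts", ".tsx", ".js", ".jsx")


def resolve_relative_import(
    importer: str, specifier: str, files: set[str]
) -> Optional[str]:
    """Resolve a relative import such as './foo' to a known repository file."""
    base = posixpath.normpath(
        posixpath.join(posixpath.dirname(importer), specifier)
    )
    candidates = [base]
    candidates += [base + ext for ext in EXTENSIONS]
    candidates += [posixpath.join(base, "index" + ext) for ext in EXTENSIONS]
    for candidate in candidates:
        if candidate in files:
            return candidate
    return None  # nothing matched: only now is the import a phantom candidate


files = {
    "src/pages/home.tsx",
    "src/components/Button.tsx",
    "src/typography/index.ts",
}
print(resolve_relative_import("src/pages/home.tsx", "../components/Button", files))
print(resolve_relative_import("src/pages/home.tsx", "../typography", files))
```

The candidate order here is a guess at a sensible default; a resolver that honors tsconfig.json path aliases or package.json exports fields would need more than this.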

We filed every one of these as a public backlog ticket. We filed every P1 and P2 finding as well: 17 tickets total, sorted by priority, each one with problem statement, evidence, required fix, and acceptance criteria. We did not paper over the original run.

The re-run did the work

After the four P0 fixes landed, we re-ran the exact same protocol against the same three projects with the same planted scenarios. Same script. Same artifacts directory. Different verdict.

| Criterion | v0.1.1 baseline | v0.2.0 re-run |
| --- | --- | --- |
| 1: Time to first scan | Pass (all three under 0.5 s) | Pass |
| 2: No crashes / partial output | Pass (2 cosmetic warnings) | Pass (warnings gone) |
| 3: Determinism | Pass | Pass |
| 4: Planted defects caught (≥ 80%) | Pass (11/12, 92%) | Pass (13/13, 100%) |
| 5: Stage 2 actionable rate (≥ 60%) | Pass (85%) | Pass (100%) |
| 6: Stage 2 noise rate (≤ 15%) | Borderline (15.4%) | Pass (0%) |
| 7: Patch ↔ changed parity | Fail (metaphor +2 FPs) | Pass (identical fingerprints on all three) |
| 8: No P0 friction items | Fail (4 P0 items) | Pass on the original four |

The numbers that matter most for the question we asked:

  • 13 of 13 planted agent-style defects caught in changed-mode across the three projects.
  • 0% noise rate in changed-mode. Every Stage 2 finding mapped to an actual planted defect.
  • metaphor full-tree scan: 18 findings (94% noise) became 5 findings (0% noise). Every finding is on the planted file.
  • incident-bot full-tree scan: 863 findings became 615, and then with the follow-up nested-manifest fix (P0-16) became 72 findings in our smoke test. That is a 92% reduction from the original 863; the residual is mostly genuinely missing manifest entries and real broken paths, not false positives.
  • kube-admission-controller changed-mode: 3 findings became 5 because the new go.phantom_import rule fired twice on the planted file. Once for the hallucinated github.com/example/notreal/policy. Once for k8s.io/client-go/kubernetes, which is a real package the planted “agent” imported without adding to go.mod (exactly the mistake we want to catch).

The re-run did one more thing: it surfaced new problems that the original noise had been masking. A nested-manifest discovery bug. A handful of false positives on commented-out imports. Both became new public backlog tickets. The point of the protocol was never to declare victory; it was to make the work visible.

Three live runs you can replicate

The launch post walks through three live changed-mode scans against well-known projects: Flask, Zod, and Cobra. Read it for the full terminal output. The relevant shapes:

Flask. Five findings on the planted change. Two of them are python.phantom_import at high severity, and crucially they carry different subtypes: incidentlib is flagged as hallucinated_package (“does not exist on PyPI”). sqlalchemy is flagged as missing_manifest_entry (“imported but not declared in requirements.txt or pyproject.toml”). The agent invented one and borrowed the other. The reviewer can tell at a glance which is which.

Zod. Four findings on the planted change. The hallucinated fake-zod-extra is flagged. The legitimate relative import ./index (Zod’s own root) is not flagged. Across 406 sibling TypeScript files in the monorepo, the only findings are on the file the planted “agent” wrote. The monorepo discovery walks the 15 nested package.json files Zod’s workspace structure produces and uses each one in its own resolution context.
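Per-package resolution contexts in a monorepo usually reduce to "find the nearest manifest above this file." A minimal sketch under that assumption, with hypothetical names and paths, where the repository root is represented by the empty string:

```python
import posixpath
from typing import Optional


def nearest_manifest_dir(file_path: str, manifest_dirs: set[str]) -> Optional[str]:
    """Walk upward from a file to the closest directory holding a package.json.

    `manifest_dirs` is the set of repo-relative directories known to contain a
    manifest; "" stands for the repository root.
    """
    current = posixpath.dirname(file_path)
    while True:
        if current in manifest_dirs:
            return current
        if current == "":
            return None  # reached the root without finding any manifest
        current = posixpath.dirname(current)


manifest_dirs = {"", "packages/core", "packages/docs"}
print(repr(nearest_manifest_dir("packages/core/src/index.ts", manifest_dirs)))
print(repr(nearest_manifest_dir("scripts/build.ts", manifest_dirs)))
```

Returning `None` rather than guessing gives the caller a place to report a degraded mode instead of flagging every import in the file.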

Cobra. Four findings on the planted change. go.phantom_import resolves the hallucinated github.com/example/notreal/completion against go.mod, the standard library, vendored paths, and the current module path. No go mod download required. No module-cache hit. No network call.
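An offline check of this kind can be approximated with a go.mod parser and a prefix match. The sketch below is a simplified illustration with hypothetical function names: it covers the require list, the current module path, and a standard-library heuristic, and it deliberately ignores vendored paths and replace directives.

```python
def parse_go_mod(text: str) -> tuple[str, set[str]]:
    """Extract the module path and required module paths from go.mod text."""
    module = ""
    requires: set[str] = set()
    in_block = False
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("module "):
            module = line.split()[1]
        elif line == "require (":
            in_block = True
        elif in_block and line == ")":
            in_block = False
        elif in_block:
            requires.add(line.split()[0])
        elif line.startswith("require "):
            requires.add(line.split()[1])
    return module, requires


def _has_prefix(import_path: str, module_path: str) -> bool:
    """True when import_path is module_path or a package inside it."""
    return import_path == module_path or import_path.startswith(module_path + "/")


def is_phantom(import_path: str, module: str, requires: set[str]) -> bool:
    """Decide, without network access, whether an import resolves nowhere."""
    first_element = import_path.split("/", 1)[0]
    if "." not in first_element:
        return False  # heuristic: standard-library paths have no dot up front
    if _has_prefix(import_path, module):
        return False  # a package inside the current module
    return not any(_has_prefix(import_path, req) for req in requires)


# Hypothetical go.mod, not copied from any real project.
GO_MOD = """module github.com/acme/tool

go 1.21

require (
	github.com/spf13/pflag v1.0.5
	gopkg.in/yaml.v3 v3.0.1 // indirect
)
"""
module, requires = parse_go_mod(GO_MOD)
print(is_phantom("github.com/example/notreal/completion", module, requires))
```

Because every input is already on disk, nothing here needs the module cache or a download, which is the property the paragraph above describes.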

These three runs reproduce exactly. Clone the projects at --depth 1, stage the same planted file, run shipmoor scan --changed. The fingerprints will match across runs because the analyzer is deterministic. That gives anyone who wants to argue with the numbers an apples-to-apples comparison.
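What makes a fingerprint comparable across runs is that it depends only on stable inputs. The post does not describe Shipmoor's actual scheme, so the following is a generic sketch under assumed inputs (rule id, repo-relative path, and the flagged source text); the function name and the 16-character truncation are arbitrary choices for this example.

```python
import hashlib


def fingerprint(rule_id: str, path: str, snippet: str) -> str:
    """Derive a stable identifier for a finding from content, not position."""
    # Collapse whitespace so reformatting alone does not change the identity.
    normalized = " ".join(snippet.split())
    # Normalize path separators so Windows and POSIX runs agree.
    payload = "\x00".join((rule_id, path.replace("\\", "/"), normalized))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


first = fingerprint("python.phantom_import", "app/ai.py", "from incidentlib.ai import summarize")
second = fingerprint("python.phantom_import", "app/ai.py", "from incidentlib.ai   import summarize")
print(first == second)
```

Leaving line numbers, timestamps, and absolute paths out of the hash is the design choice that matters: those are the inputs that differ between a patch-mode scan and a changed-mode scan of the same logical change.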

What is intentionally not in scope

Saying yes to a focused tool means saying no to a long list of things. Here is the explicit no list:

  • Style. Use ruff, ESLint, Prettier, gofmt, golangci-lint.
  • Vulnerabilities. Use pip-audit, npm audit, govulncheck, Dependabot.
  • SAST. Use semgrep, CodeQL, Snyk Code.
  • IaC scanning. Not in v0.2.0. Dockerfile, Kubernetes, and Terraform rules are on the v0.3 roadmap. Today, IaC manifests are silently skipped.
  • Daemons / file watchers / editor integrations. The Community CLI is one binary you run on demand.
  • Telemetry, analytics, or accounts. The optional PyPI/npm registry lookup is the only outbound network call. SHIPMOOR_OFFLINE=1 disables it.

We also deliberately keep the rule count small. Each new rule adds a maintenance cost in false positives, severity calibration, and reviewer trust. The v0.2 catalog is 30 rules. The bar for any new rule is: does it catch a defect class that linters and SAST tools demonstrably miss?

What we are doing next

The public backlog has the full list. The headline items:

  • Nested-manifest discovery in degraded contexts. The smoke test for P0-16 reduced incident-bot findings by 88% (615 to 72), but there is a residual where the resolver cannot find any manifest root. A --project-root escape hatch and clearer “degraded mode” output are overdue.
  • TYPE_CHECKING-aware Python imports. from _typeshed import StrPath inside if TYPE_CHECKING: should not be phantom. This accounts for a meaningful fraction of the Flask full-tree noise.
  • Reverse-resolved .js imports in TypeScript. Under TypeScript’s node16/nodenext module resolution, ESM code writes ./foo.js to refer to the source file ./foo.ts. The Zod full-tree numbers will improve once the resolver tries the reverse mapping.
  • Comment-aware import extraction. Spot-check found ~2 false positives per 50 sampled findings against commented-out imports. Small but visible.
  • IaC support. Dockerfile (secret_env, curl_pipe_shell, latest_base), Kubernetes (privileged_container, host_escape, latest_image), Terraform (local_exec, public_storage, public_ingress). High-confidence, easy to explain. The next category to ship.
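Of the items above, the TYPE_CHECKING one is the easiest to sketch. The example below collects only the imports that execute at runtime and skips anything under an `if TYPE_CHECKING:` guard. It is an illustration of the approach, not the planned implementation, and `runtime_imports` is a hypothetical name.

```python
import ast


def _is_type_checking_test(test: ast.expr) -> bool:
    """Match `TYPE_CHECKING` and `typing.TYPE_CHECKING` style guards."""
    if isinstance(test, ast.Name):
        return test.id == "TYPE_CHECKING"
    if isinstance(test, ast.Attribute):
        return test.attr == "TYPE_CHECKING"
    return False


def runtime_imports(source: str) -> set[str]:
    """Top-level module names imported outside `if TYPE_CHECKING:` blocks."""
    modules: set[str] = set()

    def visit(statements: list) -> None:
        for node in statements:
            if isinstance(node, ast.If) and _is_type_checking_test(node.test):
                visit(node.orelse)  # the else branch still runs at runtime
            elif isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom):
                if node.level == 0 and node.module:  # skip relative imports
                    modules.add(node.module.split(".")[0])
            else:
                # Descend into functions, classes, loops, with and try blocks.
                for field in ("body", "orelse", "finalbody"):
                    visit(getattr(node, field, []))
                for handler in getattr(node, "handlers", []):
                    visit(handler.body)

    visit(ast.parse(source).body)
    return modules


SOURCE = """
from typing import TYPE_CHECKING
import os

if TYPE_CHECKING:
    from _typeshed import StrPath

def load():
    import json
"""
print(sorted(runtime_imports(SOURCE)))
```

The sketch does not descend into `match` statements and treats any name ending in `TYPE_CHECKING` as the guard, so an aliased or shadowed constant would fool it. Both are acceptable for a heuristic whose job is to reduce noise rather than prove reachability.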

We do not have all the answers. The protocol is the thing we trust. If a new rule passes the same eight criteria on three real projects with reviewer-triaged actionable findings, it earns its place. If it does not, it does not ship.

The premise, restated

Existing tools were built around the assumption that the author is a human, the defects are human-shaped, and the moment of inspection is whenever the human chooses. AI coding agents broke that assumption. They author at machine speed, they make a different distribution of mistakes, and the moment between “the agent finished” and “I’m asking a human to review this” became a real workflow step with its own characteristics.

A focused tool that catches generated-code-shaped defects at exactly that moment, with a quiet enough false-positive profile that a working developer chooses to run it again tomorrow, is a category. Not a feature.

We built Shipmoor Community CLI to be the first credible thing in that category. The rules, the runs, the failures, and the fixes all live in the open. The bar we want to be held to is: does it pass the eight criteria on real code, repeatably, today.

Try it

curl -fsSL https://dl.shipmoor.dev/install-community-cli.sh | bash
cd path/to/your/repo
shipmoor scan --changed

If you run it on a codebase you maintain and the noise rate is higher than 15% on a real agent-authored change, we want to know. The protocol is the product.
