The gap nobody named
For two decades, the developer-tools market settled into a clean split: linters for style and obvious bugs, SAST for known vulnerability patterns, type checkers for contracts. Each tool answers a specific question about static code.
None of them answer the question that matters most when an agent — Codex, Claude Code, Cursor — writes a patch at 3 a.m. with no human in the loop:
Does this code actually do what it claims to do, or did the model hallucinate a confident-sounding stub?
That question doesn’t have a 2010s-era tool category. The failure modes are new:
- Placeholder logic: handlers that return `200 OK` while the persistence is `pass`
- Phantom imports: module references the model assumed exist but don’t
- Disconnected code: functions defined but never wired up
- Hallucinated APIs: calls to methods that don’t exist on the type
- Weak error handling: `try/except` that swallows every exception silently
- Logic dilution: verbose code masking minimal decision density
A linter looks at the file shape. A SAST tool looks at known sinks. Neither asks whether the function’s behavior matches its name.
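To make two of these failure modes concrete, here is a minimal sketch of how placeholder logic and swallowed exceptions could be spotted with Python's standard `ast` module. It is an illustration only, not Shipmoor's implementation: the sample functions, the rule names, and the `find_suspect_code` helper are all hypothetical, and the other four failure modes would need more context than a single file.

```python
import ast

# Hypothetical agent output: a handler that persists nothing and an
# exception handler that hides every error.
SAMPLE = '''
def save_order(order):
    """Persist the order and report success."""
    pass

def load_order(order_id):
    try:
        return db.fetch(order_id)
    except Exception:
        pass
'''


def _is_placeholder(stmt):
    """True for statements that do nothing: pass, a bare `...`, or a docstring."""
    if isinstance(stmt, ast.Pass):
        return True
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)


def find_suspect_code(source):
    """Return sorted (line, rule, name) tuples for two of the failure modes."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # A body made only of no-op statements cannot do what the name claims.
            if all(_is_placeholder(s) for s in node.body):
                findings.append((node.lineno, "placeholder-logic", node.name))
        elif isinstance(node, ast.ExceptHandler):
            # A handler with a no-op body silently discards the exception.
            if all(_is_placeholder(s) for s in node.body):
                findings.append((node.lineno, "swallowed-exception", "except"))
    return sorted(findings)


for line, rule, name in find_suspect_code(SAMPLE):
    print(f"line {line}: {rule} ({name})")
```

This is a heuristic, so it will also flag intentional stubs such as abstract methods or protocol definitions. A real check would need to weigh that context before reporting.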
Why this is a category, not a feature
A few signals that “AI code integrity” is the category layer, not a feature inside an existing tool:
- Different failure modes. The bugs above don’t appear in human-written code at the same rate, with the same signatures, or for the same reasons. New failure modes are how categories get formed.
- Different inputs. The signal you need isn’t just the file — it’s the patch, the diff baseline, and (often) the agent transcript. Existing scanners don’t take those inputs.
- Different outputs. The right output is graded: severity, confidence, and provenance (did the agent introduce this issue, or inherit it from the baseline?). Linters output flat warnings.
- Different distribution. The category lives in the agent loop: CLI, skills, harness hooks. A PR-time-only check arrives too late; by then the model has already moved on.
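The "different inputs" and "different outputs" points can be sketched together: given findings from the diff baseline and from the patched tree, each finding is labeled as introduced by the agent or inherited. This is a rough sketch under assumptions, not a description of any real tool: the `Finding` fields, the fingerprint scheme, and the `grade` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str
    path: str
    symbol: str        # function or class the finding is attached to
    severity: str      # "low", "medium", or "high"
    confidence: float  # 0.0 to 1.0

    @property
    def fingerprint(self):
        # Line numbers shift between revisions, so key on rule + path + symbol.
        return (self.rule, self.path, self.symbol)


def grade(baseline, patched):
    """Label each finding in the patched tree as "introduced" or "inherited"."""
    known = {f.fingerprint for f in baseline}
    graded = []
    # Sorting keeps the output stable regardless of scan order.
    for f in sorted(patched, key=lambda f: f.fingerprint):
        origin = "inherited" if f.fingerprint in known else "introduced"
        graded.append((f, origin))
    return graded


old = Finding("swallowed-exception", "app/db.py", "load_order", "medium", 0.9)
new = Finding("placeholder-logic", "app/api.py", "save_order", "high", 0.8)

for finding, origin in grade(baseline=[old], patched=[old, new]):
    print(finding.rule, finding.severity, finding.confidence, origin)
```

A flat warning list would report both findings identically. The graded form separates what the agent's patch added from what was already in the codebase.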
Compare past category formations. SAST emerged when “tracing untrusted data to dangerous sinks” turned out to be a fundamentally different problem from “checking syntax and style”. DAST emerged when “running the app and watching” turned out to be different from “reading the source”. The new category emerges when “did the agent’s code do what it claimed” turns out to be different from any of the above.
What we’re building
Shipmoor’s bet is that the right shape for this category is:
- Local-first, because the agent is local
- Deterministic, because grading needs to be reproducible
- SARIF-native, because security teams already have pipes for it
- Skill-shaped, because the agent should call the scanner mid-loop, not after merge
The Community CLI is the open edge of that. Pro adds policy, baselines, a console, and RBAC.
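As an illustration of what "deterministic" and "SARIF-native" can mean together, the sketch below serializes findings into a minimal SARIF 2.1.0 log with a stable ordering, so the same findings always produce byte-identical output. The tool name `example-scanner`, the input dict shape, and the severity mapping are assumptions made for this example; it does not show the Shipmoor CLI's actual output.

```python
import json

# SARIF result levels are "none", "note", "warning", and "error".
SEVERITY_TO_LEVEL = {"low": "note", "medium": "warning", "high": "error"}


def to_sarif(findings, tool_name="example-scanner"):
    """Serialize dicts with rule, path, line, severity, message keys to SARIF."""
    # Sort on a stable key so scan order never changes the output.
    ordered = sorted(findings, key=lambda f: (f["path"], f["line"], f["rule"]))
    results = [
        {
            "ruleId": f["rule"],
            "level": SEVERITY_TO_LEVEL[f["severity"]],
            "message": {"text": f["message"]},
            "locations": [
                {
                    "physicalLocation": {
                        "artifactLocation": {"uri": f["path"]},
                        "region": {"startLine": f["line"]},
                    }
                }
            ],
        }
        for f in ordered
    ]
    log = {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }
    # sort_keys removes any dependence on dict construction order.
    return json.dumps(log, indent=2, sort_keys=True)


findings = [
    {"rule": "swallowed-exception", "path": "app/db.py", "line": 9,
     "severity": "medium", "message": "except block discards the exception"},
    {"rule": "placeholder-logic", "path": "app/api.py", "line": 2,
     "severity": "high", "message": "function body is only a placeholder"},
]
print(to_sarif(findings))
```

Because the output is plain SARIF, it can be fed to whatever already consumes SARIF in a security pipeline, and reruns can be diffed directly.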
If you’ve been frustrated that your linters and SAST tools “don’t catch any of this,” that’s because they can’t. It’s not their category.
Want to see the workflow? Request a demo →