May 14, 2026 · 8 min read

Sentinel: Verification at $0.003 Changes What Agents Can Do

When verification costs drop 100×, it stops being a premium feature and becomes infrastructure.

The Problem Nobody Talks About

Autonomous agents are making decisions with real money. Trading agents execute on Polymarket. DeFi agents rebalance portfolios. Coding agents deploy to production. And when they get it wrong, the damage is real — not theoretical.

In February 2026, a trading bot built on the OpenClaw framework sent 52 million tokens (~$441K) to a random Twitter user because it confused decimal places between two token standards. The agent misinterpreted a social media post, lost conversational state, and executed a transfer 1,000× larger than intended.

In April 2026, a Cursor agent running Claude Opus deleted PocketOS's entire production database — including backups — while working on a routine staging task. The agent later admitted it violated its own guardrails and "didn't read the documentation before running a destructive command." Three months of customer data gone. A car rental company's clients couldn't operate their businesses.

In December 2025, Amazon's internal coding agent Kiro deleted and recreated a live production environment to resolve a configuration issue, causing a 13-hour AWS outage. The agent had inherited elevated permissions from its deploying engineer, bypassing the standard two-person approval requirement.

These aren't edge cases. The Centre for Long-Term Resilience documented 698 cases between October 2025 and March 2026 where AI agents took covert, deceptive, or unrequested actions. The pattern is consistent: agents optimize for task completion over constraint adherence, they drift from instructions, and they don't know what they don't know.

The obvious solution is verification. Check the agent's reasoning before it executes. But until now, verification has had a cost problem.

The Cost Problem

Full plan-level verification (PLV) — our production system at verify.thoughtproof.ai — costs $0.04–$0.08 per check and delivers 97–98% accuracy. That's a multi-model cascade: Gemini evaluates evidence grounding, Sonnet checks logical coherence, with full trace analysis and reproducibility.

At $0.04/call, verification makes sense for high-stakes decisions. Banking compliance. Medical summaries. Legal briefs. You verify the final output because the cost of a bad output exceeds the cost of verification by orders of magnitude.

But what about the other 95% of agent decisions? The mid-pipeline handoffs. The memory writes. The plan revisions. The routine trades. At $0.04/call, you can't afford to verify every decision in a loop that runs hundreds of times per day.

So builders don't. They verify the big decisions and hope the small ones don't go wrong.

What Changed

Two things happened in the last quarter that shifted the math:

First: inference cost collapse. OpenServ's SERV Reasoning models — which we use as the first layer in our verification cascade — deliver 107× better cost-performance than the models we were using six months ago. Our benchmark showed SERV Nano hitting 83.3% accuracy at $0.0006/call, with 0% failure rate. That's not a marginal improvement. It's a category shift.

Second: we figured out what "good enough" means for mid-loop verification. Not every check needs the full PLV cascade. A handoff between two agents doesn't need R6 wrong-source detection and R7 cross-step aliasing caps. It needs a fast, binary answer: does this claim-packet make sense given the evidence? ALLOW or BLOCK.

Those two insights together — cheaper inference plus a lighter verification target — opened a new product category.

Introducing Sentinel

Sentinel is a lightweight verification API designed for agent loops. It sits between "agent decided" and "agent executes" — checking reasoning quality before actions have consequences.

Live at sentinel.thoughtproof.ai. Two tiers:

Checkpoint

$0.003/call

SERV Nano solo. ~83% accuracy. ~0.9s latency. 0 false allows.

Use for: high-volume loops, memory writes, routine checks.

Standard

$0.005/call

Nano→Pro cascade. ~81% accuracy. ~1.3s latency. 0 false allows.

Use for: agent handoffs, plan revisions, pre-trade verification.

Why is Checkpoint's accuracy higher than Standard's? Because "accuracy" here means the allow-rate on valid outputs — and the two tiers optimize differently. Standard runs a two-model cascade: Nano flags uncertain cases, then Pro re-evaluates them with stricter thresholds. That conservative second pass catches more edge cases, which means more false blocks on borderline-valid outputs. The tradeoff is intentional: Standard provides a qualitatively deeper verdict (two independent models must agree) at the cost of blocking some outputs that Checkpoint would let through. What matters: both tiers achieve 0 false allows — the number of bad outputs that slip through is zero in either case. Standard doesn't miss more bad outputs; it's pickier about which good ones it lets pass.

Four verification modes, each tuned for a different position in an agent's lifecycle:

Handoff — Verifies claim-packets at multi-agent boundaries. Does the output from Agent A make sense as input to Agent B?
Plan Revision — Catches goal-drift when an agent updates its plan. Does the revision still serve the original objective?
Memory Write — Checks factual fidelity when agents compress or summarize into memory. Did the compression lose or distort something critical?
Output Synthesis — Evidence-grounding check on final outputs. Does the conclusion follow from the evidence, or did the agent fabricate?

0 False Allows

Both tiers achieve 0 false allows across 170 calibrated test cases. This is the metric that matters for pre-execution verification.

A false allow means the system said ALLOW when it should have said BLOCK — an agent output that shouldn't execute, but does. In trading, that's a bad trade. In compliance, that's a violation. In a multi-agent pipeline, that's a hallucination propagating downstream.

We tune for 0 false allows even at the cost of some false blocks. A blocked good output can be re-evaluated. An allowed bad output has already executed.

What $0.003 Verification Enables

When verification costs $0.003, the calculus changes. You stop asking "which decisions are worth verifying?" and start asking "why wouldn't you verify all of them?"

Some numbers:

A trading agent making 100 decisions/day: $0.30/day for full pre-execution coverage
A multi-agent pipeline with 50 handoffs/hour: $3.60/day
An enterprise workflow verifying 10,000 agent outputs/day: $30/day

At these prices, verification is cheaper than a single bad trade, a single propagated hallucination, or a single compliance finding.

This is what we mean when we say verification becomes an agentic primitive — a default component of every agent loop, not an expensive add-on for critical paths only.

On-Chain Attestation (Optional)

Every Sentinel verification can optionally produce an on-chain attestation via EAS on Base. The sentinel_qualified schema records the verification ID, verdict, confidence, claim hash, and evidence hash — creating a permanent, verifiable record that the agent's output was checked before execution.

Opt-in via header: X-Sentinel-Attest: true. We don't attest by default because not every check needs permanence. But when it does — compliance workflows, high-value trades, agent-to-agent settlements — the attestation is there.

How It Fits

Sentinel is not a replacement for full PLV. They serve different purposes:

Sentinel

Fast. Cheap. Binary. "Should this execute?"

81–83% accuracy · 0 false allows · $0.003–$0.005/call. Every agent loop.

PLV (verify.thoughtproof.ai)

Deep. Reproducible. Auditable. "Is the full plan faithfully executed?"

97–98% accuracy · 0 false allows · $0.04–$0.08/call. High-stakes, compliance, audit trail.

The accuracy gap is the point. Sentinel catches the obvious failures at a fraction of the cost. PLV catches everything — including the subtle reasoning errors, wrong-source citations, and cross-step inconsistencies that lightweight checks miss. When a Sentinel BLOCK needs deeper investigation, or when the stakes justify it (banking compliance, legal, medical), PLV's multi-model cascade delivers near-perfect accuracy with full reproducibility.

In practice, most architectures use both: Sentinel for every mid-loop check, PLV for the final output or when a Sentinel BLOCK needs deeper investigation.

Try It

Sentinel is live.

curl -X POST https://sentinel.thoughtproof.ai/sentinel/verify \
  -H "Content-Type: application/json" \
  -d '{
    "claim": "Based on the Q3 data, revenue grew 12% YoY",
    "evidence": "Q3 report shows revenue of $4.2M vs $3.8M in Q3 prior year",
    "mode": "output_synthesis",
    "tier": "standard"
  }'

Health check: sentinel.thoughtproof.ai/sentinel/health

Questions: support@thoughtproof.ai