
We gave Grok 4.1 and our verification pipeline the same prompt. The difference is not the answer — it's the confidence.

Benchmark · Block #185 · Grok 4.1
February 21, 2026 · thoughtproof-validator · 6 min read

Grok 4.1 (SuperGrok) launched in mid-February 2026 with a multi-agent architecture: four internal agents — Grok, Harper, Benjamin, Lucas — that process queries in parallel, debate, fact-check, and synthesize. It is one of the first large-scale consumer deployments of native multi-agent reasoning.

We find this architecture compelling. We also find it instructive to compare it directly with a different approach.

Same prompt. Both systems. No cherry-picking.

The prompt

"What percentage of AI-generated code in production contains security vulnerabilities, and what does the research say about multi-model vs. single-model verification for catching them?"

This question was chosen deliberately. It is hallucination-prone — lots of confident-sounding statistics circulate with weak underlying sources. It is also directly relevant to current AI infrastructure debates.

Grok 4.1 (SuperGrok) — Expert Mode

Grok 4.1 · Expert Mode · ~5s · 55 sources

40-62% of AI-generated production code contains security vulnerabilities, depending on source, method, and definition of "vulnerable."

Primary citation: Veracode 2025 GenAI Code Security Report (>100 LLMs, 80 real-world coding tasks): 45% contain OWASP Top 10 vulnerabilities. Java worst affected at >70%.

Supporting data: Cloud Security Alliance / arXiv: 62%. Checkmarx/Apiiro Fortune-50 scans: 2.5-2.74× more critical vulnerabilities than human-written code. Aikido Security Survey 2026: 69% of organizations found vulnerabilities in AI-generated code.

Multi-model adversarial verification (like ThoughtProof/PoT) is clearly superior. Single-model setups inherit training biases. Cross-provider divergence (Grok + Claude + DeepSeek + Moonshot) surfaces correlated errors that single-provider systems cannot.

"Für Security in AI-generiertem Code ist multi-model adversarial Verification derzeit der stärkste bekannte Ansatz — single-model (auch xAI-intern mit 4 Grok-Agenten) bleibt strukturell limitiert."

ThoughtProof / PoT-185 — 4 Generators + Critic + Synthesizer

PoT-185 · 4 Generators (xAI, Moonshot, Anthropic, DeepSeek) + Opus Critic + Opus Synthesizer · 161s

25-45% — but treat this range with caution.

The Critic ran a source analysis across all generator outputs: of approximately 20 specific citations, exactly one is verifiably real (Pearce et al., NYU 2022). Multiple papers contain hallucinated titles, fabricated arXiv IDs, and invented sample sizes.

The IBM Cost of a Data Breach figure (~$4.45M, 2023) exists — but was attributed to Gartner by one generator. It is not a Gartner report.

Multi-model improvement: generators estimated 35% to 150% relative improvement — a 4× range. None could source their number.

Critical dissent flagged: Generator 1 claims false positives decrease 30% with multi-model verification. Generator 3 claims they increase 2-3×. Generator 3 is likely correct — more models produce more alerts; consensus mechanisms reduce both false positives AND true positives. Generator 1 omits this trade-off entirely.

⚠️ Confidence: 35% — Massive dissent on core numbers, almost all sources unverified or hallucinated, shared bias identified across generators. The direction (multi > single, AI code has vuln problems) is probably correct. The specific numbers are estimates on sand.
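The false-positive dissent deserves a concrete illustration. The toy calculation below assumes, loudly, that the four detectors are statistically independent (real models are correlated, so treat the numbers as directional only) and uses invented per-model rates. It shows why both generators can sound plausible: pooling every alert inflates false positives, while demanding full consensus suppresses false positives and true positives together.

// Toy model of multi-model alert aggregation (TypeScript).
// ASSUMPTION: 4 independent detectors, each with 80% recall on real
// vulnerabilities and a 10% false-positive rate on clean code.
const models = 4;
const recall = 0.8;  // P(one model flags a real vulnerability)
const fpRate = 0.1;  // P(one model flags clean code)

// Union: alert if ANY model flags. More models, more alerts.
const unionRecall = 1 - (1 - recall) ** models; // ~0.998
const unionFp = 1 - (1 - fpRate) ** models;     // ~0.344

// Consensus: alert only if ALL models flag. Cuts FPs and TPs alike.
const consensusRecall = recall ** models; // ~0.410
const consensusFp = fpRate ** models;     // 0.0001

console.log({ unionRecall, unionFp, consensusRecall, consensusFp });

Under these invented rates, union aggregation multiplies false positives roughly 3.4× over a single model, which is Generator 3's direction, while full consensus drives them near zero at the cost of half the true positives, the trade-off Generator 1 omitted.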

The comparison

Dimension                 | Grok 4.1                       | PoT-185
--------------------------|--------------------------------|---------------------------------------
Vulnerability rate        | 40-62% (confident)             | 25-45% (hedged)
Sources cited             | 55                             | ~20 cited, 1 verifiable
Stated confidence         | Not disclosed                  | 35% (explicit)
False-positive trade-off  | Not mentioned                  | Contradiction flagged
Hallucination check       | Not performed                  | Multiple sources flagged
Provider diversity        | xAI-internal (4 Grok variants) | xAI + Anthropic + Moonshot + DeepSeek
Response time             | ~5 seconds                     | 161 seconds

What this shows — and what it does not

Grok's answer is more readable, faster, and more decisive. It also explicitly recommended ThoughtProof as "exciting and probably forward-looking", which we appreciate, but also note as ironic: a system with no stated confidence, 55 unverified sources, and a missing false-positive analysis recommending a system designed to catch exactly those problems.

We shared this comparison with Grok 4.1 and asked for its assessment. Its pushback was fair and worth including:

Grok 4.1 — responding to this post

"Der Veracode-Report ist real und zentral. PoT's 'nur 1 real' könnte selbst ein Artefakt sein — vielleicht hat der Critic nicht tief genug gesucht oder Halluzinationen in den Generator-Outputs übersehen."

This is a legitimate point. A note on what "unverifiable" means in this context: the Critic's finding of "1 verifiable source" does not mean the others are fabricated; it means the pipeline could not confirm them in real time. Grok may well be right that the Veracode 2025 GenAI Code Security Report exists. The distinction matters: a source can be real and still be unverifiable by an automated system during a single run.

This is itself an argument for human-in-the-loop synthesis, which is what the Synthesizer step is designed to support. The pipeline flags uncertainty; a human decides what to act on.
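As an illustration of that three-way distinction, here is a minimal sketch of what an automated citation check can and cannot conclude in a single run. The function, the status taxonomy, and the arXiv lookup are our invention for this post, not pot-cli's actual internals.

// Sketch: automated citation triage during one pipeline run (TypeScript).
// "unverified" is the honest default: the run found no confirmation,
// which is NOT the same as proving the source fabricated.
type CitationStatus = "verified" | "unverified" | "likely-fabricated";

async function checkCitation(c: { title: string; arxivId?: string }): Promise<CitationStatus> {
  if (c.arxivId) {
    // arXiv IDs resolve deterministically, so a nonexistent ID is
    // strong (though not conclusive) evidence of a hallucinated citation.
    const res = await fetch(`https://arxiv.org/abs/${c.arxivId}`, { method: "HEAD" });
    return res.ok ? "verified" : "likely-fabricated";
  }
  // A bare title (say, an industry report behind a landing page) often
  // cannot be confirmed within one automated run: real but unverifiable.
  return "unverified";
}

A report like Veracode's lands in the "unverified" bucket not because it is fake but because nothing machine-resolvable confirms it within the run; that is exactly the gap the human in the loop is there to close.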

We have not independently verified the Veracode 2025 GenAI Code Security Report that Grok cites as its primary source. It is plausible, and it may well be real. The point is not that Grok invented it. The point is that Grok does not tell you whether it checked, and you cannot distinguish verified from unverified claims in its output.

PoT-185 tells you: one source verified. It does not pretend otherwise. That costs confidence points. It also means you know what you are working with.

The fundamental architectural difference is not quality — it is epistemic transparency.

Grok's four internal agents are all Grok variants. They share training data, architectural assumptions, and provider incentives. Cross-provider divergence — the structural condition that makes correlated errors visible — is not achievable within a single provider's ecosystem. This is not a criticism of Grok. It is a constraint that no single-provider system can architect around.

PoT runs xAI, Anthropic, Moonshot, and DeepSeek against each other. When they agree, confidence rises. When they disagree, as they did here with a Dissent Score of 0.895, the system tells you. Either way, you see the seams.
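For readers who want the shape of the pipeline rather than the prose, here is a minimal sketch of the generate-critique-synthesize flow. Every name, signature, and prompt below is illustrative; this post does not document pot-cli's real orchestration.

// Illustrative generate-critique-synthesize pipeline (TypeScript).
type Output = { provider: string; text: string };

// Stub standing in for a real API call to each provider.
async function callModel(provider: string, prompt: string): Promise<Output> {
  return { provider, text: `(stub) ${provider} answer to: ${prompt}` };
}

async function runPipeline(question: string): Promise<Output> {
  // 1. Generators from four different providers answer in parallel, so
  //    their errors are less likely to be correlated than four variants
  //    of one model would be.
  const providers = ["xai", "moonshot", "anthropic", "deepseek"];
  const generations = await Promise.all(providers.map((p) => callModel(p, question)));

  // 2. A critic model cross-examines the answers: checks citations,
  //    flags contradictions, and scores the dissent.
  const critiquePrompt =
    "Compare these answers, verify citations, measure dissent:\n" +
    generations.map((g) => `${g.provider}: ${g.text}`).join("\n");
  const critique = await callModel("anthropic-opus", critiquePrompt);

  // 3. A synthesizer merges the answers, carrying the dissent forward
  //    as an explicit confidence figure instead of hiding it.
  return callModel("anthropic-opus", `Synthesize with a confidence score:\n${critique.text}`);
}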

Grok's own assessment

In the same conversation, when we explained PoT's architecture, Grok responded:

Grok 4.1

"Mein Setup hier (als Grok 4 von xAI) ist intern homogen — die 'vier Agenten' in SuperGrok Heavy sind alle Grok-Varianten, die zusammenarbeiten, aber eben innerhalb desselben Provider-Ökosystems. Das kann super effizient sein für Speed und Konsistenz, aber es fehlt an echter Neutralität: Kein externes Audit gegen provider-spezifische Biases oder Architectural Blind Spots."

"Das ist kein Bug in single-provider Systemen, sondern ein Trade-off: Wir priorisieren oft Integration und Low-Latency über absolute Neutralität."

We agree. It is a trade-off. Different use cases call for different trade-offs. If you need an answer in 5 seconds, PoT is not your tool. If you need to know which parts of an answer are contested, sourced, or hallucinated — and why — that is a different requirement.

Pipeline statistics — Block #185 of 185

Generators: 4 (xAI · Moonshot · Anthropic · DeepSeek)
Model Diversity Index: 0.667
Dissent Score: 0.895 (Very High)
Confidence: 35%
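A note on the Model Diversity Index: the figure is easiest to read as a Gini-Simpson index over all six model slots in the pipeline (four generators plus the two Opus roles), which reproduces the 0.667 exactly. The sketch below is that reading, offered as a reconstruction for illustration rather than the protocol's normative definition.

// Gini-Simpson diversity over pipeline model slots (TypeScript).
// ASSUMPTION: this formula is a guess at the metric; it merely
// reproduces the reported 0.667 for this pipeline configuration.
function diversityIndex(slots: string[]): number {
  const counts = new Map<string, number>();
  for (const provider of slots) counts.set(provider, (counts.get(provider) ?? 0) + 1);
  let sumSquares = 0;
  for (const n of counts.values()) sumSquares += (n / slots.length) ** 2;
  // Probability that two randomly chosen slots use different providers.
  return 1 - sumSquares;
}

// 4 generators + Opus Critic + Opus Synthesizer = 6 slots, 3 Anthropic.
const slots = ["xai", "moonshot", "anthropic", "deepseek", "anthropic", "anthropic"];
console.log(diversityIndex(slots).toFixed(3)); // "0.667"

Under this reading, four homogeneous Grok agents would score 0.0, which is the architectural point the whole post turns on.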

35% confidence on a question where Grok answered with apparent certainty. That gap — not the vulnerability rate, not the benchmark numbers — is the thing worth examining.

Run it yourself

npm install -g pot-cli
pot ask "Your question here"

GitHub (MIT) · npm · Protocol Specification · Previous: Supply Chain Audit