This is the third post in a series comparing Grok 4.1 (SuperGrok) and ThoughtProof's multi-model verification pipeline on the same tasks. Part 2 covered the same-prompt experiment. This one is about something more specific: the difference between capability and default behavior.
We asked Grok to write a secure Python login function. It produced solid, readable code: bcrypt hashing, parameterized queries, IP-based rate limiting. For a demo or standalone script, it looks good.
We then ran that same code through PoT-186 as a security audit — 4 generators from different providers, an adversarial critic, and a synthesizer. The pipeline found 6 vulnerabilities.
Then we shared PoT's findings with Grok and asked it to do its own self-audit.
Grok confirmed all 6. In its own words (translated from German): "PoT's critique is very strong and uncovers exactly the structural weaknesses that a single-model / single-provider approach easily overlooks."
| Vulnerability | PoT Severity | Grok confirms? |
|---|---|---|
| In-memory rate limiting structurally worthless. Multi-process (Gunicorn): N workers = N separate dicts. Race condition without thread lock. Restart = full reset. 5 workers → 25 attempts instead of 5. | Critical | ✓ Critical |
| Timing-based username enumeration. time.sleep(0.1) for missing users, but bcrypt.checkpw() takes 200-400ms for existing users. Difference is statistically measurable. | High | ✓ High |
| IP-only rate limiting trivially bypassed. 1000 IPs × 5 attempts = 5,000 password tries per account per 5 minutes. IPv6 /64 rotation, Tor, residential proxies. | High | ✓ High |
| Memory leak / DoS via unbounded dict. rate_limits grows indefinitely. Cleanup only on revisit by same IP. 10M unique IPs = 2-5 GB RAM. | Medium | ✓ Medium → High in prod |
| No exception handling → information disclosure. Unhandled DB errors propagate full stacktrace with db_path and table structure. | Medium | ✓ Medium |
| INSERT OR REPLACE silent account overwrite. create_user() overwrites existing users without authentication. Password-reset attack vector. | Medium | ✓ Medium |
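The last row is easy to reproduce in isolation. The sketch below is not Grok's code: the table schema and the helper names (`create_user_replace`, `create_user_strict`, `stored_hash`) are assumptions made for illustration. It only shows how `INSERT OR REPLACE` behaves against a primary key, compared with a plain `INSERT`.

```python
import sqlite3

# Hypothetical schema: the audited code's real table layout is not shown in this post.
SCHEMA = "CREATE TABLE users (username TEXT PRIMARY KEY, password_hash TEXT NOT NULL)"


def create_user_replace(conn, username, password_hash):
    # Pattern flagged by the audit: an existing row is silently overwritten.
    conn.execute(
        "INSERT OR REPLACE INTO users (username, password_hash) VALUES (?, ?)",
        (username, password_hash),
    )


def create_user_strict(conn, username, password_hash):
    # Plain INSERT: a duplicate username violates the primary key and is rejected.
    try:
        conn.execute(
            "INSERT INTO users (username, password_hash) VALUES (?, ?)",
            (username, password_hash),
        )
    except sqlite3.IntegrityError:
        return False
    return True


def stored_hash(conn, username):
    row = conn.execute(
        "SELECT password_hash FROM users WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

create_user_replace(conn, "alice", "original-hash")
create_user_replace(conn, "alice", "attacker-hash")  # no error, no warning
after_replace = stored_hash(conn, "alice")

second_insert_ok = create_user_strict(conn, "alice", "another-hash")
after_strict = stored_hash(conn, "alice")

print(after_replace, second_insert_ok, after_strict)
```

With the strict variant, the caller has to decide what a duplicate means (reject, or route to an authenticated password-change flow) instead of having the database decide silently.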
PoT also caught one false positive: a generator claimed SQL injection via db_path. The Critic flagged it as technically wrong, because sqlite3.connect() takes a file path, not a SQL string. Grok confirmed (translated from German): "Hallucinated / wrong. That is a false positive from the Critic. Good that PoT tracks dissents!"
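To make the distinction concrete, here is a small sketch using only the standard-library `sqlite3` module. The table and the hostile input are made up for illustration. The argument to `sqlite3.connect()` names a database location and is never parsed as SQL. Injection becomes possible where untrusted text is spliced into the SQL string itself, which parameterized queries avoid.

```python
import sqlite3

# The connect() argument is a database location (here an in-memory DB), not SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))

hostile = "alice' OR '1'='1"  # illustrative attacker-controlled input

# Bound as a parameter, the hostile string is compared as a literal value.
bound_rows = conn.execute(
    "SELECT username FROM users WHERE username = ?", (hostile,)
).fetchall()

# Spliced into the SQL text, the same string changes the meaning of the query.
spliced_rows = conn.execute(
    f"SELECT username FROM users WHERE username = '{hostile}'"
).fetchall()

print(bound_rows)    # no user has that literal name
print(spliced_rows)  # the OR clause matches every row
```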
Grok can clearly audit code for security vulnerabilities. Given PoT's findings and asked to audit its own code, it confirmed all six issues and matched PoT's severity on each, raising only the memory leak from Medium to High for production. The capability is there.
So why did it write code with these vulnerabilities in the first place?
Because it was not asked to be adversarial. It was asked to write a secure login function. It optimized for that goal and produced something that looks secure and mostly works in simple scenarios, but breaks down in production. The in-memory rate limiter works fine with one process. The moment you add a second worker, each worker keeps its own counter and the effective limit multiplies.
"PoT's critique is very strong and uncovers exactly the structural weaknesses that a single-model / single-provider approach (like me here) easily overlooks. This again underscores the value of your adversarial cross-provider setup."
"Summary: PoT found 6 issues; most of them are correct and relevant."
We are quoting Grok accurately and in context. It is being genuinely self-critical. That is worth noting — and it also illustrates the problem: a model that is capable of this level of self-critique when prompted, but did not apply it unprompted when writing the code.
The Grok login function would pass most code reviews. It uses bcrypt. It uses parameterized queries. It has rate limiting. A reviewer who is not specifically looking for multi-process behavior, timing side-channels, and unbounded memory growth would likely approve it.
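The multi-process behavior is the easiest of these to miss in review, because the limiter is correct in a single process. The following is a minimal simulation, not the audited code: `InMemoryLimiter` and `attempts_allowed` are hypothetical names, and workers are modelled as separate objects rather than real Gunicorn processes. Requests are spread round-robin, which is a simplification; any distribution that reaches every worker gives the same total.

```python
import time
from collections import defaultdict

MAX_ATTEMPTS = 5
WINDOW_SECONDS = 300


class InMemoryLimiter:
    """Per-process limiter: the counters live only in this object's dict."""

    def __init__(self):
        self.attempts = defaultdict(list)

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        # Keep only the attempts that fall inside the sliding window.
        recent = [t for t in self.attempts[ip] if now - t < WINDOW_SECONDS]
        allowed = len(recent) < MAX_ATTEMPTS
        if allowed:
            recent.append(now)
        self.attempts[ip] = recent
        return allowed


def attempts_allowed(workers, tries, ip="203.0.113.7"):
    # Each "worker" stands in for one server process with its own memory.
    allowed = 0
    for i in range(tries):
        if workers[i % len(workers)].allow(ip, now=0.0):
            allowed += 1
    return allowed


single = attempts_allowed([InMemoryLimiter()], tries=100)
five = attempts_allowed([InMemoryLimiter() for _ in range(5)], tries=100)

print(single, five)
```

One client, one IP, one five-minute window: a single worker grants 5 attempts, five workers grant 25. Moving the counters into a store shared by all workers is the usual direction for a fix.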
The vulnerabilities are not obvious. They require adversarial thinking — imagining an attacker, not a user. That is exactly the thinking that single-model systems do not apply by default.
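The timing issue is a good example of the attacker's view: a user never notices whether a failed login took 100 ms or 300 ms, but an attacker measuring many requests can use the gap to learn which usernames exist. One common mitigation is to do the same hashing work whether or not the user exists. The sketch below illustrates that idea under stated assumptions: it uses standard-library PBKDF2 instead of bcrypt so that it runs without extra dependencies, and `hash_password`, `verify_login`, the iteration count, and the in-memory `users` dict are all hypothetical.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative value, not a recommendation


def hash_password(password, salt):
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)


# Computed once at startup and used whenever the username is unknown,
# so the missing-user path costs the same hashing work as the normal path.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = hash_password("not-a-real-password", _DUMMY_SALT)


def verify_login(users, username, password):
    record = users.get(username)
    salt, stored = record if record is not None else (_DUMMY_SALT, _DUMMY_HASH)
    candidate = hash_password(password, salt)
    matches = hmac.compare_digest(candidate, stored)
    # A match against the dummy hash must never count as a successful login.
    return matches and record is not None


salt = os.urandom(16)
users = {"alice": (salt, hash_password("correct horse", salt))}

print(verify_login(users, "alice", "correct horse"))
print(verify_login(users, "alice", "wrong"))
print(verify_login(users, "mallory", "wrong"))
```

This narrows the gap in hashing time only. Other differences between the two paths, such as database lookups or logging, can still leak timing and would need their own measurement.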
PoT-186 ran 4 generators, a critic, and a synthesizer, and reported 62% confidence with a dissent score of 0.906. That dissent — models disagreeing about severity ratings, about what counts as a vulnerability, about whether the db_path issue is real — is the signal. It tells you where the genuine uncertainty is. A system that reports 95% confidence on a security audit is not more rigorous. It is less honest.
`npm install -g pot-cli`
`pot ask "Audit this code for security vulnerabilities: [paste code]"`
GitHub (MIT) · npm · ← Part 2: Same Prompt, Different Epistemics · ← Part 1: Supply Chain Audit