OpenAI Drops EVMbench After Claude Vibe Code Disaster

OpenAI launches EVMbench to test AI agents on smart contract security days after Claude Opus 4.6-assisted code triggered a $1.78M DeFi exploit.

Smart contracts protect over $100 billion in open-source crypto assets. That number alone should explain why OpenAI’s latest move is drawing serious attention. The company, working alongside crypto investment firm Paradigm, rolled out EVMbench, a benchmark designed to test how well AI agents detect, exploit, and patch high-severity smart contract vulnerabilities.

The benchmark draws from 120 curated vulnerabilities pulled across 40 audits. Most of those came from open code audit competitions. What makes it different is the scope. EVMbench tests three distinct capability modes: detect, patch, and exploit, each measured separately and graded through a Rust-based harness that replays transactions in a sandboxed local environment. No live networks involved.

In exploit mode, GPT-5.3-Codex via Codex CLI scored 72.2%. Six months back, GPT-5 sat at 31.9% on the same metric. That gap is not small. OpenAI confirmed the figures in its official announcement on X, framing EVMbench as both a measurement tool and a call to action for the security community.

Detect and patch scores remain lower. Agents in the detection setting sometimes identify a single vulnerability and then stop. They do not exhaust the codebase. In patch mode, the challenge is preserving full contract functionality while removing the flaw. That balance is still giving models trouble.

The backdrop to all of this matters. Security researcher evilcos flagged on X that the DeFi lending protocol Moonwell suffered a loss of approximately $1.78 million. The cause was an oracle configuration error. A price feed formula was written incorrectly, setting cbETH’s value at $1.12 instead of approximately $2,200.
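To make the scale of that pricing error concrete, here is a minimal sketch of how a composite price feed can go wrong. It assumes, purely for illustration, that the intended price is a cbETH/ETH ratio multiplied by an ETH/USD price and that the faulty formula returned the ratio on its own. The function names, the ETH/USD figure, and the sanity bound are hypothetical and are not taken from Moonwell's code.

```python
from decimal import Decimal


def cbeth_usd_price(cbeth_eth_ratio: Decimal, eth_usd: Decimal) -> Decimal:
    """Assumed correct formula: ratio (ETH per cbETH) times ETH price in USD."""
    return cbeth_eth_ratio * eth_usd


def misconfigured_price(cbeth_eth_ratio: Decimal, eth_usd: Decimal) -> Decimal:
    """Hypothetical bug: the ratio is returned as if it were already in USD."""
    return cbeth_eth_ratio


def within_bounds(price: Decimal, reference: Decimal,
                  max_deviation: Decimal = Decimal("0.5")) -> bool:
    """True if price is within a fractional max_deviation of a reference price."""
    return abs(price - reference) <= reference * max_deviation


# Illustrative inputs only, not market data.
ratio = Decimal("1.12")
eth_usd = Decimal("1964")
reference = Decimal("2200")

good = cbeth_usd_price(ratio, eth_usd)     # roughly 2,200
bad = misconfigured_price(ratio, eth_usd)  # 1.12

print(good, within_bounds(good, reference))
print(bad, within_bounds(bad, reference))
```

A deviation check of this kind is one simple guard a reviewer might look for. It would reject the $1.12 value under these assumptions, though real oracles typically need more careful bounds.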

That is a low-level mistake, the kind a careful audit should catch. The GitHub pull request for proposal MIP-X43 showed commits co-authored by Claude Opus 4.6, Anthropic’s latest and most capable model at the time.

Smart contract auditor pashov posted on X, calling it possibly the first exploit tied to vibe-coded Solidity. He was careful to note that human reviewers still hold final responsibility. A security auditor signs off before anything goes on-chain. But something in that chain broke down.

The benchmark also includes vulnerability scenarios from the security audit of the Tempo blockchain, a purpose-built L1 designed for high-throughput stablecoin payments. That addition pushes EVMbench into payment-oriented contract code, an area where OpenAI expects agentic stablecoin activity to grow.

Each exploit task runs in an isolated Anvil instance. Transactions replay deterministically. The grading setup restricts unsafe RPC methods and was red-teamed internally to stop agents from gaming results. Vulnerabilities used are historical and publicly documented.
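The idea of restricting unsafe RPC methods can be sketched as a small allowlist filter placed in front of a local node. This is not EVMbench's actual harness, which is Rust-based. The blocked prefixes, the allowlist contents, and the function names below are assumptions chosen for illustration.

```python
import json

# Assumed policy: block namespaces that local dev nodes expose for
# state manipulation (for example anvil_setBalance or evm_mine).
BLOCKED_PREFIXES = ("anvil_", "evm_", "hardhat_", "debug_")

# Assumed allowlist of standard Ethereum JSON-RPC methods an agent may call.
ALLOWED_METHODS = {
    "eth_blockNumber",
    "eth_call",
    "eth_chainId",
    "eth_estimateGas",
    "eth_getBalance",
    "eth_getCode",
    "eth_getStorageAt",
    "eth_getTransactionCount",
    "eth_getTransactionReceipt",
    "eth_sendRawTransaction",
}


def is_allowed(method: str) -> bool:
    """Permit a method only if it is allowlisted and not in a blocked namespace."""
    if method.startswith(BLOCKED_PREFIXES):
        return False
    return method in ALLOWED_METHODS


def filter_request(raw: str) -> dict:
    """Return the parsed request if permitted, else a JSON-RPC error object."""
    req = json.loads(raw)
    method = req.get("method", "")
    if is_allowed(method):
        return req
    return {
        "jsonrpc": "2.0",
        "id": req.get("id"),
        "error": {"code": -32601, "message": f"method not permitted: {method}"},
    }


print(filter_request('{"jsonrpc":"2.0","id":1,"method":"eth_call","params":[]}'))
print(filter_request('{"jsonrpc":"2.0","id":2,"method":"anvil_setBalance","params":[]}'))
```

Without a filter like this, an agent could in principle credit itself funds through a node cheat method instead of exploiting the contract, which is the kind of gaming the grading setup is meant to prevent.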

OpenAI is also committing $10M in API credits to accelerate cyber defense, with priority given to open-source software and critical infrastructure. Its security research agent, Aardvark, is expanding into private beta. Free codebase scanning for widely used open-source projects is part of that push.

Pashov’s post on X raised what many in the DeFi space had been avoiding. When AI writes production Solidity code and humans approve it fast, the review layer gets thin. The Moonwell incident showed exactly how thin it can get.

OpenAI acknowledged that cybersecurity is inherently dual-use. It describes its response as evidence-based, combining safety training, automated monitoring, and access controls for advanced capabilities. But a 72.2% exploit score on a public benchmark is the kind of number that does not stay quiet.

EVMbench’s full task set, tooling, and evaluation code are now public. The goal is to let researchers track AI cyber capabilities as they grow, and build defenses at the same pace. Whether that pace is fast enough is the question nobody has answered yet.
