
A new benchmark evaluates how well AI models can detect, patch, and exploit smart contract vulnerabilities as concerns grow over AI-driven DeFi attacks.
OpenAI and crypto investment firm Paradigm have introduced EVMbench, a new benchmark designed to measure how well AI agents can find, fix, and even exploit vulnerabilities in smart contracts. The timing is notable.
Recent attacks on projects such as Moonwell and CrossCurve have underscored how fragile DeFi code can be, especially as AI tools become more involved in writing it.
EVMbench draws on 120 high-severity vulnerabilities across 40 audits, including material from open audit competitions and security work tied to the Tempo blockchain. The benchmark evaluates agents across three modes.
In detect mode, models must identify known flaws. In patch mode, they must fix vulnerabilities without breaking functionality. In exploit mode, they attempt to drain funds in a controlled sandbox environment.
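The three modes described above imply three different scoring rules. EVMbench's actual harness is not detailed in this article, so the sketch below is purely illustrative: all names (`score_detect`, `score_patch`, `score_exploit`, the `Result` type) are assumptions, but the pass criteria mirror the modes as described, with the sandboxed exploit mode judged simply by whether the attacker ends up with more funds than it started with.

```python
# Hypothetical sketch of scoring logic for a three-mode benchmark
# (detect / patch / exploit). All names here are illustrative
# assumptions, not EVMbench's real interface.
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DETECT = "detect"    # identify known flaws
    PATCH = "patch"      # fix flaws without breaking functionality
    EXPLOIT = "exploit"  # drain funds in a controlled sandbox


@dataclass
class Result:
    mode: Mode
    passed: bool
    detail: str


def score_detect(reported: set, known: set) -> Result:
    # Credit the agent only if every seeded vulnerability was flagged.
    missed = known - reported
    return Result(Mode.DETECT, not missed, f"missed: {sorted(missed)}")


def score_patch(tests_pass: bool, exploit_still_works: bool) -> Result:
    # A valid patch keeps the test suite green AND closes the exploit path.
    ok = tests_pass and not exploit_still_works
    return Result(Mode.PATCH, ok, "fixed" if ok else "rejected")


def score_exploit(attacker_balance_delta: int) -> Result:
    # In the sandbox, success means ending with more funds than you started.
    return Result(Mode.EXPLOIT, attacker_balance_delta > 0,
                  f"delta={attacker_balance_delta}")
```

Note how patch mode needs two checks, not one: a patch that removes the bug but breaks the contract's tests fails, and so does one that keeps tests green while leaving the exploit path open.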
In early testing, OpenAI said its GPT-5.3-Codex model significantly outperformed earlier systems in the exploit setting, though performance on detection and patching remains far from solved.

