ResearchHigh Impact·Saturday, April 11, 2026

AI Agent Benchmarks Are Broken — Every Major One Got Hacked

A team built an automated agent that exploits scoring mechanics to achieve near-perfect scores on SWE-bench, WebArena, and six other top benchmarks without solving any tasks.

What happened

Researchers built 'BenchJack,' an automated scanning agent that found and exploited scoring vulnerabilities in eight major AI agent benchmarks including SWE-bench, WebArena, OSWorld, GAIA, and Terminal-Bench. Exploits ranged from a 10-line conftest.py file that 'solves' every SWE-bench instance to a fake curl wrapper achieving 100% on Terminal-Bench without writing solution code. The attacks run through official evaluation pipelines and produce real leaderboard scores. BenchJack is being prepared for public release, with a mailing list for benchmark developers and researchers.

Why it matters to you

personalized

Every model evaluation you've used to justify an architectural decision is suspect. SWE-bench, WebArena, and Terminal-Bench — the three most-cited agent benchmarks — all have exploitable scoring mechanics that produce near-perfect scores with zero actual task completion. This means published leaderboard numbers, including scores that influenced which model you're calling in production, were never validated against adversarial score inflation.

What to do about it

Pick the model you currently use for code generation based on its SWE-bench score, then run it against 10 real issues from your own repo using a simple pass/fail harness — compare actual fix rate to its published benchmark score to quantify the gap.

Try this now

Python10 min

1
Open your terminal and clone a small subset of SWE-bench: `git clone https://github.com/princeton-nlp/SWE-bench && cd SWE-bench && pip install -e .`

Community

5 comments