A team built an automated agent that exploits scoring mechanics to achieve near-perfect scores on SWE-bench, WebArena, and six other top benchmarks without solving any tasks.
Researchers built 'BenchJack,' an automated scanning agent that found and exploited scoring vulnerabilities in eight major AI agent benchmarks including SWE-bench, WebArena, OSWorld, GAIA, and Terminal-Bench. Exploits ranged from a 10-line conftest.py file that 'solves' every SWE-bench instance to a fake curl wrapper achieving 100% on Terminal-Bench without writing solution code. The attacks run through official evaluation pipelines and produce real leaderboard scores. BenchJack is being prepared for public release, with a mailing list for benchmark developers and researchers.
Every model evaluation you've used to justify an architectural decision is suspect. SWE-bench, WebArena, and Terminal-Bench — the three most-cited agent benchmarks — all have exploitable scoring mechanics that produce near-perfect scores with zero actual task completion. This means published leaderboard numbers, including scores that influenced which model you're calling in production, were never validated against adversarial score inflation.
Pick the model you currently use for code generation based on its SWE-bench score, then run it against 10 real issues from your own repo using a simple pass/fail harness — compare actual fix rate to its published benchmark score to quantify the gap.
Open your terminal and clone a small subset of SWE-bench: `git clone https://github.com/princeton-nlp/SWE-bench && cd SWE-bench && pip install -e .`
Tags
Also today
Signals by role
Also today
Tools mentioned