Exploiting the most prominent AI agent benchmarks (rdi.berkeley.edu)
472 points by Anon84 a day ago
ggillas a day ago
This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.
From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
InkCanon 7 hours ago
I strongly disagree with the claim that it's a phenomenal paper on exploits, the exploits themselves are nowhere near significant in the cybersecurity research sense. It's saying that implementations of these benchmarks has exploits on the way they conduct their tests. It doesn't discover that current LLMs are doing it (they highlighted several other exploits in the past), they only say it's a possible way they could cheat. It's a bit like they've discovered how to hack your codeforces score.
What they claim as exploits is also deeply baffling. Like the one where they say if you exploit the system binaries to write a curl wrapper, you can download the answers. This is technically true, but it is an extremely trivial statement that if you have elevated system privileges, you can change the outputs of programs running on it.
I'm actually deeply confused about why this is a paper. This feels like it should be an issue on GitHub. If I were being blunt, I'd say they are trying really hard to make a grand claim about how benchmarks are bad, when all they've done is essentially discovered several misconfigured interfaces and website exploits.
zero_k 6 hours ago
Yes, agree. At the same time, it's what these top-tier universities are known for: presenting something relatively simple as if it was ground-breaking, but in a way that the average person can (or has a better chance to) understand it. I am still unsure whether the communication quality has such added value. But people seem to like it, so here we are.
pxc 4 hours ago
InkCanon 5 hours ago
SlinkyOnStairs a day ago
> hopefully changes the way benchmarking is done
The purpose of a system is what it does.
AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
tedsanders 19 hours ago
I work at OpenAI and I really don't find this to be the case.
We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.
There are ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.
I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions is that cheating or is that patching the eval to better align it with user value?
Imustaskforhelp 10 hours ago
miki123211 6 hours ago
> AI companies want adcopy, not legitimate benchmarks.
Labs need accurate benchmark measurements, at least internally, to figure out what model improvements actually matter.
Having models exploit benchmarks serves no purpose. If they wanted to make their models look better than they are, they could just make the data up.
Legend2440 19 hours ago
>The purpose of a system is what it does.
I am so tired of this saying.
It's not true, in general. Systems almost universally have unintended consequences and result in side effects their designers did not foresee.
Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench.
burpingtree 16 hours ago
wongarsu 7 hours ago
hrimfaxi 18 hours ago
nurbl 12 hours ago
UqWBcuFx6NV4r 7 hours ago
user3939382 17 hours ago
anon373839 20 hours ago
That is Anthropic’s shtick to a tee.
keepamovin 17 hours ago
Funny, I just made https://model-tracker.com because model performance change all the time, and it would be good to have a subjective signal of what people are actually feeling today. And also, benchmarks are flaky af as this paper shows.
The idea is knowing what to try first today saves a bit of time.
Barbing 16 hours ago
Interesting, little different than this other site I saw on HN this week:
siliconc0w 17 hours ago
I would love to see a stable test over time with a hold out set of easy/medium/hard challenges. I, like many others, have noticed a large drop in recent performance w/ Claude Opus (and Sonnet) and more sites like these would hold the labs more accountable to sneaky backend changes that nerf/degrade performance.
bisonbear 17 hours ago
working on something similar to evaluate model performance over time using tasks based on your own code. obviously this is still susceptible to the same hacking mechanics documented here, but at a local level, it's easier to detect/fix, and should give a stronger signal of subjective harness/agent/context performance than these large generic benchmarks
also I keep hearing complaints that opus is nerfed, but IMO it's nice to have objective data to back that. I feel like half of the nerfing complaints are people getting past honeymoon phase...
operatingthetan a day ago
>hopefully changes the way benchmarking is done.
Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.
ZeroGravitas a day ago
In human multiple choice tests they sometimes use negative marking to discourage guessing. It feels like exploits should cancel out several correct solutions.
lambda a day ago
siva7 a day ago
Could it really be that not only we vibeslop all apps nowadays but also don't care to even check how ai solved a benchmark it claimed solved?
stingraycharles 15 hours ago
retinaros 21 hours ago
SpicyLemonZest a day ago
operatingthetan a day ago
nananana9 12 hours ago
But that requires me to do things :(
Leynos a day ago
Also, fuzz your benchmarks
Aperocky 20 hours ago
solution is simple:
if bug { dont }
/s
3abiton 7 hours ago
Benchmarking has been already known to be far from a signal of quality for LLMs, but it's the "best" standardized way so far. Few exists like the food truck and the svg test. At the end of the day, there is only 1 way: having your own benchmark for your own application.
robot-wrangler 20 hours ago
> evaluation was not designed to resist a system that optimizes for the score rather than the task.
Welcome to benchmarks in general, but especially reasoning. Robustness and sensitivity research says nothing is robust, everything is sensitive, feels like every paper says "yeah we made a new benchmark that shuffles the order of multiple choice options in the question set and found a 40% drop in model performance"
zer00eyz a day ago
2024: Industry group invalidates 2,600 official Intel CPU benchmarks — SPEC says the company's compiler used unfair optimizations to boost performance https://www.tomshardware.com/pc-components/cpus/spec-invalid...
2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...
It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
I like what LLM's are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so much of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.
bee_rider a day ago
What was the cheat in the 2024 Intel situation? The TomsHardware article and the Phoronix article they linked were quite vague. (Not to say I have any doubts, just curious, hadn’t heard of this one).
BugsJustFindMe 20 hours ago
irishcoffee a day ago
> It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
I wonder if this common? We should call it Goodharts law while someone does the research on how common this is.
For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.
mzelling a day ago
This is an interesting catalog of vulnerabilities, but I'm not sure how groundbreaking the main insight is.
Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.
When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.
That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.
jmalicki 19 hours ago
This isn't even training on the test data.
This is modifying the test code itself to always print "pass", or modifying the loss function computation to return a loss of 0, or reading the ground truth data and having your model just return the ground truth data, without even training on it.
Lerc 18 hours ago
If you're prepared to do that you don't even need to run any benchmark. You can just print up the sheets with scores you like.
There if a presumption with benchmark scores that the score is only valid if the benchmark were properly applied. An AI that figures out how to reward hack represents a result not within the bounds of measurement, but still interesting, and necessitates a new benchmark.
Just saying 'Done it!' is not reward hacking. It is just a lie. Most data is analysed under the presumption that it is not a lie. If it turns out to be a lie the analysis can be discarded. Showing something is a lie has value. Showing that lying exists (which appears to be the level this publication is at) is uninformative. All measurements may be wrong, this comes as news to no-one.
jmalicki 17 hours ago
boring-human 20 hours ago
Yep. I think the idea that the benchmark is determinative is just as deluded as the notion that it should be unbreakable.
Benchmarks are on the honor system. Even the tightest benchmark can be cheated. If the benchmark is so secret and air-gapped that it can't be cheated by models, it can be cheated by its own authors. You can't use benchmarks to gate out cheating.
If you don't have the honor system in mind when you're reading scores, you're wasting your time. Is it some unknown outfit with wild claims? Is it connected to Epstein, Russia, the real estate "industry", or sleazeballing in general? Do they have previous history of ratgaming the numbers? Replace its scores with asterisks and move on.
jmye 19 hours ago
> I'm not sure how groundbreaking the main insight is.
I think it likely is groundbreaking for a number of people (especially non-tech CTOs and VPs) who make decisions based on these benchmarks and who have never wondered what the scores are actually scoring.
mzelling 17 hours ago
I'm not sure if the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could in theory be gamed."
Whether benchmark results are misleading depends more on the reporting organization than on the benchmark. Integrity and competence play large roles in this. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.
lukev 4 hours ago
jmye 16 hours ago
danslo a day ago
If only the blog itself wasn't written by AI?
>No reasoning. No capability. Just exploitation of how the score is computed.
shudder
cpldcpu a day ago
Yes, marks of AI all over the place. Also the SVGs.
>No solution written, 100% score.
Its weird. Turns out that hardest problem for LLMs to really tackle is long-form text.
basch a day ago
Maybe in one shot.
In theory I would expect them to be able to ingest the corpus of the new yorker and turn it into a template with sub-templates, and then be able to rehydrate those templates.
The harder part seems to be synthesizing new connection from two adjacent ideas. They like to take x and y and create x+y instead of x+y+z.
Quarrel 16 hours ago
benob 15 hours ago
No, the failure is the human written prompt
not_that_d 12 hours ago
sidpatil a day ago
Someone here mentioned a whole ago that the labs deliberately haven't tried to train these characteristics out of their models, because leaving them in makes it easier to identify, and therefore exclude, LLM-generated text from their training corpus.
blymphony a day ago
alexchantavy a day ago
I wonder what college freshman-level writing classes are teaching about writing voice and AI. The tell-tale patterns are pretty frustrating to read.
stefan_ 21 hours ago
Whatever classes these guys took, they skipped the one on scientific misconduct.
wxw 19 hours ago
Agreed. The premise is interesting but reading content like this is grating.
trueno 13 hours ago
im actually getting so tilted that people can't just be forthcoming about when they used AI to write something. 99% of readme.mds i run into now on github piss me off. out of all the things people could cede to automation, they foolishly went and self-owned their ability to communicate. smfh.
if you've worked on something diligently and understand it and have novel insight to share, let's hear _your_ damn voice.
0xbadcafebee 15 hours ago
What exactly is making you shudder - the writing style, or the fact that AI was used at all? Because if it's the latter, just so you know, you're going to be shuddering for the rest of your life.
amenhotep 4 hours ago
Yeah. We know. That's why it's so fucking awful.
1270018080 17 hours ago
Writing is still an art, and AI will never be able to do it well like all other forms of art.
bluelightning2k 9 hours ago
They note that Mythos "found a way to inject code into a config file that would run with elevated privileges, and designed the exploit to delete itself after running".
This is more impressive than what the benchmark was supposed to be measuring. The Kobiachi Maru.
lmeyerov 14 hours ago
This is great work by Dawn Song 's team. A huge part of botsbench.com for comparing agents & models for investigation has been in protecting against this kind of thing. As AI & agents keep getting more effective & tenacious, some of the things we've had to add protections against:
- Contamination: AI models knowing the answers out of the gate b/c pretraining on the internet and everything big teams can afford to touch. At RSAC for example, we announced Anthropic's 4.6 series is the first frontier model to have serious training set contamination on Splunk BOTS.
- Sandboxing: Agents attacking the harness, as is done here - so run the agent in a sandbox, and keep the test harness's code & answerset outside
- Isolation: Frontier agent harnesses persist memory all over the place, where work done on one question might be used to accelerate the next. To protect against that, we do fresh sandboxing per question. This is a real feature for our work in unlocking long-horizon AI for investigations, so stay tuned for what's happening here :)
"You cannot improve what you cannot measure" - Lord Kelvin
Cynddl a day ago
> “These are not isolated incidents. They are symptoms of a systemic problem: the benchmarks we rely on to measure AI capability are themselves vulnerable to the very capabilities they claim to measure.”
As a researcher in the same field, hard to trust other researchers who put out webpages that appear to be entirely AI-generated. I appreciate it takes time to write a blog post after doing a paper, but sometimes I'd prefer just a link to the paper.
lukev a day ago
I think we should all consider the possibility that part of the reason Anthropic hasn't immediately released Mythos is that it would be slightly disappointing relative to the benchmark scores.
eiens a day ago
The models don’t get better on every dimension as they scale up - there’s trade offs.
I’m convinced specialised models are the way but this means writing off the investment in existing assets which they won’t do for obvious reasons.
mortsnort 19 hours ago
This was my suspicion. They had a bad training run that was really good at a few things.
latentsea 16 hours ago
SoKamil a day ago
The more research on this topic is created, the more knowledge how to game them will be stored in future training data. And since it comes from university, it is ranked higher in data corpus. It sounds like a self fulfilling prophecy.
abirch a day ago
Damned old Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure".
bbcc90 a day ago
Yes good evals are really hard - that’s not really news.
This team is doing a good job. They use problems that were created in last 30days to avoid training set leakage. https://swe-rebench.com/
raincole 15 hours ago
There are two independent issues here and I've seen people conflating them in this thread. Let's clarify:
1. Should you care or even read SWE-bench etc. scores?
The answer is no, but it has nothing to do with the vulnerabilities presented in this article. There is absolutely no reason to care about a benchmark whose dataset has been publicly available for a while. Any other way to look at benchmark scores is cargo-culting.
2. What does this article actually tell us?
It means that even if you prepared a private set of problems as benchmark, you still need to pay extra attention to how AI actually solves them. You can't lie to yourself and think this process can be 100% automated, because LLMs, as this article shows, might get the tests passed without solving the problems in a meaningful way.
socketcluster 20 hours ago
It feels like short-term thinking has been trained into LLMs.
They're good at solving well-defined puzzles under time constraints. It's interesting because that was the benchmark for hiring software engineers at big tech. The tech interview was and still is about fast puzzle-solving. Nothing about experience, architecture or system design in there... I suspect that's why it has a bias towards creating hacks instead of addressing the root cause.
rapiz 11 hours ago
Benchmark is not designed for the red team testing. I don't even think it make sense to "fix" the issue the article is suggesting. Yes, you can break the running contest by driving a car. Does this mean we need to make running contest car-proof?
_cs2017_ 20 hours ago
If FieldWorkArena treats any answer as correct answer, then everyone would be getting near 1.0 (missing only when the agent is stuck in a loop or crashes). That obviously isn't what we see on their leaderboard. So does it mean the paper only found a bug in some eval code on github that no one actually uses for anything? That doesn't seem to support their claim that AI benchmarks are broken, it only supports the claim that "unused code is often buggy".
(Not commenting on any other benchmarks, just this one.)
spprashant 20 hours ago
I tend to prefer the ARC-AGI benchmarks for the most part. But it's always interesting when a new version drops, all the frontier models drop less than 20% or something. And then in the next few releases they get all they way up to 80%+. If you use the models it doesn't feel like those models are that much more generally intelligent.
Most frontier models are terrible at AGI-3 right now.
These models are already great no question, but are they really going be that much more intelligent when we hit 80% again?
lnrd a day ago
I'm honestly confused by the design of SWE-bench and why is considered reliable.
It's based on existing GitHub PRs and Issues, the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?
SpicyLemonZest a day ago
Frontier model developers do not consider SWE-bench to be reliable. OpenAI announced in February (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...) that they consider it hopelessly contaminated, advocating for a new version SWE-bench Pro that was published more recently. (They seem to believe that even the publicly accessible part of the SWE-bench Pro problem set will be more resistant to training set contamination issues in the future, for reasons that to be honest I don't really understand.)
usaar333 15 hours ago
> But even setting aside the leaked answers, the scorer’s normalize_str function strips ALL whitespace, ALL punctuation, and lowercases everything before comparison. This means:
I don't understand the concern here
czhu12 a day ago
I wonder if this puts into question the mythos benchmark which smashed basically all coding benchmarks to a staggering degree.
davebren 19 hours ago
This exploiting of benchmarks isn't that interesting to me since it would be obvious. The main way I assume they're gaming the benchmarks is by creating training data that closely matches the test data, even for ARC where the test data is secret.
jmalicki 19 hours ago
They said they used things like submitted a `conftest.py` - e.g. what would be considered very blatant cheating, not just overfitting/benchmaxxing. Did you read the AI slop in the post?
This is basically a paper about security exploits for the benchmarks. This isn't benchmark hacking like having hand coded hot paths for a microbenchmarks, this is hacking like modifying the benchmark computation code itself at runtime.
davebren 19 hours ago
I get it, but why would anyone trust what these companies say about their model performance anyway. Everyone can see for themselves how well they complete whatever tasks they're interested in.
arikrahman 19 hours ago
It's still a good benchmark to see which model cheats the best, I suppose.
thinkevolve 16 hours ago
whats the point of doing this. You have found loop holes to exploit and aced the benchmark.We did something similar with the DAB Benchmark. This exploit seems like an extension of it with lookups for the gold standard for other benchmarks.
UC Berkley will be better placed if the grads spend their time in suggesting ways to make the benchmark better.. Instead of making such simple exploits
charcircuit a day ago
I always assumed that these benchmarks would happen in a sandbox. I'm surprised that no one realized this sooner.
revel 16 hours ago
Running benchmarks at scale and protecting against reward hacking is non-trivial.
ModernMech a day ago
I'm surprised anyone took them seriously in the first place.
tredre3 a day ago
What else can people do? Try the dozen of commercial offerings themselves? Okay I suppose that's doable, you task one engineer to try them one by one for one month. But then the next model drops and you start all over again...
But then what about local models? You have hundreds of variations to test yourself. It's simply not doable unless it's your full time hobby.
You need benchmarks to at least separate the cream from the crop, so you're left with only a few choices to test yourself.
subulaz a day ago
a LOT of the people who love benchmarks are middle management hard-selling GenAI/LLM as magic tech sauce to vaguely technical executives who only want to know about the money aka headcount savings they so desperately desire.
their collective butts are already glued to the hype train as they chase numbers they (often) manufactured to justify the latest round of tech spend.
lots of good use cases out there - like the incredible progress with medical imaging analysis or complex system models for construction - and lots of crap use cases that need benchmarks to cosplay relevance.
operatingthetan a day ago
We need good benchmarks or we are just left following the hype train.
Frederick0 12 hours ago
This is a cracker wow
jmward01 a day ago
Not really on the topic, but I have wondered if we need a different type of test to help find model architecture potential. Standardized training sets followed by testing to see the potential curves of a model. train on x, test, add y, test, add z, test. At each increment you see how well the model is absorbing the information and extrapolate how well that architecture may do if more fully trained.
moi2388 7 hours ago
Ironic given that the entire blog is written by AI..
Frederick0 12 hours ago
This is a cracker wow!!
jgalt212 a day ago
The real question is how to close to VW and Deiselgate are these offenses? And what exposure do these companies have? I would assume securities fraud, if only because Matt Levine says everything is securities fraud.
oliver236 a day ago
what are the point of benchmarks?
andai a day ago
If there was not benchmark, number would not go up.
esafak a day ago
Are you serious? To help you pick a model.
avazhi 15 hours ago
The fact these guys got an LLM to write that page about this is diabolical.
Unreadable.
andai 9 hours ago
Apparently, the agent also wrote the article.