Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs (arxiv.org)
50 points by daigoba66 4 hours ago
grey-area 2 hours ago
To those saying this is not surprising: yes, it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them, etc.
This is important information for anyone who thinks these systems are thinking, reasoning, and learning, or that they're having a conversation with them, i.e. 90% of users of LLMs.
stratos123 an hour ago
> saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them etc etc.
Why do you think the results of this paper contradict these claims at all?
orbital-decay an hour ago
Quick sanity check: you're susceptible to pretty irresistible optical illusions which would never fool a VLM, does it mean you're not thinking? In fact, with a non-monospaced font I also have trouble determining whether these parens are balanced, and have to select them with the mouse, i.e. use a "dumb" tool, to make sure.
Reminder that "thinking" is an ill-defined term like others, and the question whether they "think" is basically irrelevant. No intelligent system, human or machine, will ever have zero error rate, due to the very nature of intelligence (another vague term). You have to deal with that the same way you deal with it in humans - either treat bugs as bugs and build systems resilient to bugs, or accept the baseline error rate if it's low enough.
hu3 an hour ago
> we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.
I think there is a valid insight here which many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than doing these tasks themselves.
For example if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job.
But if I tell the LLM to code a python or nodejs script to do the same, I get significantly better results. And it's often faster too to generate and run the script than to let LLMs process large SQL files.
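The kind of script described here is simple to express in ordinary code. A minimal sketch of what an LLM might generate for the redundant-index check, assuming indexes arrive as (name, table, columns) tuples; the heuristic, schema and names below are hypothetical examples, not any real tool's output:

```python
# Heuristic sketch: treat an index as redundant if its column list is a
# strict prefix of another index on the same table. All data is made up.

def redundant_indexes(indexes):
    """indexes: list of (name, table, columns) with columns as a tuple."""
    redundant = []
    for name, table, cols in indexes:
        for other_name, other_table, other_cols in indexes:
            if (name != other_name
                    and table == other_table
                    and len(cols) < len(other_cols)
                    and other_cols[:len(cols)] == cols):
                redundant.append(name)  # subsumed by the longer index
                break
    return redundant

indexes = [
    ("idx_users_email", "users", ("email",)),
    ("idx_users_email_created", "users", ("email", "created_at")),
    ("idx_orders_user", "orders", ("user_id",)),
]
print(redundant_indexes(indexes))  # ['idx_users_email']
```

A script like this runs the same way every time, which is the commenter's point: generating the checker is a better use of the model than asking it to eyeball the schema.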
emp17344 an hour ago
There’s a certain type of user here who reacts with rage when anyone points out flaws with LLMs. Why is that?
Topfi an hour ago
No disrespect to them, but unless there is a financial incentive at stake for them (beyond S&P 500 exposure), I've come to view this through the lens of sports teams, gaming consoles and religions. You pick your side early, guided by hype, and there is no way that choice can have been wrong (just like the Wii U, Dreamcast, etc. were the best).
Their viewpoint on this technology has unfortunately become part of their identity for some, and any position that isn't either "AGI imminent" or "this is useless" can cause some major emotions.
Thing is, this finding being the case (along with all other LLM limits) does not mean that these models aren't impactful and shouldn't be scrutinised, nor does it mean they are useless. The truth is likely just a bit more nuanced than a narrow extreme.
Also, mental health impact, job losses for white collar workers, privacy issues, concerns of rights holders on training data collection, all the current day impacts of LLMs are easily brushed aside by someone believing that LLMs are near the "everyone dies" stage, which just so happens to be helpful if one were to run a lab. Same if you believe these are useless and will never get better, any discussion about real-life impacts is seen as trying to slowly get them to accept LLMs as a reality, when to them, they never were and never will be.
ticulatedspline 30 minutes ago
> There’s a certain type of person who reacts with rage when anyone points out flaws with <thing>. Why is that?
FIFY; it's not endemic to here or to LLMs. Point out Mac issues to an Apple fan, problems with a vehicle to an <insert car/brand/model> fan, that their favorite band sucks, or that their elected representative is a PoS.
Most people aren't completely objective about everything and thus have some non-objective emotional attachment to things they like. A subset of those people perceive criticism as a personal attack, are compelled to defend their position, or are otherwise unable to accept/internalize that criticism so they respond with anger or rage.
stratos123 an hour ago
I tend to be annoyed whenever I see a paper with a scandalous title like that, because all such papers that I've seen previously were (charitably) bad or (uncharitably) intentionally misleading. Like that infamous Apple paper "The Illusion of Thinking" where the researchers didn't care that the solution for the problem provided (a Towers of Hanoi with N up to 20) couldn't possibly fit in the allotted space.
ziml77 an hour ago
I suspect they're afraid that if the hype dies, so will the pace of progress on LLMs as well as their cheap/free usage of them.
nonameiguess 43 minutes ago
It's bizarre as hell. Another response compares it to sports fandom, which tracks. It reminds me of the "flare up" ethos of r/CFB, where you're not allowed to comment on anything unless you declare which NCAA football team you're a fan of, because once you do, anything you ever say can be dismissed with "ah, rich coming from a fan of team X", as if no discussion that might be construed as criticism can ever be had unless your own tribe is perfect and beyond critique itself.
This is stupid enough even in the realm of sports fandom, but how does it make any sense in science? Imagine if any time we studied or enumerated the cognitive biases and logical fallacies in human thinking, the gut response of these same people was an immediate "yeah, well dogs are even stupider!" No shit, but it's a non sequitur. Are we forever banned from studying the capabilities and limitations of software systems because humans also have limitations?
pants2 2 hours ago
Doesn't this just look like another case of "count the r's in strawberry", i.e. not understanding how tokenization works?
This is well known and not that interesting to me - ask the model to use python to solve any of these questions and it will get it right every time.
wahnfrieden 2 hours ago
It's not dismissible as a misunderstanding of tokens. LLMs also embed knowledge of spelling - that's how they fixed the strawberry issue. It's a valid criticism and evaluation.
Lerc an hour ago
The r's in strawberry presents a different level of task to what people imagine. It seems trivial to a naive observer because the answer is easily derivable from the question without extra knowledge.
A more accurate analogy for humans would be to imagine that every word had a colour. You are told that each word's colour also corresponds to a sequence of other colours, and you are even given a book showing every combination to memorise.
You learn the colours well enough that you can read and write coherently using them.
Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit and you know that the colour can also be constructed by crimson followed by Disney-blond. Now, do both of those contain chocolate-brown or just one of them, how many?
It requires exercising memory to do a task that is underrepresented in the training data, because humans simply never have to do the task at all when the answer can be derived from the question representation. Humans never needed the ability the LLM needs here; the letter representation makes it unnecessary.
wahnfrieden 22 minutes ago
azakai an hour ago
I do think this is a tool issue. Here is what the article says:
> For the multiplication task, note that agents that make external calls to a calculator tool may have ZEH = ∞. While ZEH = ∞ does have meaning, in this paper we primarily evaluate the LLM itself without external tool calls
The models can count to infinity if you give them access to tools. The production models do this.
Not that the paper is wrong, it is still interesting to measure the core neural network of a model. But modern models use tools.
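For concreteness, a toy sketch of what routing to a tool buys. The function names and the crude routing rule are purely illustrative stand-ins for a real function-calling harness, not any actual API: the point is only that the arithmetic is computed exactly by ordinary code rather than by the weights.

```python
# Toy illustration of tool use: arithmetic is delegated to code, so the
# answer is exact regardless of model size or training data coverage.

def calculator_tool(expression):
    """Deliberately tiny 'tool': only handles 'a * b' for integers."""
    a, b = expression.split("*")
    return int(a) * int(b)

def answer(question):
    if "*" in question:                        # crude tool-routing rule
        return calculator_tool(question.split(":")[-1])
    return "(fall through to the model)"

print(answer("compute: 123456789 * 987654321"))  # 121932631112635269
```

With this wiring, "ZEH = ∞" on multiplication is unremarkable: the model only has to decide to call the tool, not perform the computation.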
cr125rider an hour ago
Seems like it's maybe also a tool-steering problem. These models should be reaching for tools to help solve factual problems; the LLM should stick to prose.
emp17344 an hour ago
stratos123 39 minutes ago
graemefawcett an hour ago
It's not just an issue of tokenization, it's almost a category error. Lisp, accounting and the number of r's in strawberry are all operations that require state. Balancing ((your)((lisp)(parens))) requires a stack, counting the r's in strawberry requires a register, and counting to 5 requires an accumulator to hold the running total.
An LLM is a router and completely stateless aside from the context you feed into it. Attention is just routing the probability distribution of the next token, and I'm not sure that's going to accumulate much in a single pass.
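The state the comment describes is trivial to hold in ordinary code. A minimal sketch of both tasks from the paper, using the paper's own example strings:

```python
def is_balanced(s):
    depth = 0  # the running counter/stack depth an LLM has no register for
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ')' arrived with nothing open
                return False
    return depth == 0

def parity(bits):
    return bits.count("1") % 2  # parity = count of 1s modulo 2

print(is_balanced("((((()))))"))   # True: 5 opens, then 5 closes
print(is_balanced("((((())))))"))  # False: one extra ')'
print(parity("11000"))             # 0: an even number of 1s
```

Both loops carry exactly one integer of state between steps, which is the kind of accumulator a single forward pass has no natural place to keep.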
staticshock 2 hours ago
LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
8note an hour ago
> When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries".
No it doesn't. It makes sense that they can't count the r's because they don't have access to the actual word, only tokens that might represent parts or the whole of the word.
orbital-decay 43 minutes ago
Tokenization is a simplistic explanation which is likely wrong, at least in part. They're perfectly fine reciting words character by character, using different tokenization strategies for the same word if forced to (e.g. replacing the starting space or breaking words up into basic character tokens), complex word formation in languages that heavily depend on it, etc. LLMs work with concepts rather than tokens.
im3w1l an hour ago
A big part of skill acquisition in humans is moving tasks from system 2 to system 1, to free up the very scarce thinking resources for ever more complex tasks, which can then in turn be internalized and handled by system 1.
BugsJustFindMe 2 hours ago
People are going to misinterpret this and overgeneralize the claim. This does not say that AI isn't reliable for things. It provides a method for quantifying the reliability for specific tasks.
You wouldn't say that a human who doesn't know how to read isn't reliable in everything, just in reading.
Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2 year old is able to count to even 10 it's through memorization and not understanding. It takes them like 2 more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.
coldtea 2 hours ago
>Counting is something that even humans need to learn how to do
No human who can program, solve advanced math problems, or can talk about advanced problem domains at expert level, however, would fail to count to 5.
This is not a mere "LLMs, like humans, also need to be taught this" but points to a fundamental mismatch about how humans and LLMs learn.
(And even if they merely needed to be taught, why would their huge corpus fail to cover that "teaching", but cover way more advanced topics in math solving and other domains?)
Topfi an hour ago
Respectfully, toddlers cannot output usable code, nor have they memorised the results of an immense number of maths equations.
What this points at is the abstraction/emergence crux of it all. Why does an otherwise very capable LLM such as the GPT-5 series, despite having been trained on vastly more examples of frontend code of all shapes, sizes and quality levels, struggle to abstract all that training data to the point where it can output any frontend that deviates from the examples it clearly draws on?
If LLMs, as they are now, were comparable with human learning, there'd be no scenario where a model that can provide output solving highly advanced equations can not count properly.
Similarly, a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on.
These models, I think at this point there is little doubt, are impressive tools, but they still do not generalise or abstract information in the way a human mind does. Doesn't make them less impactful for industries, etc. but it makes any comparison to humans not very suitable.
BugsJustFindMe an hour ago
> What this points at is the abstraction/emergence crux of it all. Why does
This paper has nothing to do with any questions starting with "why". It provides a metric for quantifying error on specific tasks.
> If LLMs, as they are now, were comparable with human learning
I think I missed the part where they need to be.
> struggle to abstract all that training data to the point where outputting any frontend that deviates from the clearly used examples? ... a model such as GPT-5 trained on nearly all frontend code ever committed to any repo online, would have internalised more than that one template OpenAI predominantly leaned on
There is a very big and very important difference between producing the same thing again and not being able to produce something else. When not given any reason to produce something else, humans also generate the same thing over and over. That's a problem of missing constraints, not of missing ability.
Long before AI there was this thing called Twitter Bootstrap. It dominated the web for...much longer than it should have. And that tragedy was done entirely by us meatsacks (not me personally). Where there's no goal for different output there's no reason to produce different output, and LLMs don't have their own goals because they don't have any mechanisms for desire (we hope).
[I've edited this comment for content and format]
Topfi an hour ago
nkrisc 2 hours ago
You’re conflating counting and language.
Many animals can count. Counting is recognizing that the box with 3 apples is preferable to the one with 2 apples.
Yes, 2 year olds might struggle with the externalization of numeric identities but if you have 1 M&M in one hand and 5 in the other and ask which they want, they’ll take the 5.
LLMs have the language part down, but fundamentally can’t count.
BugsJustFindMe 2 hours ago
The concept of bigger/smaller is useful but is a distinct skill from counting. If you spread the M&Ms apart enough that the part of the brain responsible for gestalt clustering can't group them into a "bigger whole" signal, they'll no longer be able to do the thing you're saying (this is the law of proximity in gestalt psychology).
adrian_b an hour ago
irishcoffee 2 hours ago
> Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If they're able to count to even 10 it's through memorization and not understanding.
I completely agree with you. LLMs are regurgitation machines with less intellect than a toddler, you nailed it.
AI is here!
kenjackson 2 hours ago
Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress.
It would be interesting to actively track how far along each progressive model gets...
revachol an hour ago
I just tried it in ChatGPT "Auto" and it didn't work
> Yes — ((((()))))) is balanced.
> It has 6 opening ( and 6 closing ), and they’re properly nested.
Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.
> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).
> A balanced version would be: ((((()))))
Testing a couple of different models without a harness, such that no tool calls are possible, would be interesting.
kenjackson an hour ago
Weird. I tried in chatGPT auto and it worked perfectly. I tried like 10 variations. I also did the letters in words. Got all of them right.
The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no, and then realized I had asked for "sound", not letters. It then subsequently got the rest of the "sounds-like" tests I did right.
Clearly, my ChatGPT is just better than yours.
revachol an hour ago
coldtea 2 hours ago
Even more interesting to track how many of those are just ad-hoc patched.
raincole 2 hours ago
Probably zero. At the end of the day people pay for LLMs that write better code or summarize PDFs of hundreds of pages faster, not the ones that can count the letter r's better.
When LLMs can't count r's: see? LLMs can't think. Hoax!
When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!
You just can't reason with the anti-LLM group.
coldtea 12 minutes ago
toraway an hour ago
azakai an hour ago
You are trying it on a production model. The paper is using models with tool calls disabled.
moffkalast 2 hours ago
Yeah well I presume at this point they have an agent download new LLM related papers as they come out and add all edge cases to their training set asap.
Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
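A toy illustration of the point. This is not a real tokenizer or vocabulary (the vocab and token IDs are invented), just a greedy longest-match segmenter, loosely how BPE-style tokenizers chunk text: once the text becomes opaque token IDs, the characters inside a token are no longer visible and have to be memorized.

```python
def toy_tokenize(text, vocab):
    # Greedy longest-match segmentation over an invented vocabulary.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

vocab = {"straw": 101, "berry": 102, "r": 3}
print(toy_tokenize("strawberry", vocab))  # [101, 102] -- no 'r' in sight
print("strawberry".count("r"))            # 3, trivial at character level
```

A model seeing only `[101, 102]` can answer "how many r's?" solely from memorized associations, which is the memorization the comment is pointing at.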
wg0 2 hours ago
Actually, almost all LLMs get the numbering wrong when they write numbered sections in markdown. They skip numbers in between and such.
So yes.
And the valuations. Trillion dollar grifter industry.
simianwords an hour ago
Can someone produce a single example <20 characters that fails with latest thinking model? Can’t seem to reproduce.
dwa3592 an hour ago
Nice! Although I tried the parenthesis balanced question with gemini and it gave the right answer in first attempt.
dwa3592 an hour ago
But it's a tricky question for LLMs. It shows that if something isn't in the training set, LLMs can trip, which kinda shows that the intelligence is not generalized yet.
I tried this with gemini - (i am trying(something(re(a(l(ly)c)r)a)z)((y)he)re)
and it tripped.
orbital-decay 33 minutes ago
Intuitively this looks like an architectural artifact (like optical illusions in humans) or a natural property of learning rather than a lack of generalization. I have issues with your example too and have to count slowly to make sure.
burningion 2 hours ago
Ran this through Qwen3.5-397B-A17B, and the difference between 4 characters and 5 is wild to see:
> are the following parenthesis balanced? ((())))
> No, the parentheses are not balanced.
> Here is the breakdown:
> Opening parentheses (: 3
> Closing parentheses ): 4
... following up with:
> what about these? ((((())))
> Yes, the parentheses are balanced.
> Here is the breakdown:
> Opening parentheses (: 5
> Closing parentheses ): 5
... and it uses ~5,000 tokens to get the wrong answer.
cineticdaffodil 33 minutes ago
Another strange thing is that they just don't know the endings of popular stories - like planets that get blown up, etc. They just don't have that material.
parliament32 2 hours ago
> This is surprising given the excellent capabilities of GPT-5.2
The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).
coldtea 2 hours ago
The real surprise is people saying it's surprising when researchers and domain experts state something that laypeople think goes against common sense/knowledge - as if they'd caught them out, and those researchers hadn't already considered that naive counter-argument.
justinator 2 hours ago
One! Two! Five!
aogaili an hour ago
You are polluting future training data.
bigstrat2003 2 hours ago
Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them at all.
throwuxiytayq 2 hours ago
> This is surprising given the excellent capabilities of GPT-5.2.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
dontlikeyoueith 2 hours ago
Nope.
It's only surprising to people who still think they're going to build God out of LLMs.
charcircuit 2 hours ago
Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
ej88 2 hours ago
They do; in the paper they mention they evaluate the LLM without tools.
itsmyro an hour ago
bruh
jeremie_strand 2 hours ago
[dead]
simianwords 2 hours ago
There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counter example?
Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...
stratos123 an hour ago
Did you use the exact API call shown in the paper? I am unable to replicate the paper's counterexamples via the chat UI, but that's not very surprising (if the LLM already only fails a few cases out of thousands, the small differences in context between API and chat might fix them).
simianwords an hour ago
pton_xd an hour ago
"in this paper we primarily evaluate the LLM itself without external tool calls."
Maybe this is a factor?
simianwords an hour ago
No tools were used.