Hacker News

by Ryan Harman

GLM 5.2 Performance Benchmarks (artificialanalysis.ai)

121 points by theanonymousone 10 hours ago

wongarsu 6 hours ago

It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that allows LLMs to elect not to answer if they are unsure, and punishes them for trying to bullshit their way through.

corlinp 13 minutes ago

That one is a bit sus to me, because the models that do the worst on Omniscience Accuracy do the best on non-hallucination. The top model for this benchmark is "MiniCPM5-1B (Non-reasoning)" which gets a whopping 99% vs 45% for Fable 5.

I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.
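To make the concern concrete, here is a minimal sketch. It assumes the non-hallucination rate is one minus the share of wrong answers among all questions the model did not get right; that is one reading of the metric and is not confirmed anywhere in this thread. The function name and the question counts are hypothetical, chosen only to show how a model that almost always abstains can top this kind of metric.

```python
def omniscience_scores(correct: int, incorrect: int, abstained: int) -> tuple[float, float]:
    """Return (accuracy, non_hallucination_rate) under the ASSUMED definition."""
    total = correct + incorrect + abstained
    accuracy = correct / total
    not_correct = incorrect + abstained
    # Assumption: hallucination rate = incorrect / (incorrect + abstained)
    if not_correct == 0:
        return accuracy, 1.0
    return accuracy, 1 - incorrect / not_correct


# Hypothetical counts out of 1000 questions (not real benchmark data)
tiny_model = omniscience_scores(correct=20, incorrect=10, abstained=970)
large_model = omniscience_scores(correct=600, incorrect=220, abstained=180)

for name, (acc, non_hall) in [("tiny", tiny_model), ("large", large_model)]:
    print(f"{name}: accuracy={acc:.0%}, non-hallucination={non_hall:.0%}")
```

Under these made-up numbers the tiny model knows almost nothing but scores far higher on non-hallucination, simply because it rarely attempts an answer. That is why the metric only makes sense when read next to accuracy.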

SilverServer 3 hours ago

It took me a while to figure out how to interpret the benchmark correctly, because the overview page says "AA-Omniscience Non-Hallucination Rate," while the benchmark page https://artificialanalysis.ai/evaluations/omniscience#aa-omn... says "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.

andai 4 hours ago

This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?

mattalex 2 hours ago

The issue with having a "no answer" option is that you implicitly add a decision problem into your test that depends on the "cost" of answering wrong.

Specifically, your model now has two "correct" classes, p(class=y|x) and p(class=⊥|x). This makes the results ambiguous. The way you resolve this is by adding a cost of answering wrong (l_err) and a cost of not answering (l_⊥).

L(y, y') =

  0      if y' = y

  l_err  if y' ≠ y and y' ≠ ⊥

  l_⊥    if y' = ⊥

You can then estimate the expected error over your dataset. Notice that this now gives you additional degrees of freedom: Depending on how expensive answering wrong is compared to not answering at all, your predictor might be really bad or really good.

This means when benchmarking with a "no answer" action, you are often not actually benchmarking whether the model works well or not, but rather are benchmarking how well the model _happens_ to agree with the class-error weight you (implicitly) chose in your model.
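A small sketch of that effect, using the loss above with ⊥ represented as `None`. The two "models" and their predictions are invented for illustration; only the loss structure comes from the comment.

```python
ABSTAIN = None  # stands in for ⊥ ("no answer")


def loss(y, y_pred, l_err, l_abstain):
    """Loss from the comment above: 0 if right, l_err if wrong, l_abstain if no answer."""
    if y_pred is ABSTAIN:
        return l_abstain
    return 0.0 if y_pred == y else l_err


def mean_loss(labels, preds, l_err, l_abstain):
    """Average loss over a dataset."""
    return sum(loss(y, p, l_err, l_abstain) for y, p in zip(labels, preds)) / len(labels)


# Hypothetical predictions on 10 questions whose true answer is always "a"
labels = ["a"] * 10
guesser = ["a"] * 7 + ["b"] * 3        # 7 right, 3 wrong, never abstains
cautious = ["a"] * 5 + [ABSTAIN] * 5   # 5 right, 5 abstentions, never wrong

for l_err, l_abstain in [(1.0, 1.0), (1.0, 0.5), (4.0, 0.5)]:
    g = mean_loss(labels, guesser, l_err, l_abstain)
    c = mean_loss(labels, cautious, l_err, l_abstain)
    winner = "guesser" if g < c else "cautious"
    print(f"l_err={l_err}, l_abstain={l_abstain}: guesser={g:.2f}, cautious={c:.2f} -> {winner}")
```

The same two sets of predictions swap ranks as soon as abstaining becomes cheaper than a wrong answer, so the leaderboard order is partly a statement about the chosen weights.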

WarmWash 3 hours ago

There is a tradeoff where as factual accuracy increases, creativity decreases, and the model becomes more "rigid" and less general. Unfortunately it seems that creativity is a good quality for reasoning and ultimately problem solving.

So we have a situation where models that can solve challenging problems also tend to hallucinate, but those hallucinations seem to be the breeding ground for the solutions that earn them their high "wow" factor.

wongarsu 3 hours ago

Yes. Most benchmarks just measure how many answers are correct. The best way to optimize that is to confidently state something in the hope that it's correct, which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.

Imustaskforhelp 3 hours ago

whimblepop 3 hours ago

Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.

otabdeveloper4 3 hours ago

trouve_search 2 hours ago

A lot of benchmarks are set up to not punish false positives (irrelevant answers or extra text) but to punish false negatives (missing the snippet being looked for).

This leads to answer bloat and/or hallucination if you benchmaxx on those.
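As a rough illustration, assume a grader that only checks whether the expected snippet appears somewhere in the answer (a recall-style score). The helper names and example strings below are hypothetical and not taken from any particular benchmark.

```python
def recall(answer_tokens, gold_tokens):
    """Share of expected tokens found in the answer; extra text is never penalized."""
    gold = set(gold_tokens)
    return len(gold & set(answer_tokens)) / len(gold)


def precision(answer_tokens, gold_tokens):
    """Share of distinct answer tokens that were actually expected; penalizes bloat."""
    answer = set(answer_tokens)
    return len(answer & set(gold_tokens)) / len(answer)


gold = "paris".split()
concise = "paris".split()
bloated = "it could be london or berlin or paris or rome".split()

for name, answer in [("concise", concise), ("bloated", bloated)]:
    print(f"{name}: recall={recall(answer, gold):.2f}, precision={precision(answer, gold):.2f}")
```

Both answers get full marks on recall, so a model tuned against that score alone has no reason to stop listing candidates. Only a precision-style term separates them.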

Zababa 3 hours ago

They are, especially multiple-choice questions. The same happens with human exams:

Let's say there are 100 questions, with 4 answers each. A good answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.

If instead a wrong answer is worth -1 point, by just guessing you get on average -50/100 (25 right, 75 wrong), way worse than 0/100.
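The expected score of pure guessing can be worked out directly. This is a minimal sketch; the function name and parameters are made up for illustration.

```python
def expected_guess_score(n_questions, n_choices, reward=1.0, penalty=0.0):
    """Expected total score when every answer is a uniform random guess.

    `penalty` is the number of points subtracted for each wrong answer.
    """
    p_right = 1 / n_choices
    return n_questions * (p_right * reward - (1 - p_right) * penalty)


no_penalty = expected_guess_score(100, 4)                 # wrong answers cost nothing
minus_one = expected_guess_score(100, 4, penalty=1.0)     # wrong answers cost 1 point
break_even = expected_guess_score(100, 4, penalty=1 / 3)  # penalty of 1 / (n_choices - 1)

print(no_penalty, minus_one, round(break_even, 9))
```

With four choices, a penalty of 1/3 per wrong answer is the point where blind guessing is worth the same as not answering. Anything above that makes guessing a losing strategy.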

gertlabs 30 minutes ago

On our multi-agent coding and reasoning evaluations, GLM 5.2 is the first model we've tested that crossed the threshold of being on par with or better than Opus 4.6 (although, as usual, we rank GLM 5.2 and most other Chinese models a bit lower than most other benchmarks do, since their test methodologies are more vulnerable to benchmaxxing).

Data at https://gertlabs.com/rankings

lanycrost 6 hours ago

It's always nice to see open source models growing. I hope we will get good performance on lower-tier hardware some day.

theturtletalks 5 hours ago

I want to trust their benchmarks but when they have Muse Spark over GPT-5.5, it gives me pause.

mdasen 3 hours ago

Where do you see that? I see they have GPT-5.5 (xhigh) at 55, GPT-5.5 (high) at 53, and Muse Spark at 43. Muse Spark does beat GPT-5.4 mini (xhigh) which scores 40, but the key there is "mini".

In the coding index, GPT-5.5 gets 59.1, 58.5, 56.2, and 52.1 for xhigh, high, medium, and low while Muse Spark is behind at 47.5. For agentic, GPT-5.5 gets 74.1, 72.0, 69.4, and 59.7 (xhigh, high, medium, low) while Muse Spark gets 62.0 (beating only GPT-5.5 low).

GPT-5.5 only gets beaten by Opus 4.8 in their general index, is the top spot for coding, and is #3 behind Opus 4.8 and GLM-5.2 for agentic (excluding Fable 5 which takes the top spot, but is unavailable).

XCSme 5 hours ago

I also tested it[0]: quite similar to GLM 5, a few percent better, 30% faster and 50% more expensive.

[0]: https://aibenchy.com/?q=glm

benxh 4 hours ago

benchmark where gemini flash is better than fable btw.

XCSme 3 hours ago

Well, most people didn't like Fable when it was available anyway, because it refused to answer questions very often.

margalabargala 3 hours ago

XCSme 5 hours ago

PS: Just added a cool feature, so you can filter the leaderboard for multiple models at once, by using a comma, like: https://aibenchy.com/?q=glm,claude

lousken 5 hours ago

still 1/4 of the price of anthropic and openai models though

hemkeshr 4 hours ago

Local models are already useful today. The next milestone is getting this level of performance onto truly affordable hardware.

SV_BubbleTime 3 hours ago

NVidia has less than zero reason to ship cards ideal for this at low prices.

AMD’s stock price reflects a hope they launch a CUDA alternative. But this is unlikely for the near future.

There is a lot of interest in preventing China from coming in with cheap AI hardware.

So I expect the direction to be good local models that few can run effectively.

theplumber 3 hours ago

The Chinese will flood the market with cheap AI chips just like they did with EV cars. As consumers we can’t thank them enough.

omnimus 2 hours ago

binary132 3 hours ago

DeathArrow 6 hours ago

One or two more releases and they will reach Fable level.

vitalyan123 5 hours ago

by then there will be Fable 5.21, again 5% ahead of every other SotA while still only 500% the size.

mjhay 3 hours ago

There’s no way Anthropic can keep jacking up the prices like this for every marginally better model. I think even tokenmaxxing companies are going to soon balk at $50/million output tokens.

theplumber 3 hours ago

sourcecodeplz 5 hours ago

still quite verbose at 140m output tokens, but this is on max thinking. high should do better.

ChrisArchitect 5 hours ago

Some more discussion: https://news.ycombinator.com/item?id=48567759
