GLM-5.2 is a step change for open agents (interconnects.ai)
344 points by vantareed 3 days ago
jerojero 2 days ago
Open weight models from Chinese labs tend to be significantly cheaper.
I think theyre absolutely needed. I can't afford 200 USD a month for personal use of coding AI, and I don't think such prices are reasonable for most of the world economy anyway. Not to mention US firms might be giving their employees a lot more than that.
It's increasingly feeling, to me, that theres a gap building up between haves and have nots. But then, we get news of these open weight models that are reasonably priced in inference with reasonable capabilities. Yes, they take maybe 6-9 months to get there, tbh, that's not a bad trade off at all.
fbrncci 17 hours ago
You made me realize something. I routinely spend upwards of 500$ per month on LLMs for coding (expensed towards clients). However I live in a place where 500$ is around the avg. salary. I’m lucky that I know my way around western clients. Clients who pay these expenses and are happy to work with me because I am still about 50% cheaper than local talent in EU/US, while my salary at home converts to an upper class income at the highest tax bracket.
Which of course causes some unfairness on both ends. Nobody here can compete with me. I often use left over tokens on local client projects; which despite lower pay, still pays off because they now take hours not days or weeks to complete. And nobody in the local clients talent pool can compete with me; unless they charge about half the market rate.
Take away my 500$ monthly grant; and I’d be more or less screwed. Better open models will more or less start to reduce this advantage. It’s not like I positioned myself here on purpose. But it’s definitely a „right place, right time“ situation.
whazor 6 hours ago
The problem is that the differences between flagship and local models are compounding heavily. An 4% different could be massive when you keep iterating on the same code base.
swiftcoder 4 hours ago
listic 15 hours ago
Thanks for sharing your insight.
Mind if I ask you for a few vibe coding tips? I failed to solve you gh puzzle in the profile though.
swader999 17 hours ago
If you are running multiple agents your cost to them should be multiples less what their roi is.
fbrncci 16 hours ago
lanthissa 8 hours ago
AI is the first technology that doesn't incentivize offshoring, and incentivizes co-location of talent.
A NYC dev and a dev in india have the same ai costs, based the ratio tokens/salary it becomes less of comparative disadvantage to be in NYC.
Now combine that with the fact that AI makes the act of generating code less a % time of the job, and the ability to get/refine requirements more of the job and you have a decent shift.
Sammi 7 hours ago
Fr0styMatt88 18 hours ago
If we can agree that the AI model is at least as capable as a junior engineer or new contractor, how’s that different to saying “software engineering isn’t worth $200 a month”?
Has a very race-to-the-bottom feel to it.
Though in the grand scheme of it, $200/mo probably isn’t the real price either. Also looking at it not just in a vacuum - paying for a product that can change what you get from under you doesn’t seem great anyway.
At least with a locally-hosted model you know what you’re getting.
matheusmoreira 17 hours ago
Yeah. There's no way to verify what these providers are doing. The real future is running these models at home. Opus level inference on our own hardware would be a dream come true.
baq 6 hours ago
IncreasePosts 16 hours ago
RazorBucksICO 5 hours ago
The appropriate price is what the output is worth to you. Some people could pay $10,000/month, some $5 and feel like they were breaking even. There is a big jump between convenience and curiosity uses versus business critical.
OpenAI already charges enterprise users a premium purely for that title over on-demand, no-contract usage. Retail users get a good deal. People make a lot of hay about subsidies but this is a very sane approach if you want exposure to these three different types of customers.
cameldrv 23 minutes ago
Yes, but you’re paying with your data unless you’re hosting with a provider you trust or self-hosting.
tacomagick a day ago
DeepSeek through their own API has saved me tons of tokens honestly. Even though it is not as smart as Kimi or Claude, their level of entry is very low with a top up of 2$ and Pay as you go compared to the subscription of Claude or 20$ top up of Kimi
praveer13 20 hours ago
For personal use I’m considering using the frontier models from openai or anthropic to create a plan with research and brainstorming etc with enough details for cheap models to be able to follow (glm, deepseek etc) - with openrouter - will monitor how cheap and effective that turns out to be.
ImaCake 18 hours ago
lionkor 11 hours ago
tacomagick 12 hours ago
mdjxnxnxnd 8 hours ago
giancarlostoro 4 hours ago
As much as I don't like Mark Zuckerberg, part of me wishes he would get his head in the game and compete with these models, he's literally got all the capability to do so, and he could easily sell the model through deals with GCP, AWS, and Azure. Hell, Amazon needs a hot model they can host that's exclusive to them I feel like, maybe he can work something out with them, whatever the case, it seems so glaringly obvious to me, I'm not sure why he hasn't taken a stab at competing with Claude Code or at least frontier open models and then cutting a deal with cloud providers to recoup the costs of maintaining said models.
He's sitting on a frontier model letting it burn a hole in his wallet that could actually pay for itself.
khurs 4 hours ago
Meta internally have been using Google Gemini
"Meta has been using Google’s Gemini large language model for most of its moderation and customer support, but staff have recently been told to switch to Meta’s new foundational model, Muse Spark, the people said."
https://www.ft.com/content/39251a31-4a9d-4870-b86c-dc6353d67...
giancarlostoro 4 hours ago
arikrahman 18 hours ago
Someone else on this forum put it well, U.S. is trying to achieve AGI at all costs, while Chinese models are seeking widespread adoption.
rglullis 8 hours ago
> U.S. is trying to achieve AGI at all costs
If that was true, they would be collaborating with each other and opening up all the results from their work.
lionkor 11 hours ago
None of the AI companies in the US are on the path to AGI. They are, however, on the path to claiming they have AGI, then subsequently not releasing it and only giving it to the US government to make drones that can bomb the homes of political dissidents.
dotancohen 9 hours ago
azinman2 17 hours ago
I don't think anthropic/openai/google aren't also seeing widespread adoption. In fact they already have they already have the marketshare.
Turskarama 13 hours ago
tsss 9 hours ago
Everyone wants widespread adoption, of course. I'm sure that China is also working on more expensive frontier intelligence models behind doors, but they're lagging behind America on that front. Going for cost-optimized open weight models is their bet to stay relevant in a market where they can't compete for the "luxury" segment. It is important for them to get a foot in the door and maintain a presence in the press to attract future customers, given the general animosity towards China in the west that they need to overcome. Similarly, European providers like Mistral are hopelessly outclassed in every respect and thus try to carve out a niche in the market with regulation and anti-American fearmongering. They position themselves as "privacy-conscious" not out of goodwill but because it is their only chance to survive as a company with an utterly inferior product.
ImaCake 18 hours ago
Significantly cheaper than comparable models if you are using openrouter [0]. Just yesterday I spent roughly 13 cents centering some divs using Deepseek in a personal project. It would have been north of $1 to do that with a US frontier model.
0. https://openrouter.ai/compare/z-ai/glm-5.2/anthropic/claude-...
ipaddr 11 hours ago
For centering divs the free models opencode offers can easily handle that work. DeepSeek V4 Flash is pretty decent.
ImaCake 7 hours ago
narrator 7 hours ago
The tokens cost the same everywhere on earth. This does hurt some cost advantages of outsourcing when tokens start to become a bigger part of development costs.
brian-armstrong 12 hours ago
I read these stories and I can never figure out how people are managing to use these $200 plans. If I really go full bore, I can sometimes max out the $20 plan. Even then, it already produces more code than I can reasonably review and merge.
ipaddr 12 hours ago
I've maxed out my chatgpt plus the first week and that include an smf forum rewrite. Trying my best I haven't been able to max out again. Things are setup that you need to max out your 5 hour window multiple times which becomes a job in itself.
At work I'm struggling to keep my claude bill around $500.
girvo 8 hours ago
Simple: a lot of the people claiming they’re reviewing the output of these models are lying.
Also if you run the “loops” they’re now yapping about, it will burn through enormous amounts of usage as well.
theoli an hour ago
hgomersall 7 hours ago
RugnirViking 7 hours ago
do you do it for a job (8 hours a day)? and do you work in large, mature projects (more than 5 team members)? A big part of it is dealing with frankly terrible architecture and 15 people's different ideas of how things should work (and the spam theyve been able to do with their own agents makes this worse)
matheusmoreira 17 hours ago
> It's increasingly feeling, to me, that theres a gap building up between haves and have nots.
People speak of a permanent underclass.
https://www.nytimes.com/2026/04/30/opinion/ai-labor-work-for...
alpineman 11 hours ago
With open weight models there is true inference competition. Whoever can serve the model at the lowest price. And the consumer wins. Capitalism, served by China.
throwaway-blaze 18 hours ago
Just don't ask it to tell you the events of June 4, 1989.
swingboy 7 hours ago
My work involves asking LLMs about both Tianenmen Square and what’s going on in Gaza, so I can’t use Chinese or American models!
girvo 8 hours ago
Not that it matters but most of the open weight models aren’t actually censored that way: they run another layer on top of to do that. At least some of them do, Step 3.7 Flash locally happily tells me about the Tiananmen Square massacre
ttoinou 2 days ago
200 is much less than the value you’re supposed to get out of it. If it’s not then yeah go ahead and use cheaper models with worst quality
martinjc 20 hours ago
Are you aware of how much purchasing power 200 dollars is in china, brazil, thailand or india is? This is an extremely arrogant take.
nwienert 17 hours ago
dash2 12 hours ago
mrngld 5 hours ago
matheusmoreira 17 hours ago
Dayshine 2 days ago
I'm not sure how I'm supposed to get $200 of value out of personal use!
LPisGood 20 hours ago
devmor 20 hours ago
holoduke 19 hours ago
uberex 19 hours ago
Unless that value is $200 cash in hand it will be hard to afford it for people who just don't have $200.
margalabargala 19 hours ago
Last time you bought a computer, did you buy the absolute fastest best CPU available?
girvo 18 hours ago
smrtinsert 19 hours ago
I've actually come to believe the overwhelming majority of use cases require nowhere frontier quality so there's that. Much faster execution is just a bonus on top of the much reduced cost
geye1234 13 minutes ago
Curious to hear if anyone has tried running the 2-bit or 3-bit quantization of this. With a bit of investment I may just be able to swing it locally. I already have 96GB VRAM, so with 192GB RAM, which seems to be the most one can find these days with a 4-slot motherboard, I may be in with a shot. Yes, it'd be slow, but I could give it overnight jobs. But I don't know if running at such a low quantization would make it hallucinate with only a small context.
Qwen and Gemma are great, but they need babysitting every 30 mins, which is quite a cognitive load.
christophilus 17 hours ago
I've been working with Deepseek V4 Flash (with opencode as the harness). It's been almost indistinguishable from Codex / Claude Code for me. I'm sure I'll run into problems when I get to a stickier ticket to tackle. But so far, it's been quite good, and I find it writes straightforward code.
I do think the Chinese models are good enough for an 80/20 rule use case.
mark_l_watson 5 hours ago
I also use DeepSeek v4 flash and v4 pro, but I can’t settle between using Claude Code or OpenCode and it seems like I waste time switching back and forth (especially keeping my personal SKILLs files synced). On one hand, a ton of engineering work has gone into Claude Code, on the other hand all Chinese models I have tried with OpenCode seem well configured out of the box.
I was thrilled to have Gemini Ultra for a month and use as many Opus tokens with AntiGravity as I could use, but I am happier using less capable models like DeepSeek knowing that it is more fun to do more of the work myself, it is a smaller hit on the environment, and incredibly cheaper.
scottchiefbaker 15 hours ago
I tried Deepseek V4 Flash with very low expectations and was pleasantly surprised. It's a surprisingly capable model for the price.
timcobb 15 hours ago
What provider(s) do you use?
saaspirant 9 hours ago
vagrantJin 5 hours ago
That v4 quality is available to everyone in the world for a pittance is beyond remarkable.
solarkraft 6 hours ago
I use Pro because I’m insensitive to the price difference, but also found Flash very capable in OpenCode.
nunodonato 9 hours ago
it would be a really great option if it didn't lack vision
pizzafeelsright 14 minutes ago
this is mcp or custom call to lowest cost model
someone did a webcam + agentic + capture of other computer bios/boot -> upload to image model -> back to agent
RugnirViking 7 hours ago
what do you use vision for? I have failed to find a workflow with it that makes sense, asking it to review screenshots of websites or whatever it misses extremely obvious details like text flowing out of it's container/overlapping other text, things being in entirely the wrong place, etc.
bckr 3 hours ago
cromka 8 hours ago
For coding?
guybedo 17 hours ago
GLM-5.2 has been a step change in how fast i can burn through tokens.
I subscribed to their max plan to try it out. It counted me 700M tokens and drained my weekly quota in under 2 days.
Quota just reset less than 24h ago and i'm already >60% weekly quota usage.
For reference the kind of work i did would have used somewhere between 3% and 5% of Codex max or Claude max.
The model is good, the plan is a scam
try-working 14 hours ago
Kimi and GLM models have coined a new term: Thinkslop. They run a chain of thought that is up to 10x longer than other models and it seems that through a lookback mechanism they are able to use the CoT to reason about solutions to tasks they couldn't otherwise solve.
The downside is of course that they consume many more tokens off your plan, and also that they are significantly slower. Kimi K2.7 takes about 7x longer to finish the same benchmark tasks as DeepSeek V4 Pro on my router benchmarks (https://role-model.dev/).
So for now I'm happy with just two models: GPT and DeepSeek.
PhilippGille 9 hours ago
> Kimi and GLM models have coined a new term: Thinkslop. > [...] > So for now I'm happy with just two models: GPT and DeepSeek.
1. DeepSeek V3.2, V4 Flash, V4 Pro, at high or max thinking, ... when recommending a model it should always be a precise model, not just an AI lab
2. DeepSeek V4 Flash at max thinking is the most verbose model (among top models) in the AA benchmarks. See the "Intelligence Index Token Use" chart: [1]
[1]: https://artificialanalysis.ai/models?models=gpt-5-5-high%2Cg...
try-working 5 hours ago
guybedo 14 hours ago
yeah Kimi K2.7 was doing ok but was painfully slow. The coding plan limits were good though.
I haven't tried deepseek yet, i should check this one out.
try-working 12 hours ago
spwa4 11 hours ago
Turning up the thinking (max time spent thinking) lever really changes model performance, even for tiny models. But it's really irritating because it adds a lot of time.
thefourthchime 2 hours ago
I gave it my standard:
"Make a pac-man game in a single html page"
It went off and argued with itself for 20 minutes about how to lay out the map and then timed out.
jubilanti 17 hours ago
> The model is good, the plan is a scam
If it is needing to generate that many tokens to do the same tasks, then it probably has higher inference costs. So (for you) the model is bad, the plan is the same plan.
anatoliikmt 15 hours ago
What kind of tasks have you been using it for?
aunty_helen 19 hours ago
I signed up to a z.ai max account, $144. Hardly been able to use it as it 429s on most requests. They’re also refusing to refund me.
rescbr an hour ago
For me it works fine outside of afternoons Beijing time on weekdays.
osti 18 hours ago
Even as a GLM z.ai fan, I wouldn't pay for their plans. They are just way worse values than gpt or anthropic plans, in terms of both usage and capabilities.
ticoombs 6 hours ago
Opencode Go subscription has served me well.
reissbaker 14 hours ago
Self-promo but you should try our service synthetic.new. We generally have up-to-date open-source LLMs on the sub, and we have GLM-5.2 :) Perf+stability should be wayyy better than zai.
dotancohen 9 hours ago
What do you do differently that you expect to have better performance than an experienced, established player?
mtlynch 5 hours ago
guybedo 17 hours ago
same here. Barely usable due to API connections issues.
And when i can use it, it just drains the quota 5 times faster than codex or claude.
Their plan is a scam
sergiotapia 18 hours ago
My experience as well unfortunately :(
timcobb 19 hours ago
Can people share their GLM and open model setups in general please? What provider do you use. Why do you trust it with serving full quality? What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps). I am just trying GLM 5.1 from Nvidia build in open code would love to hear how you all do it, thanks.
59nadir 9 hours ago
> What provider do you use?
1. My own harness + Local (which usually means Qwen3.6-35B-A3B), I use this fairly often for research gathering on topics, info gathering on code bases, etc.
2. My own harness + DeepSeek v4 Flash served by DeepSeek, I added $20 quite some time ago and somehow still have $18.77 in there after I don't know how many prompts. I use this pretty often, slightly less than my local setup, it's great and what I'm planning on running locally (eventually).
3. My own harness + OpenRouter with whichever model I want to try out. I use this very rarely.
4. Pi + OpenAI Codex $20 subscription. I don't use this almost at all anymore, but I keep the Codex subscription for testing things out to see how GPT-5.5 will handle a problem the other setups have issues with.
> Why do you trust it with serving full quality?
The only thing I've noticed seems unbearably useless sometimes versus what I noticed before was GPT-5.5 which has had some of the weirdest degradations I've seen. It's not to Anthropic levels but it definitely had some service issues a few times where I was wondering if they had accidentally (or purposefully) lobotomized it.
Everything else has mostly just been the same, except DeepSeek I noticed had some speed issues a few days ago.
> What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps)?
I pretty much only use my own, agents are trivial to make and it's definitely not hard to make one that's better than Claude Code or Codex for whatever you're doing.
mark_l_watson 5 hours ago
I want to say that I agree with you on the value of writing your own coding harness. I wrote something simple in Emacs Lisp and it makes me happy occasionally using it. I am trying to learn Rust and I am working on my own Rust core orchestration layer and I plan on both a Rust command line client and I already have a Python library wrapper for the Rust code that I have written so far. I write a lot of ‘little books’ and I am almost sure to write yet another one on my current hacking project.
Are my little hacks as effective as OpenCode or Claude Code? No way, but I am learning a lot and having fun.
rescbr an hour ago
Z.ai legacy Pro coding plan which will last me until the end of the year + maki.sh as the agent.
OpenCode works fine, i just find it very resource intensive for no good reason.
michimagdesign 18 hours ago
Next to my Claude Pro plan, I have subbed to OpenCode Go. I find the OpenCode UX much better than in Claude Code CLI. As for models, I started a few months ago with GLM 5.1 and it was solid and could archive near sonnet-level tasks. It weirdly sputtered out Chinese characters sometimes. Then I switched to Kimi K2.6, which is the Chinese model I used the most until now. It used way too many reasoning tokens (improved in k2.7). But executed Claude created plans reliably. Now I’m back with GLM 5.2 and it’s really solid (among other things it’s good at design) and I get good usage with the $10 plan. Still the Claude models have less hiccups but the Chinese models are getting really close.
mark_l_watson 5 hours ago
OpenCode Go looked intriguing and I spent time reading their docs and pricing but didn’t purchase services. Do you think they are running it at a loss to get market share? (Probably not.) I have been happy buying tokens directly from DeepSeek (I am retired and everything I do is open source code and writing open content books (the manuscript files are available along with the source code) so I have no privacy issues). I also use FireWorks.ai to try different models. Both API services are excellent, but I may try OpenCode Go for a month or two to support the devs of OpenCode.
pramodbiligiri 27 minutes ago
johndough 10 hours ago
> What provider do you use.
OpenRouter with pinned DeepSeek provider or OpenCode Go > Why do you trust it with serving full quality?
Quality seems good so far. > What harness do you use? Why do you trust it not to have malware (most harnessed are TS apps).
I wrote my own. A minimal harness without dependencies is only 65 lines of Python.gandreani 18 hours ago
I use both the openai subscription and the opencode go subscription. I use the go subscription for my personal work and the openai subscription for my consulting work.
The differences between the models are minimal, but I usually stick with gpt-5.4-mini, gpt-5.4, mimo-pro-2.5, deepseek-v4-pro. These latter ones have way more usage than even using 5.4-mini so I tend to use them in personal projects for that reason.
My harness is https://github.com/can1357/oh-my-pi. I trust it...enough. It updates very frequently so as a safe guard I run it sandboxed with https://github.com/containers/bubblewrap so it can only access the project folder and some whitelisted config files
timcobb 18 hours ago
Thanks. I was looking at open code go yesterday and I couldn't figure out if the base pricing is including usage or if that's just base pricing and then you have to pay for usage too. How does it work? It is very cheap.
arcanemachiner 17 hours ago
chess10kp 2 hours ago
Pi is great, set it up with a system prompt to give the model more direction and think less, and it crushes anything I give it
smoe 18 hours ago
For work, I mostly use Codex and some Claude. For personal use, I’ve started using Chinese models directly through their respective providers, mostly for automation tasks and experiments so far, either via the API directly or through the Pi harness.
I do not trust any of them. Everything runs inside virtual machines, not just the sandboxes provided by the harnesses. I also do not run Claude or Codex directly on the host machine. Not just because of supply chain fears, but also because of how incredibly user hostile the VC funded companies are when it comes to installing random stuff on your machine.
ukuina 14 hours ago
Synthetic.new and Claude Code using GLM-5.2. Great model, but the harness will error out if using subagents. The base plan only allows one concurrent request at a time. Also, GLM will burn through your weekly quota in a day if you're not precise with your scope.
Fr0styMatt88 14 hours ago
Local using Qwen3.6-27B; 2xRTX 5070Ti graphics cards; VS Code with Cline at the moment and Ollama back-end (will get to trying the others soon).
rainmaking 19 hours ago
GLM 5.2 coding plan- I'll post the agent as soon as I can! But opencode works and their own zcode is really good as well.
nullbio 11 hours ago
The idea of an open-weight Mythos model is not scary at all. This space is moving so quickly that it'll looked at in 1-2 years as childs play.
Zopieux 9 hours ago
I don't understand those takes.
Open-weights perhaps, but definitely not self-hostable – since those require $20k+ capex – which is the real "step change" to me, as it ends the stranglehold providers have over censorship.
The only silver lining would be increased competition in API providers of those open-weight models leading to truly affordable prices and a race to remove stupid "safety" checks.
mlmonkey 17 hours ago
Here are the numbers from their bar chart:
1. SWE-bench Pro
Model Score (%)
GLM-5.2 62.1
GLM-5.1 58.4
Claude Opus 4.8 69.2
GPT-5.5 58.6
Gemini 3.1 Pro 54.2
2. Terminal-Bench 2.1
Model Score (%)
GLM-5.2 81.0
GLM-5.1 63.5
Claude Opus 4.8 85.0
GPT-5.5 84.0
Gemini 3.1 Pro 74.0
3. NL2Repo
Model Score (%)
GLM-5.2 48.9
GLM-5.1 42.7
Claude Opus 4.8 69.7
GPT-5.5 50.7
Gemini 3.1 Pro 33.4
4. DeepSWE
Model Score (%)
GLM-5.2 46.2
GLM-5.1 18.0
Claude Opus 4.8 58.0
GPT-5.5 70.0
Gemini 3.1 Pro 10.0
5. ProgramBench
Model Score (%)
GLM-5.2 63.7
GLM-5.1 50.9
Claude Opus 4.8 71.9
GPT-5.5 70.8
Gemini 3.1 Pro 39.5
6. MCP-Atlas
Model Score (%)
GLM-5.2 77.0
GLM-5.1 71.8
Claude Opus 4.8 77.8
GPT-5.5 75.3
Gemini 3.1 Pro 69.2
7. Tool-Decathlon
Model Score (%)
GLM-5.2 48.2
GLM-5.1 40.7
Claude Opus 4.8 59.9
GPT-5.5 55.6
Gemini 3.1 Pro 48.8
8. Humanity's Last Exam
Model Base Score (%) Score w/ Tools (%)
GLM-5.2 40.5 54.7
GLM-5.1 31.0 52.3
Claude Opus 4.8 49.8 57.9
GPT-5.5 41.4 52.2
Gemini 3.1 Pro 45.0 51.4
Seems to be handily beating Gemini 3.1 Pro. What _is_ Google DeepMind doing (other than bleeding talent to A\ ) ?vineyardmike 14 hours ago
> What _is_ Google DeepMind doing
I feel like it has been pretty visible about what’s happening, between their press and products and financial statements. It’s just not what people are accustomed to expect.
First, Google has become a major compute provider for competitors, thanks to TPUs. They’ve talked about allocating TPUs to GCP instead of their first party products. I can only assume it’s because they’re collecting a higher margin, and it covers the cost of data center buildout - which they’ve been aggressively doing. I wouldn’t be surprised if they made the financial decisions to delay or slow training for Gemini 3.5 when they provided last minute compute to Anthropic this spring.
Second, Gemini has very directly not been focused on agentic coding, maybe 3.5 Flash being the change. They’ve built models they can deploy to watch YouTube videos, Nest cameras, scale to AI in search, understand fitness info in Fitbit, etc. They’re very clearly not focused around agentic/coding. They’ve put in a ton of efforts into multimodal data in and out, and they’re the only major lab working on video generation still. There was leak/rumor that their cofounder (brin) was getting involved in the model training to renew focus on agents so maybe this will change, and again 3.5 already feels different.
linzhangrun 11 hours ago
Just waiting for the 3.5 Pro they said would come out this month. Gemini is pretty much useless for any serious work right now.
verdverm 15 hours ago
copying the graphs and tables to HN is noisy and harder to read
JSR_FDED 14 hours ago
Still more helpful than this comment
sibellavia 11 hours ago
While I agree with the post in its entirety, I think it would have been worth mentioning DeepSeek V4 Flash as well, which, in my view, had already reached a sufficient, if not high-level of agentic coding before GLM 5.2 (see DwarfStar).
ramon156 10 hours ago
I know very little about the current state of replacability of Opus but I do sometimes imagine a reality where Opus has been rebuilt as an open model. What plan does Anthropic have when it does happen?
Will they still rent out their own model, will they support the open model and become a resource provider? Will they be able to repay the billions of dollars ?
This is probably the first question I would ask someone from Anthropic, if I ever meet one.
olmo23 5 hours ago
> Will they still rent out their own model, will they support the open model and become a resource provider?
Anthropic rents GPUs from xAI to run Claude. If there's an open weights competitor to Opus, why wouldn't Elon host it directly?
alpineman 6 hours ago
Did you read the article? Opus 4.5 has essentially been rebuilt already
mrngld 5 hours ago
Based on DeepSWE, Opus 4.8 gets you more intelligent output at lower price (GLM's token inefficiency is really biting them). GPT5.5 even moreso. And I don't recall about Opus but GPT is much, much faster at getting you the answer (again, GLM's token inefficiency).
It's neat, I guess, that we can compare them against models released last year, but I care about my options today, and the pareto frontier is about as far away as it ever was.
Add on top of that the extra features OpenAI and Anthropic have in their apps and...
alpineman 2 hours ago
fraywing 17 hours ago
It feels like the gap is closing from an intelligence perspective. Or at least doing some kind of log flattening.
Been playing with GLM 5.2 in different contexts. It's less good if you don't max out thinking, but as xhigh it's been able to solve most problems I was throwing at Opus in the about the same amount of time (via OpenRouter).
Wild time to be alive.
JSR_FDED 14 hours ago
Anecdote, not “research”:
Yesterday I compared Deepseek, Kimi 2.6, MiMo 2.5 and GLM 5.2 for the same task (replace a custom token-based auth scheme with a cookies-based scheme across a front- and back-end codebase).
I used Opencode with the zen subscription to try different models.
All did this perfectly, basically indistinguishable from each other. However, when I pointed out that the new cookies-based auth didn’t allow multiple independent logins across browser tabs (which the previous scheme did allow) I noticed this:
Deepseek, Kimi, MiMo started giving me multiple options but advocating strongly that I should either accept this deficiency, or don’t use the cookies version (keep the old auth scheme). They were so similar it was as if they were all the same model.
Only GLM 5.2 said “here’s how to use cookies and also have tab-level separation”. The difference vs the other models was very stark.
melodyogonna 7 hours ago
American AI labs really need to start releasing good open-weight models.
fabijanbajo 7 hours ago
Agreed. Even just distilled versions of their frontier models would be a huge win for the open ecosystem
themgt 20 hours ago
I just tested GLM 5.2 out via Z.ai in pi for a little one-off project that was already scoped. It actually did a relatively decent job starting out, and figured important things out from context.
But the reasoning traces became increasingly hilarious, with it getting confused and going in loops, doubting itself. I began to feel almost sad, it was like listening to the internal monologue of someone with anxiety disorder.
It made pretty good progress but wound up going in a lot of goofy loops and doing things a bit "off" from standards I'd hoped it would infer, and finally started going a bit nuts, "This is very confusing.", "OH WAIT", seemingly hallucinating a whole side-quest that didn't make sense and looking at making internal system changes to try to achieve its (now very confused) goal when I pulled the plug.
Without seeing the reasoning traces from Claude/GPT it's hard to really know, but it definitely didn't feel like the same quality of reasoning, even if dogged persistence does wind up actually working eventually.
dools 17 hours ago
The reasoning traces always look terrible and they’re frustrating to watch. It’s the same with Kimi. What’s interesting is that the end result is then good. I think it’s just some sort of devils advocate trick to get better output.
rufo 14 hours ago
The reasoning tokens are really just there to extend the amount the LLM can "compute" the problem; put another way, the only way a given model can "think" more about a problem is to fill more of its context with predicted tokens, which has the effect of increasing the accuracy of each token. The reinforcement learning these models go through generally doesn't care what the chain of thought tokens look like (outside of preventing loops/gibberish/reward hacking), only how good the final answer is - so while it does look something like "reasoning" to us and has a rough correlation with the final answer, treating it as actually representative of what the final answer will be or an actual thought process is giving those tokens too much credit :)
fc417fc802 13 hours ago
teravor 14 hours ago
as compared to what though? you can't see the actual think traces for opus or gpt.
dools 13 hours ago
try-working 14 hours ago
thinkslop recursion.
eunos 5 hours ago
I have a hilarious theory why GLM (and Kimi) have this thinkslop,
apparently Chinese language as token is more information dense than English, so having these wasteful thinkslop in Mandarin isnt that damaging. So the developer focus mostly in Mandarin and didnt think of handling these thinkslop while American AI labs do.
jauntywundrkind 19 hours ago
I think the self-doubt might actually be a very crucial part of it's capability. I often feel compelled to interrupt when I'm watching it think (which thank the stars it let's us do, unlike the big American models!!), but usually it makes the right pick!
Being willing and able to reconsider seems very good. Going around and around, pulling in more thinking, integrating it: maybe that's why it is as good as it's good.
I want to emphasize again how excellent it is that we can see the thinking. I think this makes GLM so much better an experience for me. It gives me such insight into what is being considered, helps me see where things go wrong. It grounds me, gives me the notion of where the results come from. It was so jarring to switch to GPT and Opus and find that they won't discuss with me, won't reveal their thinking: that feels fundamentally unsafe, for me, for society, to have such a severe black box. I don't think it should be allowed, honestly.
Many thanks to this recent submission, which is the first time I've seen anyone blog about this core difference: The text in Claude Code’s “Extended Thinking” output is not authentic. https://patrickmccanna.net/the-text-in-claude-codes-extended... https://news.ycombinator.com/item?id=48630535
wuhhh 19 hours ago
Your post made me laugh because I experienced the same as you but the other way around. I switched from Claude to a multi model harness a couple of days ago and the first model I tried was GLM5.2.
I gave it some simple code porting exercises and watched dumbfounded at the reasoning, which was more like the ravings of a lunatic - but lo and behold, after much confusion and a dizzying number of eureka moments the task was completed very successfully.
I tried Kimi on a similar task, much faster, a little more reassuring somehow in its ramblings, also surprisingly good results.
To be clear, I’m not surprised the results were good because they’re not GPT or Claude, but because the line of reasoning was so bonkers. Coming from Claude, I was just not used to seeing this, but I’ll bet it’s just as nuts with the frontier models and we’re just not allowed to see it (I’m about to read the links you shared).
Agree wholeheartedly that transparency is of grave importance.
nl 17 hours ago
rainmaking 19 hours ago
neosat 18 hours ago
I've been using GLM 5.2 recently (company hosted, for non-coding tasks) and it's been strong and reliable. There are areas where GPT 5.5 and Opus 4.x still feel marginally better but only marginally. For most tasks if GLM 5.2 is the only model I have to use I'm productive and happy. This was not true before GLM 5.2. No doubt in my mind that the gap is closing quickly and for most tasks that are not very specialized open models will be usably on par on flagship closed models and have an edge factoring in cost.
For coding I still use 5.5 w/ Codex and prefer that to other models + harness combinations.
GL26 3 hours ago
if someone has any tutorial on how to run GLM-5.2 from a Rasberry Pi 5 (AI hat), I want it !
efficax 2 hours ago
GLM-5.2 is a huge model. I don't think it would fit on the AI HAT+ 2 even if you quantized it to 2 bits
seany 17 hours ago
What's the current best for ablation? Specifically chemistry and red-team/netsec?
forsalebypwner 14 hours ago
ime DeepSeek v4 Pro is great for cybersec/netsec, I have not tried GLM though
NovaCode37 9 hours ago
Honestly, glm is staying quiet close to claude but it can save tons of tokens either than anthropic model
yogthos 14 hours ago
It's by far the most competent open model I've tried yet. It's a bit slower than Claude, but in terms of coding capability it seems to get comparable results at least for the work I'm doing.
newaccountman2 17 hours ago
5.1 and Qwen 3.6 are great too IMO
nubg 4 hours ago
A question I always have is, how to the AI labs safeguard the leak of their model? Training a cutting edge model basically cost a minimum of hundreds of millions of dollars. And its all contained within a file. Okay, that file might be 500GB large, but its still just one blob that is worth almost a billion dollars. And they need to train new models every few weeks, have lots of people with access to it to debug it, run inference etc. I wonder when we will see the first leaks? Imagine if e.g. Opus 4.8 got leaked. Wouldnt that bankrupt Anthropic?
alfiedotwtf 8 hours ago
Once open Chinese models look like they’re about to overtake closed US models, watch the US government push imperialism hidden behind increasingly hyperbolic national security concerns.
At the end of the day, open weights should be seen as nothing more than information (just more just numbers afterall), and so organisations like the EFF should sue for any restricting of the 1st Amendment
dools 17 hours ago
Is z.ai
Is 2 better than x.ai
citizenpaul 19 hours ago
Ive been using glm5 since its release and still prefer it to glm5.1 and so far to glm5.2
Perhaps it is just my harness and workflow, but the older model still seems to work better. Also the token cost is significantly lower. I rarely spend more than $20 a week with $50 cap. Not even half claudes ambiguous minimum $200 a month plan.
rainmaking 18 hours ago
Now that's a tremendous pointer, I'm going to have to try that.
Do you full on let GLM5 get stuff done on its own or is it more like a guided workflow? The former's what the point releases doubled down on and is also something that uses a lot of juice.
Balinares 2 days ago
I can't help wondering what kind of models we'll see coming out of China once it gets its own chip fabs up and running. Right now it sounds like the US's export ban is not slowing them down a whole lot.
khurs 4 hours ago
>Right now it sounds like the US's export ban is not slowing them down a whole lot.
Just costing them a lot more money as they pay multiples more buying on the underground grey market.
ceejayoz 20 hours ago
> Right now it sounds like the US's export ban is not slowing them down a whole lot.
It may wind up being a massive boost to them in the long run, even.
Necessity is the mother of invention.
pkroll 20 hours ago
If this pans out, you're not at all kidding: https://www.youtube.com/watch?v=8ekndZwyOzo
verdverm 15 hours ago
Trump allowed more advanced chips (H200s) to be sold after his visit, because some people in the admin still believe the US can "addict" China to the hardware. It seems China is only letting a token few in, the ban is more on their side now, as Xi really wants indiginous capability.
pianopatrick 17 hours ago
There does not seem to be a big penalty for going slow anyways. People seem to just switch on cost as soon as a model can do a task well enough. There do not seem to be strong network effects or vendor lock in.
Seems to me that going slow is the better long term tactic. China can just let the USA pay the high R&D costs to figure out what works, then just copy what works.
briga 17 hours ago
With subsidization from the Chinese government they will probably be equal to or better than the models here. I mean, have you looked at the author list of any given AI paper published within, say, the past 5 years? I wouldn't be surprised if half or more AI researches are from China.
buzzin__ 17 hours ago
Can you compare the amount to the USA subsidization? Which one is bigger? Per Capita? Per unit of economic growth achieved?