Something is afoot in the land of Qwen (simonwillison.net)

413 points by simonw 6 hours ago

sosodev 6 hours ago

I really hope this doesn't hinder development too much. As Simon says, Qwen3.5 is very impressive.

I've been testing Qwen3.5-35B-A3B over the past couple of days and it's a very impressive model. It's the most capable agentic coding model I've tested at that size by far. I've had it writing Rust and Elixir via the Pi harness and found that it's very capable of handling well defined tasks with minimal steering from me. I tell it to write tests and it writes sane ones ensuring they pass without cheating. It handles the loop of responding to test and compiler errors while pushing towards its goal very well.

misnome 4 hours ago

I've been playing with 3.5:122b on a GH200 the past few days for rust/react/ts, and while it's clearly sub-Sonnet, with tight descriptions it can get small-medium tasks done OK - as well as Sonnet if the scope is small.

The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked, and I find it has stripped all the preliminary support infrastructure for the new feature out of the code.

storus 2 hours ago

> to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked

That's likely coming from the 3:1 ratio of linear to quadratic attention usage. The latest DeepSeek also suffers from it, something the original R1 never exhibited.

sheepscreek 3 hours ago

That sounds awfully similar to what Opus 4.6 does on my tasks sometimes.

> Blah blah blah (second-guesses its own reasoning half a dozen times, then goes). Actually, it would be simpler to just ...

Specifically on Antigravity, I've noticed it doing that while trying to "save time" to stay within some artificial deadline.

It might have something to do with the system messages and the reinforcement/realignment messages that are interwoven into the context (but never displayed to end-users) to keep the agents on task.

shaan7 3 hours ago

> that it would be "simpler" to just... not do what I asked

That sounds too close to what I feel on some days xD

reactordev 4 hours ago

Turn down the temperature and you’ll see less “simpler” short cuts.
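
Temperature scales the logits before sampling, so lowering it concentrates probability mass on the model's top-ranked continuation. A minimal sketch of the mechanism (the helper name is mine, not from any inference library):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index after temperature-scaling the logits.

    Lower temperature sharpens the softmax, so the top-ranked
    ("stick to the instructions") continuation wins more often;
    higher temperature flattens it, making off-script detours likelier.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = random.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(exps) - 1
```

At a temperature like 0.05 the top logit is chosen essentially every time; at 5.0 the same logits yield a near-uniform spread, which is where the creative "shortcuts" come from.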

Aurornis an hour ago

> The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked,

This is my experience with the Qwen3-Next and Qwen3.5 models, too.

I can prompt with strict instructions saying "** DO NOT..." and it follows them for a few iterations. Then it has a realization that it would be simpler to just do the thing I told it not to do, which leads it to the dead end I was trying to avoid.

Twirrim 5 hours ago

I've been testing the same with some Rust, and it has spent a fair bit of time going through a seemingly infinite loop before finally unjamming itself. It seems a little more likely to jam up than some other models I've experimented with.

It's also driving itself crazy with the deadpool & deadpool-r2d2 crates that it chose during the planning phase.

That said, it does seem to be doing a very good job in general, the code it has created is mostly sane other than this fuss over the database layer, which I suspect I'll have to intervene on. It's certainly doing a better job than other models I'm able to self-host so far.

Aurornis 5 hours ago

> it has spent a fair bit of time going through a seemingly infinite loop before finally unjamming itself.

I think this is part of the model’s success. It’s cheap enough that we’re all willing to let it run for extremely long times. It takes advantage of that by being tenacious. In my experience it will just keep trying things relentlessly until eventually something works.

The downside is that it’s more likely to arrive at a solution that solves the problem I asked but does it in a terribly hacky way. It reminds me of some of the junior devs I’ve worked with who trial and error their way into tests passing.

I frequently have to reset it and start it over with extra guidance. It’s not going to be touching any of my serious projects for these reasons but it’s fun to play with on the side.

sosodev 5 hours ago

Some of the early quants had issues with tool calling and looping. So you might want to check that you're running the latest version / recommended settings.

misnome 4 hours ago

> and it has spent a fair bit of time going through a seemingly infinite loop before finally unjamming itself

I can live with this on my own hardware. Opus 4.6, on the other hand, has developed a tendency to happily chew through the entire 5-hour allowance on the first instruction, going in endless circles. I’ve stopped using it for anything except the extreme planning now.

cbm-vic-20 3 hours ago

I don't know much about how these models are trained, but is this behavior intentional (ie, the people pulling the levers knew that this is how it would end up), or is it emergent (ie, pulling the levers to see what happens)?

anana_ 3 hours ago

I've had even better results using the dense 27B model -- less looping and churning on problems

abhikul0 5 hours ago

Are you running it locally with llama.cpp? If so, is it working without any tweaking of the chat template? The tool calls fail for me when using the default chat template; however, it seems to work a whole lot better with this: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/discussions/9#69...

arcanemachiner 4 hours ago

Have you tried the '--jinja' flag in llama-server?

nu11ptr 5 hours ago

What hardware do you have it running on? Do you feel you could replace the frontier models with it for everyday coding? Would/will you?

sosodev 2 hours ago

Around 20ish tokens a second with 6-bit quant at very long context lengths on my AMD AI Max 395+
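
The fit of a 6-bit quant on a unified-memory box checks out as back-of-envelope arithmetic: weight footprint is parameter count times bits per weight. A sketch (the helper name is mine; it ignores KV cache and runtime overhead):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight-only memory footprint in GB; ignores KV cache and overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# All 35B parameters of Qwen3.5-35B-A3B must stay resident, even though
# only ~3B are active per token:
for bits in (16, 8, 6, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(35e9, bits):.1f} GB")
# 6-bit lands around 26 GB of weights, leaving headroom for long contexts.
```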

I’m trying to use local models whenever possible. Still need to lean on the frontier models sometimes.

politelemon 4 hours ago

60 to 70 on a 5080, but only tinkering for now. The smaller models seem exceptionally good for what they are, and some can even do OCR reliably.

bigyabai 5 hours ago

I'm getting ~30 tok/s on the A3B model with my 3070 Ti and 32k context.

> Do you feel you could replace the frontier models with it for everyday coding? Would/will you?

Probably not yet, but it's really good at composing shell commands. For scripting or one-liner generation, the A3B is really good. The web development skills are markedly better than Qwen's prior models in this parameter range, too.

whalesalad an hour ago

What hardware are you running this on?

paoliniluis 6 hours ago

what's your take between Qwen3.5-35B-A3B and Qwen3-Coder-Next?

sosodev 6 hours ago

In my experience Qwen3.5 is better even at smaller distillations. From what I understand the Qwen3-next series of models was just a test/preview of the architectural changes underpinning Qwen3.5. So Qwen3.5 is a more complete and well trained version of those models.

kamranjon 5 hours ago

In my experience Qwen 3 Coder Next is better. I ran quite a few tests yesterday and it was much better at utilizing tool calls properly and understanding complex code. For its size, though, 3.5 35B was very impressive. Coder Next is an 80B model, so I think it's just a size thing. Also, for whatever reason, Coder Next is faster on my machine. The only model that is competitive in speed is GLM 4.7 Flash.

karmakaze 5 hours ago

We don't have a Qwen3.5-Coder to compare with, but there is a chart comparing Qwen3.5 to Qwen3 including Qwen3-Next[0].

[0] https://www.reddit.com/r/LocalLLaMA/comments/1rivckt/visuali...

a3b_unknown 5 hours ago

What is the meaning of 'A3B'?

simonw 5 hours ago

It's the number of active parameters for a Mixture of Experts (misleading name IMO) model.

Qwen3.5-35B-A3B means that the model itself consists of 35 billion floating point numbers - very roughly 35GB of data - which are all loaded into memory at once.

But... on any given pass through the model weights only 3 billion of those parameters are "active" aka have matrix arithmetic applied against them.

This speeds up inference considerably because the computer has to do fewer operations for each token that is processed. It still needs the full amount of memory, though, as the 3B active parameters it uses are likely different on every iteration.
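
The routing described above can be sketched in a few lines. This is a toy top-k MoE layer, not Qwen's actual implementation (names and shapes are purely illustrative):

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Toy Mixture-of-Experts layer: run only the top-k experts per token.

    All expert weights stay resident in memory, but matrix arithmetic is
    applied only to the k experts the router selects -- the "active" subset.
    """
    scores = x @ router_w                 # router logits, one per expert
    top_k = np.argsort(scores)[-k:]       # indices of the k best experts
    gates = np.exp(scores[top_k])
    gates /= gates.sum()                  # softmax over the selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top_k))

# Tiny demo: 8 experts, only 2 active for this token.
rng = np.random.default_rng(0)
d = 16
experts = [rng.normal(size=(d, d)) for _ in range(8)]
router_w = rng.normal(size=(d, 8))
token = rng.normal(size=d)
out = moe_layer(token, router_w, experts, k=2)
```

A different token will generally light up a different subset of experts, which is why the full weight set has to stay loaded even though each pass touches only a fraction of it.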

hintymad 4 hours ago

There has reportedly been tension between Qwen's research team and Alibaba's product team over the Qwen App. And recently, Alibaba tried to impose DAU as a KPI. It's understandable that a company like Alibaba would force a change of product strategy for any number of reasons. What puzzles me is why they would push out the key members of their research team. Doesn't the industry have a shortage of model researchers and builders?

cmrdporcupine 3 hours ago

Perhaps they wanted future Qwen models to be closed and proprietary, and the authors couldn't abide by that.

softwaredoug 6 hours ago

I wonder how a US lab hasn't dumped truckloads of cash into various laps to ensure these researchers have a place at their lab

gaoshan 5 hours ago

ICE has been detaining Chinese people in my area (and going door to door in at least one neighborhood where a lot of Chinese and Indians live). I was hearing about this just last week as word spread amongst the Chinese community here (Ohio) to make sure you have some legal documentation beyond just your driver's license on you at all times for protection. People will hear about this through the grapevine and it has a massive (and rightly so) chilling effect. US labs can try, but with the US government behaving like it is, I don't think they will have much luck.

*edit: not that it matters, but since MAGA can't help but assume, these are all US citizens and green card holders that I am referring to.

bobthepanda 5 hours ago

Yeah, the Hyundai factory fiasco kind of dashed the idea that the enforcement would spare people working in favored industries setting up in the US.

jiggawatts an hour ago

"Papers, please." comes to the US of A.

ljsprague 4 hours ago

Are the people being detained in the country illegally?

sourcegrift 5 hours ago

Yes. Yes, so true. And the phd types building these models are probably even scared in China that ICE will fly there to deport them.

velcrovan 5 hours ago

What the US has done is dumped truckloads of cash to make it likely that as a legal immigrant you will be abducted and sent to a camp.

riddlemethat 5 hours ago

This is FUD. The US has dumped truckloads of cash to make it likely that masked men with no cameras and little training will parade around abducting anyone they even suspect of being an illegal immigrant, after even Yale admitted it's likely that more than 22M people came here illegally. https://insights.som.yale.edu/insights/yale-study-finds-twic...

It'd be good if Congress could do something to remove the masks, put cameras on these agents, and for the local governments to stop fighting removal of all people who are here illegally so we can pretend we have borders again.

seanmcdirmid 16 minutes ago

They already kind of do, but I think anyone who was into US money has already left for it, and the money China is throwing at the problem is pretty good also. You can also have a lot more influence in a Chinese company without having to adopt a weird new American corporate culture.

mft_ 6 hours ago

Indeed; or, Europe badly needs a competitive model to hedge against US political nonsense.

ivan_gammel 5 hours ago

Offering a "You are welcome" relocation package to Anthropic might be a good idea.

mijoharas 2 hours ago

It'd be great if they went to Mistral!

tiahura 5 hours ago

Competitive models are illegal in the EU.

ecshafer 5 hours ago

China is also giving them dump trucks full of cash though. Plus you have to contend with the nationalism reason (unfortunately this has died off in America for too many). The idea of building your country is valued by most Chinese I have met. Plus China is incredibly nice to live in, especially if you have lots of money and/or connections. So you can work in China, get paid lots of money, and feel like you are doing good. Or in America you can get paid lots of money, and get yelled at by people online because the Government wants to use your model.

danny_codes 5 hours ago

China city life is amazingly convenient. Trains and subways are just such an enormous quality of life boost. Add to that the relative cleanliness of having nearly zero homelessness and you’ve got something very compelling.

I will say we are winning in accessibility. China doesn’t have much of a ramp game

1024core 4 hours ago

I got an offer out of the blue for a consulting gig in ML, offering USD 400/hr in China. Assuming this was legit (the offeror seemed legit), it looks like China is also throwing a lot of Benjamins around...

petcat 5 hours ago

> Or In America you can get paid lots of money, and get yelled at by people online because the Government wants to use your model.

Isn't it just straight-up illegal in China to refuse the government from using your model? USA isn't perfect, but at least it has active discourse.

leptons 3 hours ago

Chinese people are very racist towards non-Chinese. It might seem like a happy utopia, but if you aren't Chinese, then you may not really enjoy your time there. It may not be quite as bad as being black in rural US south, but being black (or anything non-Chinese) in China is still not going to be a good time.

maxglute 2 hours ago

> get yelled at by people online because the Government wants to use your model

Well duh, as recently demonstrated, a US model used by the US gov will 100% end up murdering actual children sooner rather than later, in this case less than a calendar year, in some far-flung war that many Americans do not support. Alternatively, a PRC model used by the CCP might kill in some hypothetical future, but for national reunification/rejuvenation that many Chinese support. At the end of the day, researchers and population on one side sleep more soundly.

VWWHFSfQ 4 hours ago

> China is incredibly nice to live in

I'm sure it's a very nice place to live if you're content to just stay quiet in society and never put a political sign in your yard or even just talk about the wrong thing with your friend in a WeChat.

jamespo 5 hours ago

Damn that social conscience, huh?

expedition32 an hour ago

If memory serves, the father of the Chinese bomb studied in America and went back. It may be inconceivable to Americans, but Chinese patriotism exists.

Besides you can live a comfortable life in PRC nowadays or live in a racist America.

mmaunder 4 hours ago

Yeah, my first thought was that it's a tit-for-tat poach. They got the Gemini researcher, so Google responded in kind.

lynndotpy 3 hours ago

Well, the problem aren't just the NSF funding cuts. Everyone else is already dumping truckloads of cash. There's also the public health situation (who wants measles or polio?), the risk of retaliatory attacks from the countries we're at war with, etc. You could write paragraphs about why the US is less attractive to researchers.

When I was a deep learning PhD in the first Trump administration, US universities were already very deeply affected by the Muslim ban, and so a lot of talent ended up in other countries.

Sibling commentators are rightfully pointing out that foreigners, especially those who would not be recognized as white, face an onerous and risky customs process with long-term and increasing risks of deportation. When you see a headline like the NIST labs abruptly restricting foreign scientists, _everything_ else feels uncertain. Even if someone doesn't believe they're personally at risk for deportation, they're still seeing everything else.

And then it all boils down to a reputational thing. The era where we were the top choice for research is in the past. If a PhD started in the US during this era is on your resume, you might be anticipating how you'll answer the question of why you weren't good enough to get accepted somewhere better.

bilbo0s 6 hours ago

They probably have tried, but you have to have more cash than those researchers feel they can get starting their own lab. When you consider the fact that their new startup lab would have the entire nation of China as, in effect, a captive market, you start to see how almost any amount of money would be too little to convince them not to make a run at that new startup. If money is their aim.

I think Alibaba needs to just give these guys a blank check. Let them fill it in themselves. Absent that, I'm pretty sure they'll make their own startup.

I do think it'd be a big loss for the rest of the world though if they close whatever model their startup comes up with.

simgt 5 hours ago

> I do think it'd be a big loss for the rest of the world though if they close whatever model their startup comes up with.

That's very likely to happen once the gap with OpenAI/Anthropic has been closed and they managed to pop the bubble.

bobthepanda 5 hours ago

vicchenai an hour ago

Been running the 32B locally for a few days and honestly surprised how well it handles agentic coding stuff. Definitely punches above its weight. Only complaint is it sometimes decides to ignore half your prompt when instructions get long, but at this size I guess that's the tradeoff.

lzaborowski 2 hours ago

One thing I’ve noticed with local models is that people tolerate a lot more trial and error behavior. When a hosted model wastes tokens it feels expensive, but when a local model loops a bit it just feels like it’s “thinking.”

If models like Qwen can get good enough for coding tasks locally, the real shift might be economic rather than purely capability.

trvz 8 minutes ago

Wasted tokens are preferred for local models, I need the GPU mainframe in my bedroom to heat it as I live in a third world country with unreliable heating (Switzerland).

airstrike 6 hours ago

I'm hopeful they will pick up their work elsewhere and continue on this great fight for competitive open weight models.

To be honest, it's sort of what I expected governments to be funding right now, but I suppose Chinese companies are a close second.

skeeter2020 6 hours ago

Getting a bit of whiplash going from "AI is replacing people" to "AI is dead without (these specific) people." Surely we're far enough ahead that AI can take it from here?

Wild times!

janalsncm 4 hours ago

Anthropic has one nine of uptime right now. One.

https://status.claude.com/

If AI could effectively replace people, you wouldn’t need CEOs to keep trying to convince people.

Jeremy1026 an hour ago

Anthropic also fires off the alarm bells seemingly at any sign of issue. I've personally only noticed an outage once, and the status page wasn't even showing it as down at that time. It eventually did update about 45 minutes later, then I was back up and running another 15 minutes later but the "outage" on the status page stayed up for another hour or so.

Probably good to send alerts early, but they might be going out a bit too early.

OsrsNeedsf2P 3 hours ago

That's 99%, which is two nines?

kylemaxwell 2 hours ago

Everything on that page has two nines, so not sure what you're trying to say here.
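
For reference, the "nines" being argued about map to concrete downtime budgets (a quick sketch; the helper name is mine):

```python
def max_downtime_minutes(nines: int, days: int = 30) -> float:
    """Allowed downtime per `days`-day window at N nines of availability."""
    availability = 1 - 10 ** (-nines)
    return days * 24 * 60 * (1 - availability)

# Over a 30-day month:
#   1 nine  (90%)   -> 4320 min (3 days)
#   2 nines (99%)   -> 432 min (~7.2 hours)
#   3 nines (99.9%) -> 43.2 min
```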

mungoman2 3 hours ago

Not sure what the uptime is meant to signal. People have quite low uptime as well…

px43 3 hours ago

9% uptime?

vidarh 6 hours ago

Who is suggesting "AI is dead without (these specific) people"? People are wondering what it means specifically for the Qwen model family.

mhitza 6 hours ago

We've gone from AGI goals to short-term thinking via Ads. That puts things better in perspective, I think.

dude250711 4 hours ago

Claude is incapable of producing a native application for itself, and is bad enough with web ones to justify Anthropic acquiring Bun.

quantum_state 5 hours ago

I would second that Qwen3.5 is exceptionally good. In a calibration run, the 35B variant running locally on a 24GB Ada NextGen card with easy-llm-cli did the same tasks as gemini-cli + Gemini 3 Pro, and they were at par … really impressive, and it ran pretty fast …

vardalab 3 hours ago

A q4 quant gives you 175 tok/s generation and 7K tok/s prompt processing, which beats most cloud providers.

zoba 6 hours ago

I tried the new qwen model in Codex CLI and in Roo Code and I found it to be pretty bad. For instance I told it I wanted a new vite app and it just started writing all the files from scratch (which didn’t work) rather than using the vite CLI tool.

Is there a better agentic coding harness people are using for these models? Based on my experience I can definitely believe the claims that these models are overfit to Evals and not broadly capable.

sosodev 6 hours ago

I've noticed that open weight models tend to hesitate to use tools or commands unless they appeared often in the training or you tell them very explicitly to do so in your AGENTS.md or prompt.

They also struggle at translating very broad requirements to a set of steps that I find acceptable. Planning helps a lot.

Regarding the harness, I have no idea how much they differ but I seem to have more luck with https://pi.dev than OpenCode. I think the minimalism of Pi meshes better with the limited capabilities of open models.

malwrar 3 hours ago

+1 to this, anecdotally I’ve found in my own evaluations that if your system prompt doesn’t explicitly declare how to invoke a tool and e.g. describe what each tool does, most models I’ve tried fail to call tools or will try to call them but not necessarily use the right format. With the right prompt meanwhile, even weak models shoot up in eval accuracy.

vardalab 3 hours ago

Have a frontier-lab model do the plan, which is the most time-consuming part anyway, and then have a local LLM do the implementation. The frontier model can orchestrate your tickets, write a plan for them, and dispatch local LLM agents to implement at about 180 tokens/s; vLLM can probably manage something like 25 concurrent sessions on an RTX 6000. Do it all in worktrees and then have the frontier model do the review and merge.

I am just a retired hobbyist, but that's my approach. I run everything through Gitea issues; each issue gets launched by the orchestrator in a new tmux window, and the two main agents (implementer and reviewer) get their own panes so I can see what's going on. I think Claude Code now has this aspect somewhat streamlined as well, but I have seen no need to change my approach yet since I am just tinkering on my personal projects. Right now I use Claude Code subagents, but I have been thinking of replacing them with some of these Qwen 3.5 models because they do seem capable and I have the hardware to run them.

Tepix 2 hours ago

What is "the new qwen model"? There are a dozen and you can get them in a dozen different quantizations (or more) which are of different quality each.

lreeves 2 hours ago

In my experience Qwen3.5/Qwen3-Coder-Next perform best in their own harness, Qwen-Code. You can also crib the system prompt and tool definitions from there, though. One caveat: despite the Qwen models being the state of the art for local models, they are like a year behind anything you can pay for commercially, so asking one to build a new app from scratch might be a bit much.

lacoolj 3 hours ago

I wonder if an American company poached one or all of them. They've been at pretty much the bleeding edge of open models, and it would not surprise me if Amazon or Google snatched them up.

ferfumarma 3 hours ago

It would surprise me if they're willing to come to the US in the setting of the current DHS and ICE situation.

ilaksh 5 hours ago

Does anyone know when the small Qwen 3.5 models are going to be on OpenRouter?

yorwba 5 hours ago

There are smaller ones on HuggingFace https://huggingface.co/models?other=qwen3_5&sort=least_param... with 0.8B, 2B, 4B and 9B parameters.

ilaksh 5 hours ago

Like 4B, 2B, 9B. Supposedly they are surprisingly smart.

nurettin 3 hours ago

I am singularly impressed by 35B-A3B; I hope that is not the reason he had to leave.

raffael_de 6 hours ago

> me stepping down. bye my beloved qwen.

the qwen is dead, long live the qwen.

w10-1 2 hours ago

It sounds like the lead was demoted to attract new talent, quit as a result, and the rest of the team also resigned to force management to change their minds.

If so, I'm happy that the team held together, and I hope that endogenous tech leads get to control their own career and tech destiny after hard work leads to great products. (It's almost as inspiring as tank man, and the tank commanders who tried to avoid harming him...)

(ducking the downvote for challenging the primacy of equity...)

hwers 6 hours ago

My conspiracy-theory hat says that investors who also have a stake in OpenAI are somehow sabotaging it, like they did when kicking Emad out of Stability AI.

storus 4 hours ago

More likely some high-ranking party member's nepo baby from Gemini sniffed success with Qwen and the original folks just walked away as their reward disappeared.

ahmadyan 3 hours ago

source?

liuliu 4 hours ago

Apples vs. oranges. The latter is true: Emad did get sabotaged (for not being able to raise money in time, about 8 months before he left). Junyang didn't have that long an arc of incidents.

vonneumannstan 6 hours ago

Were they kneecapped by Anthropic blocking their distillation attempts?

zozbot234 3 hours ago

What Anthropic was complaining about is training on mass-elicited chat logs. It is very much a ToS violation (you aren't allowed to exploit the service for the purpose of building a competitor), so the complaint is well-founded, but (1) it's not "distillation" properly understood, since you have no access to the actual weights; it can only feasibly extract the same kind of narrow knowledge you'd read out of chat logs, perhaps including primitive "let's think step by step" output (which is not true fine-tuned reasoning tokens); and (2) it's something Western AI firms are very much believed to do to one another and to Chinese models all the time anyway. Hence the brouhaha about Western models claiming to be DeepSeek when they answer in Chinese.

red2awn 3 hours ago

The "distillation attacks" are mostly using Claude as LLM-as-a-judge. They are not training on the reasoning chains in a SFT fashion.

zozbot234 3 hours ago

kartika848484 4 hours ago

what the hell, their models were promising tho

multisport 6 hours ago

inb4 qwen is less of a supply chain risk than anthropic