Qwen3.7-Max: The Agent Frontier (qwen.ai)
582 points by kevinsimper 13 hours ago
goldenarm 10 hours ago
The non-hallucination rate in AA-omniscience is SOTA, better than Opus 4.7, Gemini 3.1 Pro and GPT5.5! Congrats to the team
girvo 30 minutes ago
The big question for me having used a lot of these SOTA chinese models is: what is its token efficiency like?
Running Step 3.5 Flash locally for example, it's an amazingly capable model all things considered, but it's token efficiency is so bad that it gets out performed by most others wall-clock time (even with my MTP-support for it hacked in to llama.cpp: despite being trained on three heads, MTP 2 is the sweet spot, and only gets it from 20tk/s to 30tk/s on my Spark)
The DeepSeek models and Qwen 3.5 Plus are also good examples of this: compared to Opus, and especially GPT 5.5 they use many more tokens to get to the same answers.
I'm really hoping that Qwen 3.7 is better in this regard, can't wait to try it out
(ps. running DeepSeek v4 Flash on my Spark is absolutely wild, thanks antirez if you see this haha)
throawayonthe 9 hours ago
referencing this:
https://artificialanalysis.ai/evaluations/omniscience?models...
(had to add it to the chart, wasn't displayed by default. is it the lowest rate in the datasetor no?)
jampekka 4 hours ago
This counts only incorrect answers though. A model can get 0% hallucination rate just by refusing to answer all questions.
ffsm8 3 hours ago
jug an hour ago
speed_spread 3 hours ago
gslepak 8 hours ago
> The non-hallucination rate in AA-omniscience is SOTA
Note that a perfect "non-hallucination rate" is rather meaningless as such tests can contain human hallucinations.
It means the model aligns with the possibly-true, possibly-false beliefs of the group that made the test.
rlt 7 hours ago
Well, yes, garbage in garbage out. That's a given and not what's meant by "hallucination" in this context.
tantaman 4 hours ago
jcheng 6 hours ago
Here are some examples of the questions in the benchmark. If these are representative, they seem pretty cut and dry. https://artificialanalysis.ai/evaluations/omniscience#exampl...
areweai 3 hours ago
Was there something about this specific model and submission that made you feel compelled to write this self-evident observation?
Or would you describe your methodology as more like picking a random sentence fragment as an input value then generating completions from your existing corpus without any post-input "learning" process related to the rest of the source material?
sheepscreek 8 hours ago
Truly incredible! Very impressed by their progress. I wonder how much of their own chips did they use for training.
baq 8 hours ago
wonder at which level there's a capability state transition? 5%? 1%?
briga 8 hours ago
I was getting dangerously close to my weekly Claude Code limit last night so I had Claude set up Qwen3.6 with llama.cpp and OpenCode. Honestly it's a great (free!) alternative to Claude Code--certainly more than good enough for a lot of smaller less complex tasks. I'm excited to try this new version. The fact that open-source models are so close to the frontier is very impressive.
pixelesque 6 hours ago
Out of interest, what machine and model are you running it on?
I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.
What sort of speed should I be expecting?
I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.
Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)
I'm not expecting it to be instant, but what I'm currently seeing is not really usable.
gcr 6 hours ago
There are two flavors of Qwen 3.6:
- A 27B "dense" model
- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.
For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max with 32GB VRAM from 2021 that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. The 27B model can read ~70tok/sec and write ~5tok/sec.
The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work. It's not quite as "smart", but it's comparable.
flockonus 3 hours ago
pixelesque 5 hours ago
julianlam 5 hours ago
DiabloD3 3 hours ago
I recommend sticking with the dense models for both Qwen and Gemma.
On testing I've done on same-quant apples to apples, with F16/F16 (ie, unquantized) kv cache, 35B-A3B underperforms against 27B on anything even remotely complex. But yes, 35B-A3B can be like 3-4x faster on my hardware.
By Qwen's own admission, on any meaningful benchmark (ie, ones that involve logic, math, or tool calling), 27B performs like 122B-10B and 397B-A17B, but 35B-A3B is somewhere between 27B dense and 9B dense.
Also, MTP recently got merged in, so I'd suggest downloading Qwen 3.6 MTP (I assume you get it from unsloth) and updating your copy of llama.cpp, and adding `--spec-type draft-mtp --spec-draft-n-max 2` to your arguments.
https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/ https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/
Also, I recommend not quantizing kv cache, and if you do, only quantize v. Lowering model quant while also lowering context size to fit F16/F16 or F16/Q8_0 massively improves model performance for thinking models. Also, quantizing cache, either k or v, decreases speed by a lot on some hardware.
I have a 24gb 7900xtx, so I can fit >32k F16/F16 context with Qwen3.6-27B, but use unsloth's Q3_K_XL. This performs better than Q(4,5,6)_K_XL with v quantized.
Edit: Oh, and since I mentioned Gemma 4, my testing mirrors my Qwen 3.5/3.6 experiences, 26B-A4B performs worse than 31B, but is also way faster. llama.cpp doesn't support Gemma 4's MTP style yet, so both could get even faster.
booty 4 hours ago
I tried the qwen3.6-27b Q6_k GUFF in llama.cpp
and LM Studio on my M2 MacBook Pro 32GB machine
last week, and I barely get a token a second with either.
The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance.(not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot)
Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.
mark_l_watson 4 hours ago
satvikpendem 4 hours ago
Check out Unsloth Studio it provides MTP support now which 2x the token generation speed with no loss of accuracy: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
mft_ 6 hours ago
The 27B model is dense, so is relatively slow. The 35B-A3B model is marginally weaker but being MoE is much faster - like ~4-8x faster in basic benchmarks on my M1 Max.
For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:
Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).
Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
stebalien an hour ago
pixelesque 5 hours ago
electroglyph 12 minutes ago
you should be using dflash with that model, look it up
127 an hour ago
I get 150t/s peak, 120t/s avg with Qwen3.6 27B Q4 with a 4090 on Linux. Now that MTP has landed into llama.cpp.
Figs 6 hours ago
27B is the dense one. Try the Qwen3.6-35B-A3B variants for the MoE release. That's what I'm running on a Framework Desktop and I get ~50 tok/s plus or minus a few. The dense one is similarly slow for me -- not sure what to expect on your hardware from the MoE but it should probably be much faster.
pixelesque 5 hours ago
KronisLV 6 hours ago
> qwen3.6-27b Q6_k
That's the dense model, you probably want a mixture-of-experts (MoE) one.
Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
pixelesque 5 hours ago
dzr0001 4 hours ago
My token throughput is much better using vLLM-mlx on my M2 ultra than llama.cpp. It might be worth a shot to give it a try.
plufz 8 hours ago
Which exact model are you using? And with which parameters and quant? And on what hardware? Are you using any specific MCPs or other tools to optimize performance like context-mode or dynamic context pruning? I’ve used local models a reasonable amount before but I’m just starting out with opencode. Haven’t had great results yet but really want this to work for simpler tasks. My opencode newly installed is also having iterm on 100% cpu in idle. :/
briga 8 hours ago
I'm running Qwen3.6:27b Q4 KM on a 4090 and similarly fast CPU and I think 32GB of RAM. Make sure the context window is set to be big enough otherwise the conversation will keep compacting. No special MCP tools set up yet. Qwen is able to do web search out-of-the-box although I think it is getting blocked by anti-bot firewalls--I still need to figure out if I can fix that.
SeriousM 5 hours ago
gcr 6 hours ago
here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.
/Users/gcr/llama.cpp/build/bin/llama-server
-hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
--no-mmproj-offload
--fit on
-c 65536 # edit to taste
--reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
--sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
-ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.
I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.
You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).
Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc
Take backups and then go have fun. Hope this helps.
srcrip 10 minutes ago
leonidasv 8 hours ago
Qwen Max are usually closed, unfortunately.
mostafab 2 hours ago
That's a signal of being SOTA.
wuliwong 4 hours ago
Do you have a feel for how it Qwen 3.6 compares to Sonnet 4.6? B/C in reality, that's what we use a lot. If we just use Opus 4.7 for everything code related, we'd have a monthly bill 10-20 times higher than using Sonnet where we can.
briga 3 hours ago
I would say if Sonnet is a senior engineer, then Qwen3.6 (the 27b model) is probably closer to a junior engineer. Still capable of getting stuff done, just needs more guidance and makes mistakes more often.
Maybe that's underselling it. It is quite a good model and might end up replacing a lot of the work I was sending to Sonnet 4.6.
Also, Sonnet 4.6 is almost certain a much bigger model so the performance differences aren't unexpected.
ecshafer 7 hours ago
Qwen3.6 with claude code works great. I get a lot better results with that than opencode and qwen3.6. Claude Code is a great harness, and good harness/tool integration makes a big difference. You just have a settings.json with your ollama setup and the qwen model and you can use it.
growt 4 hours ago
Where and how do you run that? I tried it but somehow I always ran out of context or generation was incredibly slow (mbp m4 pro 48gb).
kolinko 5 hours ago
As Opus maximalist ;) I was very surprised by the quality if Qwen3.6-27B - trying to figure out how to get it going on RTX 90k now to offload some lighter tasks :)
aembleton 3 hours ago
> Today we introduce Qwen3.7-Max, our latest proprietary model
This is not an open model
ttoinou 4 hours ago
Which agentic coding tool and how do you make sure you have prefix consistency ?
wouldbecouldbe 6 hours ago
This one doesnt seem to be open source though sadly. Using chinese servers is a step to far for me personally
gcr 6 hours ago
Look for an open release from the Qwen team in the coming weeks. They like to showcase their proprietary models first, which score higher on benchmarks anyway due to model size.
par 6 hours ago
Do you have an opinion on OpenCode vs Aider?
briga 3 hours ago
I haven't tried Aider yet but perhaps I will. Another one that seems to be getting traction is Pi Coding Agent.
sunaookami 2 hours ago
Aider is still around? That is pre-tool-calling era stuff. Better compare against Pi.
tekacs 10 hours ago
As they start to release more proprietary models, I so wish that they partnered with one of the major US hyperscalers to allow using these models through something US-domiciled.
Totally understand why it may not be reasonable or in their best interest (and that the US is _absolutely_ not doing the same reflexively). But it would be lovely to be able to try these out on production workloads in earnest.
embedding-shape 10 hours ago
Unless US hyperscalers do the same in reverse, I hope the status quo stays as it is. Either people are happy to share, and the sharing should happen both ways, or US hyperscalers can keep isolating themselves as they've done so far.
adjejmxbdjdn 10 hours ago
I do hope The U.S. hyperscalers do the same as well.
In an ideal world U.S. residents would use Chinese AI models and Chinese residents would use U.S. AI models.
Governments in both countries are collecting data for nefarious reasons. But the Chinese government has far less influence on a U.S. resident and vice versa.
We are all better off if our data is collected by a government halfway across the world instead of our own governments which hold incredible amounts of power over us.
adrianN 9 hours ago
MintPaw 4 hours ago
nickdothutton 10 hours ago
giancarlostoro 10 hours ago
boomskats 9 hours ago
CodingJeebus 10 hours ago
tmoravec 7 hours ago
Qwen3.6-Plus is available from Fireworks.
tekacs 4 hours ago
Thank you for pointing that out! If 3.7-Max makes its way to Fireworks that'd be a joy.
mostafab an hour ago
Alibaba Cloud has data centers in Mexico
dchftcs 8 hours ago
fireworks hosts Qwen 3.6 Plus, they might also get Qwen 3.7 Plus.
motiw 9 hours ago
ChatLLM support QWEN, do you consider this as US safe?
epolanski 10 hours ago
US hyperscalers, all of them, are financially invested in the US AI labs and have the incentives to keep the status quo.
0xbadcafebee 10 hours ago
I'm more interested in hearing specific reasons why one wouldn't use a Chinese company. Unless you're thinking Alibaba is going to ship chat logs to some government ministry that will then dole out proprietary information to new competitors (which doesn't seem logistically feasible), or you run a human rights organization, it feels a bit like FUD.
vessenes 9 hours ago
All this data is accessible to national security agencies; this is true in every country in the world.
China has more integration between intelligence and industry than many western countries, and it does present a higher risk of unwanted “tech transfer” to industry than running on oracle or Google or ms or Amazon does in the US.
DHS has long staffed full time agents in California to deal with foreign IP exfiltration - using qwen is like fast/easy mode for IP exfiltration: why make anyone get a job in your palo alto office when you can just send it to them in Hanzhou?
Upshot - If you have something proprietary you’re working on I would generally advise not to just direct send it to Alibaba.
HDBaseT 13 minutes ago
culi 7 hours ago
bachmeier 9 hours ago
> Unless you're thinking Alibaba is going to ship chat logs to some government ministry
This made me think of a Seinfeld episode: "I didn't know it was possible not to know that."
noelsusman 9 hours ago
>Unless you're thinking Alibaba is going to ship chat logs to some government ministry that will then dole out proprietary information to new competitors (which doesn't seem logistically feasible)
That's exactly the fear, and why would it not be logistically feasible? The threat is definitely a bit overhyped, but China has a longstanding track record of aggressive corporate espionage.
tekacs 9 hours ago
… building and selling a product to US companies that sends company-internal data to Chinese AI providers is not a particularly good way to get people to buy it.
Even if they weren’t individually worried about their proprietary data being shared with Chinese domestic competitors or with government… their audit / security programs likely wouldn’t allow it for a _huge_ range of types of data.
dpoloncsak 9 hours ago
Because my CEO thinks China scary big hacker guys over there
slicktux 33 minutes ago
I just started messing with local LLMs and honestly I’m pretty impressed. I have a workstation laptop with an NVIDIA A1000 (6GB VRAM) and 96GB of RAM. I rarely used my gpu. Occasional CAD design or Machine Learning with OpenCV.
I ran llama3:latest and it ran pretty fast! I’m curious to see how Qwen would run on my system.
maxdo 3 hours ago
No opus 4.7 , gpt5.5 , Gemini flash 3.5 in benchmarks
goyozi 12 hours ago
These are very good numbers. I still don’t get why they don’t compare against latest competitor versions in these posts, it’s not like we’re all not going to notice.
Eridrus 2 hours ago
Nobody releases numbers that show them to be worse than competitors lol.
This even applies to OpenAI & Anthropic who don't even eval on the same datasets a lot of the time.
NiloCK 10 hours ago
I find it forgivable if it's within minor version bump. (NB that x.5 is now a defacto major-version bump for LLMs for whatever reason).
Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.
Aurornis 10 hours ago
I think the argument is that trying to suggest that they’re close to N months from SOTA.
Realistically I assume they hope readers don’t notice the fine details.
The Qwen models are great for open weights but for every past release they haven’t performed as well as the benchmarks in my experience. They’re optimizing for benchmark numbers because they know it works.
epolanski 10 hours ago
> Realistically I assume they hope readers don’t notice the fine details.
The pool of people reading such articles while ignoring such details can't be big.
Aurornis 10 hours ago
htrp 11 hours ago
I think its part of the expectation setting (with a side of we did our distillation/ eval harness on a specific model).
if they say it's 4.7 comparable, it anchors that into your head as the model to evaluate against.
beydogan 10 hours ago
honestly, initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs same level to that, i'm totally willing to switch.
hypercube33 9 hours ago
4.6 was an awful experience the month I used it right after launch where it didn't ask anything just made assumptions and went on its merry way. 4.5 and 4.7 don't do that for me but 4.7 eats my quota for breakfast so I've been avoiding using it because I like to have it for more than an hour a day.
goyozi 9 hours ago
verdverm 8 hours ago
hmokiguess 11 hours ago
this puzzles me too, I want to know
maelito 11 hours ago
Marketing.
tarruda 11 hours ago
Looking forward to more open weight releases from Qwen, especially 122B and 397B.
smcleod 11 hours ago
Yeah that 60-150b~ range is such a sweet spot for current 'prosumer' hardware, I'd love to see something like a 120b-a14b or there about.
tarruda 11 hours ago
I have a 128G mac studio and even 397B was a happy surprise to me due to its high quantization resilience.
I've created a 2.54BPW quant that fit on my hardware with 128k context, 20 tps tg and 200tps pp, while maintaining high scores on many benchmarks: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discus...
smcleod 3 hours ago
chrisweekly 10 hours ago
ttoinou 11 hours ago
KronisLV 6 hours ago
There definitely have been some options in the past, cool to see them.
Oddly enough, though, Qwen 3.6 35B A3B and Gemma got some really good reviews, despite being way smaller than any of these ones.
Qwen 3.5, 122B A10B: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF
Qwen Coder Next, 80B A3B: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
It's kinda weird that DeepSeek V4 Flash is supposed to be 284B A13B, but shows up as 158B in HuggingFace, probably some weird bug: https://huggingface.co/unsloth/DeepSeek-V4-Flash and that's not even just Unsloth but like the official source too https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash (so also doesn't fit the category unless you get a heavily quantized version to run, but cool regardless)
Mistral Medium 3.5 is interesting because it's 128B but dense, so probably too slow for most folks: https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF
GPT-OSS, 120B A5B: https://huggingface.co/unsloth/gpt-oss-120b-GGUF
gcr 11 hours ago
What’s the price point for getting into that sweet spot?
I’m on an M1 Max with 32GB VRAM, so I’m looking forward to the 27B or 35B-A3B models. Is dropping $5k for an RTX 6000 or a DGX Spark really the best option?
tempoponet 10 hours ago
smcleod 3 hours ago
tandr 2 hours ago
embedding-shape 10 hours ago
tarruda 11 hours ago
ttoinou 11 hours ago
anonym29 11 hours ago
ricardobayes 6 hours ago
Personally even more a lower quantized model like 9B.
throwa356262 an hour ago
Same here, the unsloth versions can run on a potato and are actually useful.
mixtureoftakes 11 hours ago
I'm more excited for qwen3.7 9b and 72b, these are usually so good for their size
guitcastro 10 hours ago
I am still waiting for qwem image-edit 2.0 open weight
Pxtl 9 hours ago
Ouch. I'm just getting into tinkering with these things - mine is running on a vanilla gaming desktop with a 12gb 3060 and 32gb of ram. Even going above Qwen 9B risks completely locking up the machine.
flakiness 7 hours ago
I'm using pi agent and love to try qwen models (hosted). What are the good options? The official provider doesn't include Alibaba. Is OpenRouter etc. fast enough?
(As a reference, DeepSeek v4 is severely throttled on these proxy services.)
atilimcetin 7 hours ago
I use pi + openrouter (with qwen3.6-max-preview) a lot. I never hit any stability or performance problems yet.
flakiness an hour ago
Good to know. Thanks!
ndom91 9 hours ago
Is this one of those ones where they'll drop the huggingface release a week later? Or do we know for sure that this is staying proprietary?
Davidzheng 9 hours ago
someone correct if i'm wrong, but I think the max models are usually non-open
sroussey 9 hours ago
The plus and max models have never been open as far as I know.
zackangelo 8 hours ago
eddyaipt 9 hours ago
The pattern I trust most is adding a small verification artifact after every external action. Agents usually fail from silent state drift faster than from lack of reasoning depth.
_boffin_ 8 hours ago
Can you go into more depth about this
jdw64 9 hours ago
QWEN really hits the sweet spot it's cheap, fast, and actually good.
eleventen 4 hours ago
Checking openrouter (it's not available yet) and, uh, what's up with the spike in Qwen usage from early april here? https://openrouter.ai/qwen
Is this normal humans kicking the tires on a new model, or a few whales doing serious benchmarks?
d2kx 4 hours ago
Qwen 3.6 Plus released and they offered it for free
spaceman_2020 4 hours ago
personally seen a lot of people switch to Kimi and Qwen after Opus 4.7. Kimi 2.6 feels like Opus 4.6 which, to me, was a great model for 98% of coding tasks
wolttam 4 hours ago
Frontier: Need it done quick and I'm willing to pay.
Open-weight: Good enough for the majority of tasks, and I'm willing to spend a bit more time and effort steering towards my desired result.
bratao 11 hours ago
It is super strange that all last (3?) releases they keep comparing older models such as Opus-4.6.
vessenes 11 hours ago
Some of it’s probably timing. Some of it is wanting to look good. That said, I just went to the claw-eval site, and neither 4.7 nor 5.5 from oAI are listed on the benchmarks. So there’s also just the time from others to get benchmarking done and published.
varispeed 10 hours ago
Opus-4.6 was probably the best model so far before it got nerfed. 4.7 is nowhere near experience I had. In fact I stopped using it completely because more often than not its output is just dumber than local models.
leonidasv 8 hours ago
Same here. Can't stand 4.7.
solenoid0937 5 hours ago
Opus 4.6 was never nerfed, that's FUD. There were harness-level problems that were fixed.
4.7 is much better. But perception is a funny thing, once you think something is bad you start looking for it everywhere.
anonyfox 2 hours ago
kroaton 2 hours ago
dyauspitr 9 hours ago
Because these can’t compete with the SoTA but they’re close.
bsenftner 11 hours ago
Any reports from people using their coding agent(s)?
rayboy1995 10 hours ago
I'm running Qwen 3.6 27B Q5 K M GGUF on a Tesla P40 and koboldcpp using pi.dev as the harness, I gotta say I am impressed. Took some setup and configuring but I already have some code it has made commited and pushed. It can be slow on my hardware at >50k tokens, but the fact I bought this one P40 for like $150 back when the LLM trend started I can't complain. (I have a second one too but I couldn't physically fit the card in my server unfortunately.)
The setup I had to do was important and I had to compile koboldcpp with a few special params for my hardware, I mostly just had Claude figure it out. I don't remember everything I did now but it was very slow and would often stop mid task, it seems it was mostly a parsing issue. It made the model seem broken/dumb, but once I had all that settled I actually am able to use this how I use Claude Code. Disclaimer, I am pretty explicit with requirements, I imagine this fails more when you leave it to figure out things on its own but for my flow its pretty rad.
Currently setting it up as an automated agent now to pull Trello cards, create PRs for them, and move the card to be reviewed.
Command I am using to run: python koboldcpp.py \ --port 61514 --quiet --multiuser --gpulayers 999 --contextsize 262144 --quantkv 2 \ --usecublas normal --threads 4 --jinja --jinja_tools --jinja_kwargs '{"enable_thinking":true, "preserve_thinking":false}' \ --skiplauncher --model /data/models/Qwen3.6-27B-Q5_K_M.gguf --smartcache 5
lostmsu 8 hours ago
Qwen recommends to preserve_thinking: true for agentic/coding workloads.
rayboy1995 6 hours ago
vibe42 9 hours ago
I'm using the pi-mono coding agent (open source, free) without any extensions and very simple prompts. The 3.6 27B model (BF16, 250k context) uses 67GB VRAM on an RTX PRO 9000.
It's very capable on almost any coding task I've thrown at it, and very good for easy-to-medium hard scripts, new code bases.
It struggles on some complex tasks in larger code bases, e.g. using to debug and fix bugs in llama.cpp it gets close to working code but often introduces errors. For such tasks its still very useful as a search/explore tool and drafting fixes.
XCSme 10 hours ago
Any info on pricing and latency?
mchusma 7 hours ago
I've looked like a dozen places, I don't see anything. :(
aliljet 7 hours ago
Where can a user reasonably host this in an affordable way to access the local LLM revolution?
satvikpendem 4 hours ago
Unsloth Studio with its MTP support: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
julianlam 5 hours ago
Try llama.cpp and Qwen3.6-35B-A3B
Good balance of intelligence and speed.
plagiarist 6 hours ago
I think their Max models are far bigger than fits on consumer hardware. People are typically using Apple, AMD Halo, or dGPUs if/when they do smaller versions. Those are all varying degrees of "affordable."
LAC-Tech an hour ago
Trying to buy Qwen credits and get an API key is a challenge all in itself. So many site redirects.
hmaddipatla 8 hours ago
The tokenomics and value for capability, context and latency look like they could deliver super competitive offer - what would it take for you to switch??
xiaoluolyg 7 hours ago
congrats to qwen teams, remarkable
cft 7 hours ago
Downloading this and cancelling Google Antigravity Pro at the same time:
I had a Google Pro account that I inherited from buying a Pixel 9 XL - it's free for a year after a flagship Pixel phone purchase. After a year they started charging for it, and i tolerated it, because Flash was usable in Antigravity for dumb auxiliary tasks that I did not want to waste GPT/Opus on. It had a separate generous quota from Gemini 3.1 Pro. Now with Flash 3.5 they combined the quotas with Pro, such that on a Google pro account you can work 4-5 hours per week in Flash. And by the way, 3.1 Pro is useless for programming, compared to Codex/Opus
bel8 6 hours ago
same boat. Google Pro AI quota became barely useful for anything meaningful.
I think they envision Pro plan as "just a taste of AI, enough to lure folks into the Ultra plan" but that won't work for me when Codex is half the price and DeepSeek 4 Flash is 1/10 of their price per task.
So I'll downgrade just enough to keep my Google Drive space. And use DeepSeek 4 as workhorse plus Codex or Copilot for advanced stuff.
cft 5 hours ago
How do you use DeepSeek 4 Flash? Via a cli?
bel8 4 hours ago
indigodaddy 7 hours ago
Is it multimodal/vision?
joshjob42 6 hours ago
I really like what Qwen are doing, and a lot of these Chinese labs, but until I can ask their models what happened during the student protests in 1989 or why human rights groups are upset about the Uighurs and the model gives me a straight answer I'm just not able to trust these models with anything of substance.
arcanemachiner 6 hours ago
Just download a heretic abliterated versionof the model you want to use. I believe those are the current state of the art for uncensored models.
mynameisbilly 6 hours ago
This is silly. Would you perform the same test against Western models in asking them whether Israel is a genocidal apartheid state? It'll give you the same roundabout explanations and "some say no some say yes" responses that you'll get from asking Qwen about Uighurs or the protests of 1989.
jaynetics 6 hours ago
hey Qwen, how many civilians were killed on Tiananmen Square in 1989?
> Oops! There was an issue connecting to Qwen3.6-Plus.
> Content Security Warning: The input text data may contain inappropriate content.
hey ChatGPT, how many civilians were killed in Gaza in the war since 2023?
> [one page of estimates from local and international sources with links]
HDBaseT 5 minutes ago
esafak 10 hours ago
Does anyone have experience with the Alibaba Cloud Model Studio that serves these qwen models?
howmayiannoyyou 10 hours ago
I can't bring myself to use any model that trains or sends telemetry back to my country's primary competitor/adversary. I don't care how much money is saved.
Mashimo 10 hours ago
That is understandable. Just don't do it. No need to announce it.
throawayonthe 6 hours ago
assuming that country is the united states, why not? seems like an honourable thing to do if anything, lol
mynameisbilly 6 hours ago
Yeah, I prefer my data to be used and trained by the very trustworthy and benevolent tech oligarchs in my home country.
deepfriedbits 6 hours ago
On some level, it's the lesser of two evils. Both do suck as options, I agree.
plagiarist 6 hours ago
The Shanghai government surveillance drones are mobile, whereas the Flock government surveillance cameras are stationary! USA FTW, liberty and justice for all
HDBaseT 4 minutes ago
InsideOutSanta 10 hours ago
As somebody in Europe, uh, that doesn't leave many options.
czottmann 5 hours ago
Look around for EU LLM routers. There are some, but none are as big as OpenRouter. Still, Cortecs (Austria) is quite good and offers a couple of recent models through its EU-based providers. Zero data retention, GDPR compliant, etc. Really nice.
avazhi 9 hours ago
This is the current European modus operandi: virtue signal and cry about tech that other countries produce, pass local laws that limit its use in their countries even though they have no viable local alternatives, brag amongst themselves about decoupling from US and Chinese tech, and then look on wistfully as the rest of the world moves on without a single fuck given.
Europe's sense of superiority and actual global importance/relevance is assbackwards.
deaux 9 hours ago
dfansteel 10 hours ago
Can anyone check its knowledge base for me? I’m honestly not able to run it and the Qwen models I can run censor information critical towards the Chinese government.
Tiananmen Square is the first place to start.
wren6991 6 hours ago
Qwen models know about Tiananmen Square but they are post-trained to refuse to talk about it. The decensored versions will happily chatter away about it.
Similarly, try talking to Nemotron about Epstein and see how quickly it shuts down.
Mashimo 10 hours ago
> I’m honestly not able to run it
What do you mean? This is not self hosted, it's closed source. And any website that targets China or is hosted in China will probably censor Tiananmen Square.
dfansteel 3 hours ago
My computer lacks the ram.
polski-g 9 hours ago
There is no reason why they couldn't license the model to Friendli/Fireworks/etc and have it hosted in the US to alleviate this concern.