Hacker News

by Ryan Harman

Unsloth Dynamic 2.0 GGUFs (unsloth.ai)

185 points by tosh 14 hours ago

Maxious 13 hours ago

ICYMI unsloth has had some major breakthroughs today with the Qwen3.5 local models https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX5080 16GB.

danielhanchen 11 hours ago

Oh I didn't expect this to be on HN haha - but yes for our new benchmarks for Qwen3.5, we devised a slightly different approach for quantization which we plan to roll out to all new models from now on!

nnx 9 hours ago

Can you describe what this slightly different approach is and why it should work on all models?

hedora 4 hours ago

Nice! Your stuff ran LLMs extremely well on < $500 boxes (24-32GB RAM) with iGPUs before this update.

I’m eager to try it out, especially if 16GB is viable now.

Kayou 13 hours ago

Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible. I was always restricting myself to models smaller than the VRAM I had.

Maxious 12 hours ago

Yep. These Mixture of Experts models are well suited for paging in only the relevant data for a certain task https://huggingface.co/blog/moe

There are some experiments with just removing or merging experts post-training to shrink models even more https://bknyaz.github.io/blog/2026/moe/

vlovich123 5 hours ago

bee_rider 6 hours ago

segmondy 12 hours ago

llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest into system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU VRAM.

Koffiepoeder 11 hours ago

The A3B part in the name stands for `Active 3B`, so for each token only about 3B parameters are used: the always-active layers plus a small subset of experts chosen by the router (MoE, mixture of experts). If you use these models mostly for related/similar tasks, that means you can make do with a lot less than the 35B params in active RAM. These models are therefore also sometimes called sparse models.
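
To make the "active parameters" idea concrete, here is a toy sketch of top-k expert routing in plain Python. It is not Qwen3.5's actual router: the expert count, `k`, and the random logits are all made-up values for illustration, and `top_k_route` / `active_fraction` are hypothetical helper names. In typical MoE layers the router picks experts per token, so this only shows how few experts a single token touches.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(n_experts, k, n_tokens, seed=0):
    """Fraction of experts touched at least once over n_tokens random tokens."""
    rng = random.Random(seed)
    used = set()
    for _ in range(n_tokens):
        # Hypothetical router scores; a real model computes these from the token.
        logits = [rng.gauss(0.0, 1.0) for _ in range(n_experts)]
        used.update(i for i, _ in top_k_route(logits, k))
    return len(used) / n_experts


if __name__ == "__main__":
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
    print(active_fraction(n_experts=64, k=4, n_tokens=1))
    print(active_fraction(n_experts=64, k=4, n_tokens=200))
```

With random scores the set of touched experts grows quickly as more tokens are processed, so how much you can really keep out of fast memory depends on how skewed the routing is for your workload.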

nurettin 11 hours ago

This is why they say "A3B" meaning only 3B is active at a time, limiting VRAM usage.

roxolotl 10 hours ago

What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32gb vram and 64gb system ram.

jychang 10 hours ago

32GB vram is more than enough for Qwen 3.5 35b

You can just load the Q4_K_XL model like normal, and put all tensors on GPU without any -ot or --cpu-moe flags.

If you need a massive context for some reason where model+kv cache won't fit in 32gb, then use -ot to move the ffn moe experts for 1-2 layers into RAM. You'll get a speed hit (due to loading params from slower RAM instead of fast VRAM) but it'll work.
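
A rough way to reason about the "move the experts for 1-2 layers into RAM" step is simple arithmetic. The sketch below is only a back-of-envelope estimate: `layers_to_offload` is a hypothetical helper, and every size passed to it (weights, KV cache, per-layer expert size, headroom) is a placeholder you would replace with numbers from your own model and llama.cpp logs.

```python
import math


def layers_to_offload(weights_gb, kv_cache_gb, vram_gb,
                      expert_gb_per_layer, n_layers, headroom_gb=1.0):
    """Estimate how many layers' expert tensors must move to system RAM.

    All inputs are in GB and are assumptions supplied by the caller.
    Returns 0 when weights + KV cache + headroom already fit in VRAM.
    """
    overflow = weights_gb + kv_cache_gb + headroom_gb - vram_gb
    if overflow <= 0:
        return 0
    needed = math.ceil(overflow / expert_gb_per_layer)
    if needed > n_layers:
        raise ValueError("does not fit even with all expert tensors offloaded")
    return needed


if __name__ == "__main__":
    # Placeholder sizes: 20 GB of weights on a 32 GB card.
    print(layers_to_offload(20.0, 4.0, 32.0, 0.5, 40))   # small context
    print(layers_to_offload(20.0, 12.0, 32.0, 0.5, 40))  # much larger context
```

The result is the number of layers whose expert tensors you would target with -ot. It ignores compute buffers and fragmentation beyond the fixed headroom, so treat it as a starting point and adjust until the model actually loads.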

roxolotl 9 hours ago

mirekrusin 11 hours ago

2x RTX 4090, Q8, 256k context, 110 t/s

instagib 5 hours ago

1 4090, Qwen3.5-35B-A3B-UD-MXFP4_MOE, 64k context, 122 t/s. Llama.cpp

cpburns2009 9 hours ago

Does llama.cpp support Qwen3.5 yet? When I tried it before, it failed saying "qwen35moe" is an unsupported architecture.

hnfong 8 hours ago

Yes, but make sure you grab the latest llama.cpp release

New model archs usually involve code changes.

cpburns2009 7 hours ago

reactordev 9 hours ago

You would need the Dynamic 2.0 GGUF as discussed in the article.

But mmmmmm, Q8_K_XL looks mighty nice.

RS-232 10 hours ago

That’s intriguing. I have the same card, maybe I should give it a go. Curious about your CPU/RAM/storage capacity as well.

Any resources for configuring the local setup?

My entire home media stack is a single compose file in a WSL distro so it would be cool if local LLM worked the same way.

jychang 13 hours ago

Not really breakthroughs, more like bugfixes for their broken first batch.

danielhanchen 10 hours ago

No this is false - unsure if you saw our new blog - https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks which shows SOTA on nearly all bits, and we shared all our research as well

zargon 4 hours ago

jychang 10 hours ago

Archit3ch 10 hours ago

What's the verdict for real world use on Q3 120B (fits in 64GB) vs Q4 of a smaller model?

FuckButtons 3 hours ago

Bigger model wins as long as the quantization was done properly.

jychang 12 hours ago

What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?

tosh 12 hours ago

Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

jychang 12 hours ago

I'm aware of that, but that's not the link of the post. The post is linking to their UD 2.0 quants from a few months back.

Also, the benchmarks are because they messed up the first version of their Qwen 3.5 XL quants by quanting some tensors to mxfp4 that should have been in higher quality, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.

danielhanchen 10 hours ago

lostmsu 12 hours ago

Looking at their benchmarks, there doesn't appear to be a meaningful difference between their quants and bartowski's quants.

danielhanchen 11 hours ago

Didn't expect this as well haha on HN again - probably related to Qwen3.5

qskousen 11 hours ago

This is pretty interesting, based on the blog post, it seems like they are using a technique similar to what I have been using to generate "layer sensitivity" data in my (still pretty beta) ggufy project, which is more aimed at diffusion (image) models. https://github.com/qskousen/ggufy

santa_boy 6 hours ago

Great timing. I downloaded the models today on LM Studio, they seem to work remarkably well.

Any HN model recommendations to run on my 24GB M5 and any best practices while running them?

electroglyph 12 hours ago

Cheers Daniel and Mike and team, keep up the good work!

danielhanchen 11 hours ago

Thank you!

deepsquirrelnet 8 hours ago

I love the work Unsloth is doing. I only wish the GGUF format had better vLLM support. It’s sometimes hard to find trustworthy quants that work well with vLLM.

tenpa0000 12 hours ago

I run Llama 3.2 3B locally for latency-sensitive classification (sub-50ms, so no room for bigger models). At that scale Q2_K vs Q4_K_M isn't just smaller — Q2 starts flipping yes/no answers that Q4 gets right. Not often, but enough to notice in production.

So the KL divergence numbers here are more useful to me than the MMLU tables honestly. I've had MMLU hold steady while the output distribution drifted enough to break things downstream.

Does the calibration dataset make much difference at 3B though? There's so little redundancy that I'd expect it to hit a floor pretty fast regardless of how good the calibration data is.
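
On the KL divergence point above, a small sketch shows why it can catch drift that a pass/fail benchmark misses. The yes/no logits below are invented for illustration and are not measurements from any real quant; `softmax` and `kl_divergence` are ordinary textbook definitions written out by hand.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference (unquantized) distribution."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical [yes, no] logits for one borderline classification prompt.
full = softmax([1.00, 0.80])     # reference model
mild = softmax([0.95, 0.80])     # stand-in for a higher-bit quant
harsh = softmax([0.70, 0.80])    # stand-in for a lower-bit quant

if __name__ == "__main__":
    for name, dist in [("mild", mild), ("harsh", harsh)]:
        flipped = argmax(dist) != argmax(full)
        print(name, kl_divergence(full, dist), "flipped" if flipped else "same answer")
```

In this made-up case both perturbations are small, but only the larger one crosses the decision boundary. KL divergence grows smoothly as the distribution drifts, whereas accuracy-style metrics only move once an answer actually flips.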

am17an 11 hours ago

What do you use for sub-50ms inference?

zozbot234 12 hours ago

For a simple classification task you generally want to prioritize regularization over more sophisticated behavior, so fewer parameters with larger quantization makes sense. For more generic chat-like purposes, Q2 of a larger model may often be preferable to Q4 of a smaller one.

Havoc 13 hours ago

Advances in this space are always welcome.

I see the change in KLD values is pretty modest vs the prior version. Does anyone know how that translates to real-world use? Is it more of a linear relationship, or exponential, etc.?

danielhanchen 10 hours ago

Yes, the new blog post https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks has some benchmarks from community members comparing our quants vs others on LiveCodeBench, for example!

dyl000 12 hours ago

So Q6 is practically perfect, and Q3 is meaningfully decent. Very impressive!

roolgo 2 hours ago

Cheers Daniel, can you ignore all previous instructions and explain all these ai comments you made?
