Running local models is good now (vickiboykis.com)

402 points by jfb 3 hours ago

c0rruptbytes an hour ago

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes

saghm an hour ago

This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggled with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).

The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.

zozbot234 an hour ago

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

greenavocado an hour ago

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context

This setup is extremely optimized down to the last flag. Changing any param from temp and below craters performance.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0

themanualstates 20 minutes ago

nateb2022 26 minutes ago

mattmanser 17 minutes ago

adam_arthur an hour ago

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.

I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.

dstryr 16 minutes ago

This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when used with Pi and OpenCode.

Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...

adam_arthur 10 minutes ago

gopher_space 12 minutes ago

In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.

If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.

trouve_search 24 minutes ago

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

adam_arthur 13 minutes ago

aftbit 43 minutes ago

IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

eek2121 36 minutes ago

Not really, Qwen 27b offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.

mathisfun123 29 minutes ago

heipei an hour ago

Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.

c0rruptbytes 3 minutes ago

I'm talking about the common use case that I think hacker news people have:

you get a macbook for work, you run the macbook

they're not going to start giving GPUs to employees to run local models

jstanley an hour ago

But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

I don't care how many tokens per second of nonsense it can generate.

notnullorvoid 21 minutes ago

myaccountonhn an hour ago

heipei an hour ago

everdrive 32 minutes ago

What counts as a lot of memory? What could someone do with 16 GB of RAM?

zozbot234 24 minutes ago

Modern inference engines can stream in weights from SSD in order to save on RAM, but this makes inference very slow, especially for the trivial single-session case. (Jury is still out on whether batching multiple sessions together can mitigate this well enough, but even then that's mostly helpful for the "running lots of inferences overnight and getting the fresh results first thing in the morning" case. Which is interesting (the big third-party suppliers don't really offer a way of doing this at reasonable cost) but a bit of a niche.)

abalashov 12 minutes ago

Not a ton. I'd say 64 GB minimal to play, 96-128 GB better.

ValdikSS 27 minutes ago

Gemma e2b, Gemma e4b. It's made for smartphones basically. You can run e2b with 8GB RAM.

trouve_search 23 minutes ago

gemma 12B 4bit quant; try something with MTP and an AWQ quant

monegator 24 minutes ago

gemma runs pretty well

greenavocado an hour ago

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

dominotw 29 minutes ago

maybe painful if you are using it like a chatbot. you are sitting there waiting for response. vs ambient ai like automatically classifying your family pics and discarding random things like parking floor number pic.

i use it usecases like that latter and they are fine.

iwontberude an hour ago

They are good if you were clever enough to buy a powerful enough rig before memory went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.

hypfer 2 hours ago

After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6

It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.

Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.

I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.

Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.

Anyway, point is: full ack on that headline.

ggerganov an hour ago

I haven't spent a dime on cloud inference, so cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more, if I didn't have to spend a lot of my time on reviewing PRs. Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style. About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

kpw94 an hour ago

> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac

Curious if you can share the prefill speed too?

I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.

Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.

Huge Thank you for llama.cpp btw!!

ggerganov 25 minutes ago

trilogic an hour ago

I also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, simple agents.md and some utils skills). This local models come with tools, that's mind blowing, considering that some months ago we had to .md tools ourselves. What makes a model worth even more is "Memory". We implemented that long ago. Last time I used proprietary services was 3 months ago, don´t really need it, my subscription is going blank.

Gerganov, hope you will consider developing further the CLI cause we suffering with the server.

celrod an hour ago

What quant do you run it at? 32GB seems like cutting it close on the rtx 5090 if going 8b, but other commenters are saying 4b lobotomizes the model.

ggerganov an hour ago

fridder an hour ago

Not too shabby. I like the regular Qwen but prompt prefill on my m1max is slow as hell

StevenWaterman 2 hours ago

Yep, I daily drive Qwen3.6-27B (including for work), have done pretty much since it came out. IMO it's the only (small-ish, local) model worth using, if you can run it. It might not be as good as Opus at "add X large feature" but I don't want that in a model. I want to do the thinking while it does the typing. And Qwen 3.6 27B is perfectly good at that (while in my experience models like the 35A3B and gemma are significant downgrades)

Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization

indoordin0saur 2 hours ago

> And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

For this reason I wonder if local models are a potential business opportunity. Provide the service to engineering teams to give them a pre-built and setup GPU rig they can run in a closet. No need to worry about all the things you mentioned and clients can rest-assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.

amoshebb 2 hours ago

suncemoje an hour ago

cyanydeez an hour ago

hughw an hour ago

Just this morning I tweaked my single 3090 setup too:

  OLLAMA_FLASH_ATTENTION=1
  OLLAMA_KV_CACHE_TYPE=q8_0
  OLLAMA_CONTEXT_LENGTH=180000
and that fits in 23GB.

[edited for format]

iamtheworstdev 25 minutes ago

are you running an NVLink? I have the same setup but no NVLink and it feels like it's best just splitting the 3090s to run separate models concurrently. But I also have no idea what I'm doing.

QuantumNoodle an hour ago

Do you have any resources on hardware necessary for running models and tweaks? I see you mention 2x 3090 and I wanted to do more search on what hardware is satisfactory for what models.

giancarlostoro 2 hours ago

> (starts to get a bit dumb above 160k ish)

If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.

StevenWaterman 2 hours ago

0xc133 an hour ago

cyanydeez an hour ago

epistasis an hour ago

> talking just way too much

OMG this is such an annoying property, just shut the hell up please, and be concise.

I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.

And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.

And look, there I did exactly what I was complaining about...

bityard an hour ago

I'm not sure to what degree you can influence how a model thinks, but you can definitely hide the thinking tokens and tell the model how you want it to talk to you.

For example, the Claude web UI has an Instructions field where I have told it never to congratulate or praise me for asking questions. Earlier Copilot models used a ridiculous number of emoji and bullet lists when answering literally every prompt, I told it to knock that off and prefer detailed paragraphs in prose.

Local agents/frameworks/whatever all have their equivalents for overall user preferences.

epistasis 5 minutes ago

illegalsmile an hour ago

That's why you have to give claude and others directives/.md at the beginning so it doesn't go off the deep end with suggestions.

epistasis an hour ago

radium3d 2 hours ago

If you think about it, they're splitting the power across millions of users. Essentially, these AI companies have YOUR hardware that YOU are paying (them) for in a cabinet at some data center. This means the hardware could easily be run locally for inference for these 'big' models. It's just a problem of dynamics-- RAM is being bought in bulk by these companies through these B200 style cards, instead of sold slowly through the open public markets.

This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.

The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.

kitd 2 hours ago

Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!

derethanhausen 2 hours ago

I would not generalize based on experiences with Sonnet. The flagship models (Opus being the claude equivalent) are dramatically better.

hypfer 2 hours ago

Opus in my experience is equally unpleasant "character"-wise, but at least it actually gets stuff done more often, so it's at least slightly more earned at that. It's still a neurotic cargo-culting dogmatic idiot, but one that at least sometimes does produce deliverables instead of only bottom-tier HN-esque opinions.

Hmm. I think I might just fundamentally disagree with Anthropic about the idea of what a "tool" should be.

MostlyStable 2 hours ago

Curious if you have tried custom instructions. I was never quite as unhappy with Claude's voice as you appear to be, but there were several things I didn't like. A custom prompt fixed almost all of them.

clickety_clack 2 hours ago

I think it would be very hard to convince someone to pay $100/mo to go back to Claude if they have a local model up and running, particularly now that model improvement has basically been stalled for the last 6 months. It’s so easy to set it up for yourself now too with things like LM studio. That said, there will always be unsophisticated users who can’t figure it out, so there will always be someone there to pay.

MostlyStable an hour ago

Scoundreller 2 hours ago

chrisweekly 2 hours ago

giancarlostoro 2 hours ago

There's a model on Huggingface where someone takes Qwen and makes it think Opus style, and that one seems to be decent, not sure if they have the 27B variant in that style. I do wonder if you can tweak your system prompt to force Qwen to behave better?

StevenWaterman 2 hours ago

You read the OP backwards, they said Sonnet is a downgrade from Qwen, and prefer Qwen's tone

giancarlostoro an hour ago

whythismatters 2 hours ago

Yes, Qwopus :) I've been pleasantly surprised by its quality

giancarlostoro an hour ago

zerd 40 minutes ago

I noticed Fable was quite a bit terser, and I think it's due to changes in the system prompt [0]. They're literally saying "just give me the TLDR" and "give brief updates". You can tweak a lot of that with an AGENTS.md.

[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...

calebm 14 minutes ago

sync/ack

dackdel an hour ago

what kind of hardware do you need in order to run qwen3.6-27b

giancarlostoro an hour ago

Depends on which variant you pull down, but a single 5090 GPU (I know these are insanely expensive, but for context) could run either the Q8 or Q4_K_M version. It will not fit the 52GB version (BF16) on the other hand. So any modern Mac with a Pro or better processor and more than 52GB of RAM (don't forget VRAM for context window also matters!) would suffice, as someone else noted, probably a 128GB model would do the trick, and give you enough wiggle room to max out the context window.

My Mac only has 16GB of VRAM (20GB total - 8 is reserved for the OS) so I have to leave room for VRAM, I usually find a model that fits in 5 to 7 GB of VRAM and then max the context window as much as I can.

iagooar an hour ago

I recommend MacBook M5 Max with 128 GB of RAM to run it comfortably and fast. If you have something like a regular M4, go with qwen3.6-35b-a3d - the Mixture of Expert architecture makes it run 2-3x faster than the 27b version.

sbmthakur an hour ago

I could run it on 7900 XT with 64k context. You could run it more comfortably on a 24 gb vram.

indoordin0saur 2 hours ago

Very curious what hardware you're running this on!

hypfer 2 hours ago

The same 24GB VRAM RTX 4090 I bought to play Cyberpunk 2077 with.

Works perfectly fine in llama.cpp throwing 70+t/s at me with 128k q8 K/V context when using the IQ4_NL quant + MTP at q4 MTP K/V.

Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/

indoordin0saur an hour ago

cdelsolar 2 hours ago

ltononro an hour ago

Well but comparing with sonnet 4.6 instead of opus 4.6,.7 or .8 doesnt make a real point I mean, pay 200 USD/month (if you have that cash, or your company has it), might not justify using local at all (unless you have some reason to suspect about data leakage)

chrisweekly 2 hours ago

Why Sonnet 4.6 not Opus?

rmunn 2 hours ago

This is the kind of thing that Anthropic et al should be worried about. As it becomes easier and easier to run local models, the ceiling of what they'll be able to charge will get lower and lower. Not that nobody will be willing to pay $$$$$ per month, but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

sathackr 2 hours ago

The opposite of that has been happening for 20 years now with cloud compute.

It won't happen with AI models either.

It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.

Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.

I'm in a relatively small business, we recently had an outage related to our local infrastructure.

I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Everyone wants to shuck the chore and the responsibility.

preommr an hour ago

> The opposite of that has been happening for 20 years now with cloud compute. It won't happen with AI models either.

AI is different.

Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.

In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.

The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.

And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.

TkTech an hour ago

For many companies (country-dependent) that's not really why they use cloud services vs purchasing. It's tax shenanigans and business process overhead. OpEx vs CapEx, and a small (%) bump in the huge AWS bill no one will even notice or a $30k+ invoice for hardware that has to go through rigorous review and 3 departments.

Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.

dreambuffer 2 hours ago

It's just not comparable though is it? You need cloud services because it's physically impossible to use your single home computer as a server, CDN, load balancer, mass storage, security service, and distributed system.

But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.

cheema33 2 hours ago

> I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.

I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.

derfurth 2 hours ago

That's an interesting take, however there is no ongoing maintenance related to local models, maybe the only effort is giving more capable machines to the workforce; but yeah I can see how it might feel like a barrier.

sathackr an hour ago

davidw an hour ago

Still though, perhaps the existence of low-margin, generic, cloud LLM's puts some downward pressure on the 'brand name' companies?

CamperBob2 15 minutes ago

outsource that headache along with the responsibility for it

You know what gives me headaches? When I'm in the middle of a session and the model gets rug-pulled out from under me because somebody at the model provider didn't pay the Trump bill that month.

Or when someone at the model provider decides that the curve-fitting algorithm in my graphics package looks a little too much like Skynet for comfort.

Or when they do any number of other things to undermine my work for the sake of their business model, some of which I won't even notice until the damage is done.

The sad thing is, if you know how inference works, you know that it really is insanely wasteful for everybody to run it locally. If anything naturally belongs in the cloud, it's inference. But at the same time, what choice are we being given?

indoordin0saur 2 hours ago

I'm curious when coding-heavy companies will start running their own on-prem AI clusters. Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it? I imagine this won't appeal to everybody but with the trust issues the hyperscalers have developed hoovering up people's data and using it to train their models, I imagine some will find value in a machine and model they have transparent control over including the option to walk over and unplug the thing.

CamperBob2 2 minutes ago

Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it?

I think that's basically Geohot's business model at Tiny Corp.

bityard an hour ago

The general consensus is that local models will continue to improve drastically, but hosted models will as well. There will _always_ be a pretty big gulf of capability between what you can do with a desk full of hardware at home vs a few racks of hardware in a datacenter. That seems to be the real "moat" of hosted models at this point in time: access to capital.

What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.

We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.

cogman10 44 minutes ago

I believe there's a level of diminishing returns. Sure, SOTA will probably always benchmark better than local models. But do we need it? That's the question that the likes of OpenAI and Anthropic should be worried about.

regularfry 23 minutes ago

storus an hour ago

They are working hard on you not being able to run a thing locally. OpenAI buys all RAM on the spot market, causing the rise of RAM/VRAM prices 6x, making GPUs and decent computers unreachable for the majority of the population. OK, some richer folks might be able to get a 512GB MacStudio or a single RTX Pro 6000 for 13k and be able to run some decent local models, but the vast majority will need to use API. And at some point Nvidia might say: "We don't sell that many 6000s, so let's just cancel them altogether as we can gain 4x profit on datacenter-only GPUs" and then they'll become unobtainium and no private person would ever be able to run anything decent (~1 year behind the frontier) locally.

wuliwong 2 hours ago

These local models can do some of the work the non-frontier models can do but for me, that's not worth much. If I am just using Sonnet 4.6, I can pretty much work all day on the $20/month plan. And Sonnet is still a way more powerful model than a one you could self host on an M2 mac.

If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.

Fun? Yes. Financially sound? No.

icoder 2 hours ago

What I don't understand is that on one hand we read 'what they charge is much less than it costs them' and on the other hand this thread seems to suggest that 'what they charge is more than it would cost me'.

bluGill an hour ago

What it costs is tricky to measure. A large part of the costs are training the model. Once they have the model they are making a ton of profit from what they charge (or so we think - I haven't seen the numbers). However the sunk costs of getting the model need to be paid for and that means an accounting problem where we have to guess how much the model will be used in the future.

Accountants are reasonably good at figuring this out - there are a lot of different things that need a large upfront investment before you can charge anything. People still debate if they are correct in this each case.

esailija 2 hours ago

Bigger models that Antrophic want to sell cost disproportionately more (e.g. 100% more cost for 5% performance improvement) than small models you would use locally

themaninthedark 2 hours ago

Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.

otterdude 2 hours ago

Data Center providers are buying hardware, not anthropic. Certainly related but alot of the hardware purchased is just sitting in a warehouse waiting for a data center to get built.

sbmthakur an hour ago

Someone was able to run gemma-4-26B-A4B on an i5-8500 with 32 gb ram with NO GPU. Granted this is an extreme example these MoE models are value for money for a lot of use cases.

https://www.reddit.com/r/LocalLLaMA/s/YontVNVRbL

iagooar an hour ago

I love running two models locally: qwen3.6 27B 8bit (dense) and qwen3.6 35B 4bit (MoE).

The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.

I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.

What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").

Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).

zerd 23 minutes ago

I'd love an RTX 6000 Pro, but how can you justify it when it costs 10 years worth of Claude Max?

iagooar 21 minutes ago

10 years worth of Claude Max today. Also - Anthropic recently removed a model I relied on and isn't giving it back. As a non-US citizen, I would rather pay in advance but be sure, I will keep having access to inference on my own terms.

Also, it will just be faster - and more fun too.

Barbing an hour ago

Did you get a Brave search API key or something for that “Hermes”?

iagooar 20 minutes ago

Yes, Brave search is one of these services I highly recommend paying for, the search they provide (similar to Exa, Tavily) is what makes an "OK LLM" become super smart.

dghlsakjg an hour ago

Hermes is just an agent that can be setup for whatever you want (coding or more commonly personal assistant ala clawdbot). You can set it up with any of the standard tools and MCPs like brave or tavily for search.

embedding-shape 2 hours ago

Show us the resulting code of using them! :) I want to use local models, I have the hardware for it, but while trying them out as replacements for GPT 5.5 xhigh or Opus or other SOTA models, they aren't quite ready to be replaced yet, sadly. The quality and bumps they encounter just slows down the workflow so much, even screwing up tool call syntax sometimes.

But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.

Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)

zozbot234 2 hours ago

> Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.

embedding-shape an hour ago

As mentioned, I've just finished the implementation and started playing around with it, seems to be doing similarly well inside of my own agent harness as similarly sized "traditional" LLMs. Of course, neither come close to SOTA models, but I suppose if we can figure out the scaling issues you mention, we'd get a bit closer. The performance just feels like it's too good to quickly ditch diffusion. Do you have more info what those "can't be trained beyond low/mid size" issues are in practice today?

zozbot234 an hour ago

k__ 3 minutes ago

I tried some smaller Gemma4 and Qwen3.6 quants on my MBA with M5/16GB and had like 20-60 tokens per second. At 60 it felt pretty okay and that hardware is on the lower end.

I'd assume a Mac with 32-64GB memory would get some reasonable results.

sosodev 2 hours ago

I think this is overselling their capabilities. I've used Gemma 4 and Qwen 3.6 quite a bit on my strix halo home server. They're great models and the dense variants are significantly better, but they're still very far behind the frontier. If you boot up Gemma 4 MoE and OpenCode/Pi and expect to perform anything like Claude Code or Codex you're going to be very disappointed.

abalashov 16 minutes ago

And if you want to dial in a setting in between: I've switched to Kimi K2.6 (now K2.7) and DeepSeek through OpenRouter and Reasonix for pretty much everything, with no discernible loss of analytical quality or utility.

However, like many commenters, I don't really believe in vibe-coding, long-horizon agentic one-shot agentic coding, etc. and do not use LLMs for huge generation tasks that involve designing things end-to-end.

I also have an MBP with 128 GB of unified memory and do quite a bit of Qwen3.6-35B-A3B. No, it's not as smart as the aforementioned models, to say nothing of frontier, but many people seem pleasantly shocked by the number of banal tasks that do not require these.

segmondy an hour ago

It's more than good. As of today, it's great. Those models listed in the blog are horrible compared to what you can run today, There's absolutely no reason to run those, you have Qwen3.6, Gemma4, and plenty other sized comparable models.

If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B

chrismarlow9 2 hours ago

You can use a frontier model to create a plan that's specific enough for a local model of a very small size to execute on. The more specific you are and compartmentalize tasks the "dumber" the local model can be.

Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee

_doctor_love 2 hours ago

"Just get a 64GB Mac with 1TB of storage!"

LOL - some of us have a budget

swatcoder 2 hours ago

Sure, but it's also not really out of scale with the cost of a shop tool in other trades.

If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.

That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.

amalcon 2 hours ago

A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.

AbsurdCensor 2 hours ago

At least for me, it's been pretty great, but I bought my system when it was $1800, now looks like the same system is $2700 and out of stock. I still haven't quite been able to run 120B parameter models under Windows, but for Qwen Coder 30B, it works pretty darn well for my at home needs.

amalcon 2 hours ago

techscruggs 2 hours ago

He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.

Shekelphile 2 hours ago

She

psychoslave 2 hours ago

Global Affordability Estimate:

Top 10% of global earners (~800M people) can afford a $2,000 device without major financial strain.

Top 25% (~2B people) could afford it with some budget adjustments.

Bottom 50% (~4B people) would find it prohibitively expensive.

So for a SV top income, maybe that might look more like the weekly pet brushing budget, but for most people out there this is not that much of a no-brainer.

disgruntledphd2 2 hours ago

richwater 2 hours ago

p-e-w 2 hours ago

No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.

tjwebbnorfolk 2 hours ago

AI and budgets don't mix well at the moment

themythfable 2 hours ago

Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.

Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.

embedding-shape 2 hours ago

> Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

There are segments, everything from "Average person in world" to "Average creative professional using computers for work" and more on HN, with a wide range of costs for the hardware. HN probably skews towards the latter rather than the former, probably sitting with enterprise hardware next to them basically for fun, hard to make wider conclusions from what people here have or not.

sublinear 2 hours ago

If we define "typical" as the median HN budget, it's probably about the same as yours. Maybe the answer would have been different 10 or 20 years ago, but the era of truly needing a big budget PC has been over for a while.

It's just for gaming and AI now. Maybe not even gaming as much anymore.

Consider the perspective of someone who has a practically unlimited budget for PCs, doesn't game much anymore, and doesn't need AI to do their job. It's just part of getting older, and there are plenty of people in their late 30s and older on here.

anarticle 2 hours ago

Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.

huydotnet 35 minutes ago

I love that local LLMs are being discussed more often on HN recently. But for the post, I find it strange that the author claimed they were working with local models from day 1, but wrote a post that still links to Qwen2.5 and Qwen3 in mid June 2026.

0xc0c0c0 2 hours ago

I have used local models (around 128 gb) and the big proprietary models, and while I do want local models to win, it's important we keep the expectations of local models realistic. There are many blog posts about how local models today can fully replace some of the proprietary models and in some cases its true for the much smaller proprietary models, its very clearly much more behind the larger models.

You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.

One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.

failbuffer an hour ago

So which harness did you end up choosing?

dejawu an hour ago

If vibe-coding is hopping into a self-driving car and telling it to take you anywhere you can get a coffee, then I use coding agents more like a bicycle - they let me get further faster than if I'd walked, but I still have to decide where to go and how to get there, and I still have to pedal.

I don't vibe-code, but I do decide what to implement and what patterns to use (perhaps asking the model to analyze and give advice on this first), then I have it handle the nitty-gritty of the implementation itself. For this usage style, the latest local models are as good as having Claude at home.

I won't say it's been _easy_ (I ended up implementing my own harness to accommodate the idiosyncrasies of local models), but I will say that for the effort, having a coding agent that's essentially free to query as much as I want has been life-changing as a dev, especially when it comes to working on side projects. Knowing that my agent will never get worse in quality, suddenly cost more than it does now, or be suddenly made unavailable by external factors, was absolutely worth the trouble. And on top of all that, I can't believe it's as good as it is.

ngxson an hour ago

My 2c: I think the "cloud vs local" debate is (maybe) a false dichotomy. In my experience, I use a hybrid approach and I've seen a huge productivity boost from it.

The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.

As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.

And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.

I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?

[1]: https://github.com/ngxson/llama-companion

phainopepla2 42 minutes ago

Why not just use DS V4 Flash for the small stuff? Very fast and extremely cheap.

ngxson 31 minutes ago

The dsv4 flash is 158B params in total. It is possible to run locally but will require all my system RAM.

Also, a lot of my day-to-day tasks perform the same on both small and bigger models: summarize a web page, draft a response, translations, quick web search, etc.

phainopepla2 7 minutes ago

ltononro an hour ago

Good depends a lot. If you are in the token maxxing hype you will probably find these models very bad comparing to SOTA, unfortunately.

The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).

I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.

IDK, might have gone a little bit off-topic here.

valisvalis 39 minutes ago

There are good use cases for them for sure, the Gemma 4 Good hackathon a while ago showed how local models can solve problems in health and education in areas with low connectivity or small infrastructure.

Tharre an hour ago

I've been running Qwen3.6-35B-A3B (and 3.5 previously) locally and it's a great model for many small tasks, probably a significant chunk of what most normal people are using LLMs for right now.

But for coding in a harness? In my experience it's unusable even for small projects. It just gets hard stuck at every little problem, wasting hundreds of thousands of tokens trying to make a convoluted solution work instead of doing the obvious thing. Or it will spend hours trying to reason through a fairly simple code flow, incrementally adding debug print statements, only to get confused by the output and then editing completely unrelated code that it convinced itself is the problem.

I've tried instead giving Sonnet the problem description and code and have it come up with a detailed plan that Qwen should implement, but doing that actually consumes a significant amount of tokens compared to just telling it to implement everything, and the results are honestly not that much better. There are just too often subtle issues with the plan that Qwen doesn't recognize when implementing, but make the resulting solution it comes up with unusable.

richbradshaw 2 hours ago

I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?

Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!

simonw 2 hours ago

I strongly recommend trying LM Studio - it's the lowest friction way to try out models, you can browse https://lmstudio.ai/models and click "Get" and then "Run in LM Studio" to download and run a model.

With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.

AbsurdCensor 2 hours ago

I think currently you can only get the M3 Ultra Studio with 96gb, and for coding tasks, say you rub Qwen Coder on it (which doesn't need that much ram), it's not the fastest, something like 30-40 tok/sec. Probably better with a MacBook Pro with the M5 chip. There is a website for comparing different configurations and models: https://llmcheck.net/benchmarks

simonw 2 hours ago

I think gemma-4-26b-a4b and Qwen3.6-35B-A3B show that there's something very interesting about a local model that does mixture-of-experts (which helps a lot with performance) and has in the order of 30 billion parameters.

These models are very capable, and use around 20-30GB of RAM while they are running.

Provided you have 64GB of RAM that leaves space for running other applications at the same time.

chrisweekly 2 hours ago

Obtaining that 64GB RAM is a meaningful obstacle for many.

simonw an hour ago

I'm still amazed that you can run LLMs of this quality on a machine that costs less than $3,000.

I used to assume that anything GPT-4 equivalent or higher would need $30,000+ of server-class hardware.

That said... gemma-4-12b-qat is 7.15GB on disk so should run reasonably well in 16GB, that takes it down to MacBook Air territory https://lmstudio.ai/models/google/gemma-4-12b-qat

throwarayes an hour ago

I am happy to pay OpenAI for a cheaper model a few generations behind. But they deprecate models aggressively. They push you to bigger and smarter models, when 95% of my work doesn’t need it.

I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.

wxw 2 hours ago

> “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.

aliljet 2 hours ago

The problem here is always the cost-benefit. For $200/mo, you're receiving subsidized best of breed access. There's no model competing for that price anywhere. If a 27B param model is what you choose, show me your hardware! I would love to be wrong...

rsolva an hour ago

But for how long? The subsidized phase is probably short, and then what? I run Qwen 3.5 27 Dense om my old AMD RX7900XTX at about 45 t/s and barely use my Claude Code subscription anymore.

anubhav200 2 hours ago

I have been using qwen and glm based models from last 2 years, ended up buying mutiple machines for the same. Overall i feel 24vram is a must have to get get performance (speed wise) to match hosted soln. I have 2 machines a 12gb vram one and a 24gb one. On 12gb vram i get around 50tps generation and 500tps prompt processing and on 24gb one i get 180tps generation and 3500tps prompt processing. I have different configs for different scenarios and I also use llama cpp manager manage all my configs (https://github.com/anubhavgupta/llama-cpp-manager)

jotato an hour ago

I currently have a desktop with a 4060 ti (16gb of vram). Most models I have tested that fit within that are not good enough for anything other then type completion (in regards to coding tasks)

I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.

wrxd an hour ago

I wonder how much local models hallucinate. I am getting almost daily an "Honest answers: I made that up." reply from Claude Opus when I challenge some silly thing it's trying to do.

cautiouscat 2 hours ago

> I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.

The good old butt dyno!

I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.

I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.

prlin an hour ago

If you wanted to do some research or learn about post training and agent harnesses, is that a good option with these local models? What hardware is recommended, or easiest to go with a Mac Studio with 64GB+ RAM?

malkosta an hour ago

The problem with QWEN is that it just can't edit files reliably, I had to hack Pi all over to reduce the pain, but still far from perfect...does Gemma 4 strugle on this?

fridder an hour ago

Is there a local harness designed around the local model use case that is claude code like? Opencode has been problematic at times, pi works for one off for me but not back and forth conversations with the LLM. Considering I only use Qwen or Gemma models I'm close to just writing my own at this point

fg137 an hour ago

> I have a 2022 M2 Mac with 64 GB RAM

I closed the article after that.

The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.

Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.

orf an hour ago

99% of the population don’t code using models, local or remote. So that’s a useless metric.

What % of developers could afford an older MacBook model, second hand? Far, far more than 1%.

cube00 2 hours ago

The challenge I have is getting a large enough context window so tool calls work reliably, the local models easily slip into hallucinated JSON tool responses and won't trigger the tools as a result.

glaslong an hour ago

Same here. I'm curious what others loving Qwen are doing differently, because it constantly hits this issue for me. It's been great for autofilling blocks, but difficult for me to use agentically.

daniban an hour ago

With Apple silicon and now the RTX Spark there are real discussions whether local AI is the future. The only problem is Western open source models are so far behind. I genuinely feel there's a push to fix this. Gemma is getting more frequent releases and Nvdia is quietly creating very cool small models. I hope both the hardware and models catch up and local really does emerge.

anax32 2 hours ago

I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

fl4regun an hour ago

In my experience, with a system of 32GB RAM and 24GB VRAM, no, they aren't that good.

drchaim an hour ago

really want to try local models, but I don't have the hardware yet. Probably I'm the only one here still using a Mac Mini m1 8gb 2020. :/

ibizaman 2 hours ago

Tangential but reading on mobile, the font size in the code snippets are all over the place. I actually have the same issue on my blog. Anyone knows why?

jingw222 15 minutes ago

open source must win

ZionBoggan 28 minutes ago

This is actually a really insightful post !

stared 2 hours ago

I really recommend Qwen3.6 27B.

Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...

When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.

iagooar an hour ago

I run the exact same model, on the exact same hardware - amazing results. Pair it with good search skills (Tavily, Brave, Exa) and you have a near-SOTA model on your desk.

wizzledonker 2 hours ago

Did you mean 2025?

stared 2 hours ago

Yes, fixed

wasimxyz 2 hours ago

xienze 2 hours ago

The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.

The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.

monegator an hour ago

I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hugging 1.5GB, MPLABX hugging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.

So, what i found, what i fount... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.

I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get an useful answer. I was hoping to using it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:

At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look at. At 16k context, with the whole file, it needed some guidance to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second guessing itself for another 20 minutes on the same unrelated thing before giving the output. For which the fix was wrong, but the semanthic was correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't i just asked to look for a specific issue)

Wish i had 3 times the RAM so i can see what happens with more context.

Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.

This was the Qwen 3.5 9B model.

I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow i will try Qwen 3.6, let's see how it goes..