$500 GPU outperforms Claude Sonnet on coding benchmarks (github.com)
452 points by yogthos a day ago
bloppe 13 hours ago
Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.
bartread 9 hours ago
I agree. Also good for small changes that need to be applied consistently across an entire codebase.
I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated and also needed queries updating to exclude soft deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).
Of course, this is not hard to do manually, but it is a bloody chore and tends to be error-prone. But the agent made short work of it, for which I was very grateful.
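A minimal sketch of that kind of change, assuming a hypothetical `users` table with a nullable `deleted_at` column (the table, column, and function names are illustrative, not the actual schema). Deletes become updates, default reads filter out soft-deleted rows, and an admin path can opt back in:

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, deleted_at TEXT)"
    )
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("ann",), ("bob",)])
    return conn


def soft_delete(conn: sqlite3.Connection, user_id: int) -> None:
    # Mark the row as deleted instead of removing it.
    conn.execute(
        "UPDATE users SET deleted_at = datetime('now') WHERE id = ? AND deleted_at IS NULL",
        (user_id,),
    )


def restore(conn: sqlite3.Connection, user_id: int) -> None:
    # Admin path: undo an accidental deletion.
    conn.execute("UPDATE users SET deleted_at = NULL WHERE id = ?", (user_id,))


def list_users(conn: sqlite3.Connection, include_deleted: bool = False) -> list[str]:
    # Every normal read must exclude soft-deleted rows; admins can opt in.
    sql = "SELECT name FROM users"
    if not include_deleted:
        sql += " WHERE deleted_at IS NULL"
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql)]
```

The `include_deleted` flag is the part that has to be threaded through every query, which is where the chore (and the risk of missing one) comes from.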
CraigJPerry 8 hours ago
Do you not end up breaking half the value of referential integrity doing it that way? E.g. you had to update all the queries, but now you have a sharp edge in that all future queries need to remember to be soft-delete aware. Not a blocker for sure, just a sharp edge.
You know your system better than me for sure, a random commenter on a website :-D Your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete" and I felt compelled to give an unsolicited and likely wrong opinion.
bartread 4 hours ago
andyferris 8 hours ago
dakolli 4 hours ago
Must be something incredibly simple that you're making out to be more complicated than it actually is. I've never seen an LLM do these things well.
bartread 4 hours ago
sigmoid10 11 hours ago
Probably want to look at SWE bench pro or terminal bench 2. They cover these longer horizon tasks that need more than just writing a bit of code in one file. And SWE bench pro in particular is not yet saturated like many other common benchmarks. Normal SWE and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo readme or press release.
jakozaur 8 hours ago
Build systems are tested by CompileBench (Quesma's benchmark).
Disclaimer: I'm the founder.
slashdev 7 hours ago
Generating big chunks of code is all I do, all day.
I don't write code by hand any more, neither at work, nor for side projects.
I work mostly in Rust and TypeScript at a developer tools company.
Bombthecat 10 hours ago
Oh yes! I let my environments now be built by agents via kubectl / helm and let them debug issues.
It's amazing! Saves hours of work!
I create the basic helm config, settings, etc., and when there is a conflict or something is not working I let an agent fix it!
seunosewa 7 hours ago
Create it!
mmaunder 18 hours ago
I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence. The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
vidarh 8 hours ago
I get decent results with Kimi, but I agree with your overall premise. You do need to realise that while you can save money on a lot of tasks with those models, for the hardest tasks the "sticker price" of cost per million tokens isn't what matters.
It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Just not as much, since they are more forgiving, but put them in a harness that allows for various verification and repair steps and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.
I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.
If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.
thefourthchime 16 hours ago
I won’t use anything less than the SOTA. I tried using Opus 4.6 medium and immediately regretted it. High messes up enough.
overfeed 14 hours ago
What were you using 6 months ago?
withinboredom 13 hours ago
rf15 13 hours ago
You cannot afford the SOTA.
weird-eye-issue 13 hours ago
XCSme 18 hours ago
Yup, they do quite poorly on random non-coding tasks:
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
rmi_ 12 hours ago
Wild benchmark. Opus 4.6 is ranked #29, Gemini 3 Flash is #1, in front of Pro.
I'm not saying it's bad, but it's definitely different than the others.
XCSme 11 hours ago
usagisushi 15 hours ago
Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
wizee 16 hours ago
It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.
vidarh 3 hours ago
XCSme 16 hours ago
redoh 7 hours ago
raincole 10 hours ago
I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.
scotty79 10 hours ago
GLM 5 here is significantly better than GPT-5.4
comboy 11 hours ago
Not really related, but does anybody know if somebody's tracking the same model's performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.
XCSme 11 hours ago
miroljub 11 hours ago
> I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence.
I use MiniMax daily, mostly for coding tasks, using pi-coding-agent mostly.
> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.
I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.
> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.
What I notice is, while Opus and Sonnet feel better for synthetic benchmarks, it doesn't matter in the real world. I never put so much effort into coming up with a perfect problem spec like the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. And that's exactly what all those benchmarks are doing. And that's where Anthropic tools shine in comparison to cheaper Chinese models.
When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.
Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and model switch is sometimes required to get a fresh perspective on the problem.
The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.
tim-projects 10 hours ago
I've only been using free tokens for a year now. I was on Gemini, but they just dropped Pro, so I switched to MiniMax. Bit of a hurdle switching from Gemini-cli to kilo-cli, but now I can't really see too much difference.
If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.
I've not ever used Claude and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.
When I hit issues with these lower models I think hard about creating the right tooling, agnostic to the harness. I feel like maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days so why change what clearly works?
bethekind 5 hours ago
mongrelion 9 hours ago
What is this 10€ per month subscription that you are talking about?
harias 9 hours ago
moffkalast 11 hours ago
Kimi's been one of my goto options lately and it oftentimes outperforms both Claude and GPT in debugging, finding the actual problem immediately while the other two flail around drunkenly.
It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
smokel 11 hours ago
And what tooling do you use with that? In my experience, there is quite a bit of difference between using, say, OpenCode, or the commercial offerings.
moffkalast 11 hours ago
m00x 13 hours ago
Minimax 2.7 is fine for most web stuff. It's slightly worse than Claude at backend, but works great for frontend.
They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
Leynos 12 hours ago
Kimi is surprisingly good at Rust.
dvt 13 hours ago
> They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
stuaxo 10 hours ago
victorbjorklund 11 hours ago
yea, they are still useful. But yea not close to Claude or GPT. But works good for simple changes. I use a combo of minimax and codex
mkw2000 14 hours ago
i find kimi to be very very good, minimax not so much
paulddraper 14 hours ago
Agreed.
They are equivalent of frontier models 8+ months ago.
selcuka 19 hours ago
It's a race to the bottom. DeepSeek beats all others (single-shot), and it is ~50% cheaper than the cost of local electricity only.
> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot
> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline
strangescript 6 hours ago
I will "suffer" through .004 of electricity if I can run it on my own computer
sourcecodeplz 16 hours ago
I've tested many open models, Deepseek 3.2 is the only SOTA similar.
yogthos 18 hours ago
You could use this approach with DeepSeek as well. The innovation here is that you can generate a bunch of solutions, use a small model to pick promising candidates and then test them. Then you feed errors back to the generator model and iterate. In a way, it's sort of like a genetic algorithm that converges on a solution.
hu3 16 hours ago
Indeed but:
1) That is relatively very slow.
2) Can also be done, simpler even, with SoTA models over API.
yogthos 16 hours ago
eru 11 hours ago
Why do you need a small model to pick promising candidates? Why not a bigger one?
(And ideally you'd probably test first, or at least try to feed compiler errors back etc?)
Overall, I mostly agree.
yogthos 6 hours ago
alifeinbinary 4 hours ago
All those parameters and it still won't answer questions about Tiananmen Square in 1989... :(
viktorcode 3 hours ago
It will. The web chat has censorship features, but the model you can download doesn't.
mikestorrent 18 hours ago
> cheaper than the cost of local electricity only.
Can you explain what that means?
simonw 18 hours ago
I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.
Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.
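As a rough back-of-the-envelope sketch of that comparison: the electricity cost of a local run is power draw times wall-clock time times the tariff. The wattage, duration, and price below are made-up illustrative inputs, not measurements from the ATLAS repo.

```python
def electricity_cost_usd(watts: float, seconds: float, usd_per_kwh: float) -> float:
    """Electricity cost of running a load of `watts` for `seconds`."""
    kwh = watts * seconds / 3_600_000  # 1 kWh = 3.6e6 watt-seconds
    return kwh * usd_per_kwh


# Hypothetical inputs: a 150 W draw for 10 minutes at $0.16/kWh.
local_cost = electricity_cost_usd(watts=150, seconds=600, usd_per_kwh=0.16)
print(f"local electricity per task: ${local_cost:.4f}")
```

Under those assumed inputs the local run uses 0.025 kWh. Whether that beats an API price depends entirely on the real power draw, runtime, and tariff, and it ignores the cost of the hardware itself.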
BoredomIsFun 10 hours ago
jacquesm 16 hours ago
croes 14 hours ago
littlestymaar 16 hours ago
pbhjpbhj 6 hours ago
atoav 17 hours ago
It means that the electricity you would have to pay for if you did the computations yourself would be more expensive than paying them to do it. Part of that has to do with the fact that China has cheap electricity, also due to their massive push into renewables. Part of that is just economies of scale. A big server farm can run more efficiently than your PC on average.
AuthAuth 16 hours ago
jojobas 18 hours ago
China has cheap electricity.
ericd 18 hours ago
DeathArrow 11 hours ago
DanielHall 9 hours ago
These small models, having been fine-tuned for the test, achieve frighteningly high scores, yet perform abysmally in real-world scenarios.
memothon a day ago
I'm always skeptical because you can make it pass the benchmarks, then you use it and it is not practically useful unlike an extremely general model.
Cool work though, really excited for the potential of slimming down models.
kimixa 16 hours ago
I find it's often very language and sector dependent. I still see a massive difference in systems programming (normally c++ and rust) between any open model I've tried and something like sonnet 4.5 (not really tried 4.6). And honestly, even the big models (like Opus 4.6) struggle in many cases.
Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (minimax2.5, GLM-4.7, Qwen3, 3.5 and -coder variants) spends so much time trying to get something syntactically sensible and accepted by the compiler that when it has finished it barely seems to have any "momentum" left to actually solve the problems, as pretty much anything but the most trivial change ends up in another loop of actually trying to get it working again, often losing the intent of that change in the process.
My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, memory allocation (NO DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS DAMMIT) before it even gets to the "logic".
Having plenty of local GPU power I'd love to be able to actually use that, and I'm already wary about some of the training data use and its interactions with the license of the code I'm "sending" to the cloud models...
vidarh 2 hours ago
> Perhaps these things aren't well represented in the training data for these open models
I know from first-hand experience that at least a couple of the SOTA providers use third-party providers for supervised finetuning with instructions that are heavily geared towards a specific set of languages as well. But of course the base dataset from the major providers is likely to be sufficiently better that it matters less, and the big models are good enough at carrying over training that it at least seems like extra training on the core languages they care about at least somewhat carries over (you see this with natural language too - they do really well for many minor languages that make up a minuscule proportion of the training data).
(I won't say much more regarding the SFT/RLHF work due to NDAs - plural; I know who one of the providers is; I don't know who the one or more others are as the intermediary I did some work for obscured it well enough that I couldn't really violate the NDA even if I wanted to)
yogthos 21 hours ago
You obviously have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.
But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
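Roughly, the loop described above could be sketched like this. It is an illustration of the flow as I understand it, not the ATLAS code: `generate`, `embed`, `cost`, and `run_tests` are hypothetical callables standing in for the generator model, the embedding call, the Cost Field, and the sandboxed test run.

```python
from typing import Callable, Optional, Sequence, Tuple


def solve(
    generate: Callable[[Optional[str]], str],            # feedback -> candidate code
    embed: Callable[[str], Sequence[float]],             # code -> "fingerprint"
    cost: Callable[[Sequence[float]], float],            # fingerprint -> score (lower is better)
    run_tests: Callable[[str], Tuple[bool, Optional[str]]],  # code -> (passed, error)
    k: int = 3,
    max_rounds: int = 3,
) -> Optional[str]:
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # Generate k candidates, each told what went wrong last time.
        candidates = [generate(feedback) for _ in range(k)]
        # Score every fingerprint and keep only the lowest-cost candidate.
        best = min(candidates, key=lambda code: cost(embed(code)))
        # Only the selected candidate is run in the (slow) sandbox.
        passed, error = run_tests(best)
        if passed:
            return best
        feedback = error  # Feed the failure back into the next round.
    return None  # No passing solution within the budget.
```

The point of the scoring step is that `run_tests` is called once per round instead of `k` times.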
zar1048576 20 hours ago
Really intriguing set of techniques to improve accuracy by generating multiple solutions. Even with the work to predict the most likely solutions, it's not clear to me based on the description how this could all be done efficiently. Would definitely be really impressive if it pans out on real-world use cases. Will look to kick the tires on this if I can get some time.
naasking 7 hours ago
yogthos 20 hours ago
imtringued 8 hours ago
I tried to read the project documentation, but I got overwhelmed by the aimless AI generated documentation that has a nebulous goal of documenting absolutely everything, but never explaining anything.
If the author actually wanted to explain his project he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference time learning is cost prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference time learning to become viable and that's exactly what ATLAS does to achieve a 74.6% pass rate in LiveCodeBench and thereby outperforms Claude Sonnet with a small 14B open weight model that can be run locally on your $500 GPU."
This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.
Example: "combining constraint-driven generation, energy-based verification, self-verified iterative refinement, and adaptive routing"
That's a very long sequence of unexplained buzzwords that could mean absolutely anything.
MattRix 7 hours ago
I think this is because when you shrink it down, the model ends up space constrained and each “neuron” ends up having to do multiple duties. It can still be tuned to perform well at specific tasks, but no longer generalizes as well. It’s somewhat unintuitive but models that are larger are often simpler than smaller ones for this same reason.
tgiba 11 hours ago
Despite skepticism I love to see experiments like that. If we all are able to run an open source model locally on mid-high end machines I'd be very happy.
electroglyph 14 hours ago
what's with the weird "Geometric Lens routing" ?? sounds like a made up GPTism
b3ing 15 hours ago
Will open source or local llms kill the big AI providers eventually? If so when? I can see maybe basic chat, not sure about coding and images yet
Tuna-Fish an hour ago
Centralized inference is more economically efficient⁰, and should be cheaper for most users once competition squeezes the air out of token prices. Local inference remains very valid for anyone who wants to maintain their privacy, ofc.
0: Because the only way to get cache locality out of an LLM is to batch invocations. A centralized system where the server handles thousands of invocations at the same time needs only a tiny fraction of the total memory throughput that running all of those invocations locally on different machines would.
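A toy calculation of that batching effect, under simplifying assumptions that are mine: decoding is memory-bound, every decode step streams the full set of weights once regardless of batch size, and KV-cache and activation traffic are ignored. The 9 GB weight size is a made-up example.

```python
def weight_bytes_per_token(weight_bytes: float, batch_size: int) -> float:
    """Weight traffic attributable to each generated token.

    One decode step reads all weights once and produces one token per
    sequence in the batch, so the read is shared across `batch_size` tokens.
    """
    return weight_bytes / batch_size


local = weight_bytes_per_token(9e9, batch_size=1)        # one user, one machine
batched = weight_bytes_per_token(9e9, batch_size=1000)   # shared server
print(f"local: {local:.0f} B/token, batched: {batched:.0f} B/token")
```

In this simplified model, a thousand separate machines move a thousand times more weight bytes in total than one server batching the same requests.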
jillesvangurp 12 hours ago
Not necessarily kill; but it will slowly push them off the critical path. Local agents can delegate to remote sub agents as needed but should default to local processing for low cost and latency reasons.
I think the notion of a one-size-fits-all model, a bit like a sports car in the sense that you just get the biggest/fastest/best one, is overkill; you use bigger models when needed. But they use a lot of resources and cost you a lot. A lot of AI work isn't solving important math or algorithm problems. Or leet coding exercises. Most AI work is mundane plumbing work, summarizing, a bit of light scripting/programming, tool calling, etc. With skills and guard rails, you actually want agents to follow those rather than get too creative. And you want them to work relatively quickly and not overthink things. Latency is important. You can actually use guard rails to decide when to escalate to bigger models and when not to.
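A minimal sketch of that escalation idea, with hypothetical names throughout: `small` and `large` stand in for a local and a remote model call, and `passes_guardrails` is whatever check you trust (schema validation, tests, a linter).

```python
from typing import Callable, Tuple


def route(
    task: str,
    small: Callable[[str], str],
    large: Callable[[str], str],
    passes_guardrails: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the cheap local model first; escalate only if the check fails."""
    answer = small(task)
    if passes_guardrails(task, answer):
        return "small", answer
    # The guard rail rejected the local answer, so pay for the bigger model.
    return "large", large(task)
```

The design choice is that the guard rail, not the model, decides when to spend more; the default path stays local and low latency.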
throwaway85825 15 hours ago
Financial gravity will kill them when returns don't match stratospheric expectations.
bluefirebrand 14 hours ago
I hope so too, but I think it's wishful thinking. Be prepared for the mother of all financial bailouts from the world governments to make sure that doesn't happen
hollerith 14 hours ago
qingcharles 14 hours ago
Unless there are some really, really major shortcuts found in inference, then it's always going to be hard to run a really great model locally. The costs of the PC + electric will usually be crazy compared to a $20/mo Claude sub.
3836293648 11 hours ago
But that $20/month is still heavily subsidised. You have to compare to the API costs, not the direct subscription.
eigenspace 9 hours ago
It'd be nice if they do, but I don't really see how. Training these open-weight local LLMs is still insanely expensive and hard to do, even if it's cheaper and faster than what the big corps are doing.
I don't get the financial motive for someone to keep funding these open-weight model training programs other than just purposefully trying to kill the big AI providers.
nerbert 9 hours ago
Some open source models will cross the chasm, some big AI providers will too, and in both cases they will have their specific use cases.
freekh 13 hours ago
This has been my theory for a while: during this autumn Apple will release a version of Apple Intelligence that runs locally and works better than ChatGPT. They will do this because 1) they do not have an offering in AI yet 2) they have amazing hardware that even now almost can pull it off on open models and this will not be possible to replicate on android for a long time (presumably)
This will crush OpenAI.
Note: I am not talking about coding here - it will take a while longer but when it is optimized to the bone and llms output has stabilized, you will be running that too on local hardware. Cost will come down for Claude and friends too but why pay 5 when you can have it for free?
oarsinsync 11 hours ago
> This has been my theory for a while: during this autumn Apple will release a version of Apple Intelligence that runs locally and works better than ChatGPT.
In this theory, can you explain why Apple has announced it’s paying Google for Gemini too?
Eventually, this may be true. This autumn? Highly unlikely.
freekh 7 hours ago
rudolph9 an hour ago
When Apple gets their shit together.
CJefferson 14 hours ago
They won't for coding and images, but they will socially. Everyone I know who has invested in home AI use is mostly using it for 'things that might get you banned/limited'.
Mashimo 14 hours ago
I'm quite impressed what is possible with just 12 to 16 GB of vram in terms of image generation.
alkonaut 5 hours ago
Great, it became a $1000 gpu while you were reading that.
emp17344 17 hours ago
Yet more evidence that the harness matters more than the model.
riidom 20 hours ago
Not a word about the tok/sec, unfortunately.
arjie 18 hours ago
It won’t be meaningful considering the architecture: it’s a harness around the model that generates multiple solutions in multiple passes, using the tests to measure compliance and repair broken solutions. The resulting program won’t be streamed to you because it has existed for minutes as it goes through the cycle. It’s more for an asynchronous use-case.
I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.
Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.
Octoth0rpe 17 hours ago
> A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s)
There seems to be at least some detail on that point.
bilekas 6 hours ago
Where is a RTX 5060 Ti 16 GB 500$?
Edit : The 8GB seems to hit this price but 16 not so much.
hedgehog 5 hours ago
They were $450 or so until recently, now... good luck.
dwa3592 5 hours ago
I wonder if it's working out for the benchmark problems only?
One expensive and hard lesson we will learn over time is that you can't compress generality beyond a point.
bdbdbdb 11 hours ago
This is the kind of innovation I love to see. The big AI companies' days are numbered if we can have the same quality in house.
Aurornis 4 hours ago
This AI-written project is running its own LiveCodeBench on a completely different methodology. The AI-written notes even admit it:
> ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model or "pass@k-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.
Instead of following the LiveCodeBench methodology, it's a harness that spins up a sandbox and spends a long time testing and refining the solution. If you did the same for Sonnet, GPT5.4, or other models they would also get significantly higher scores and they'd do it faster.
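To see why a multi-attempt score is not comparable to single-shot pass@1, here is the basic arithmetic, assuming (unrealistically) that attempts are independent and that the harness can always identify a passing attempt when one exists:

```python
def at_least_one_passes(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes,
    given a per-attempt pass probability p."""
    return 1 - (1 - p) ** k


# A model that passes half the time single-shot...
print(at_least_one_passes(0.5, 1))  # 0.5
# ...looks far stronger when any of three attempts may count.
print(at_least_one_passes(0.5, 3))  # 0.875
```

Real attempts are correlated and selection is imperfect, so the actual gain is smaller, but the direction of the effect is the same for any model placed in such a harness.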
The AI-coded README is also full of signs of vibecoded slop like the discoveries that some of the complex structures implemented were not actually being used or contributing anything to the output.
0xbadcafebee 16 hours ago
This is specifically an experiment using ablation and multiple passes to improve the end result. Other techniques have been found that do this (like multiple passes through the same layers). But this technique, for this one specific model, seems to be more performant, though it also takes much longer and requires more complexity. It's unlikely most people would use this technique, but it's interesting.
Temporary_31337 11 hours ago
The headline is pretty stupid: it compares a model to a GPU that models run on. Somewhere in that data centre, some part of Sonnet inference runs on a $900 GPU or maybe an even cheaper Google tensor.
15minutemail 12 hours ago
74% on LCB from a single 5060 Ti. I've been paying Anthropic per task and this guy is running it on electricity money. 20 minutes per task is rough for anything interactive though.
subroutine 12 hours ago
At 20 min per task you might as well code it yourself. Bill James needs to write a book on saber-metrics for LLM benchmarks.
negativegate 20 hours ago
Am I still SOL on AMD (9070 XT) when it comes to this stuff?
0xbadcafebee 16 hours ago
No? You can run any model that fits in its VRAM, and you can run larger models with layer/MoE offloading. Ask an AI what the best models you can run on that card are, then ask it for newer models than that. Ask what tuning options to pass to llama.cpp, and what the auto-tuning options are. Use ROCm builds.
It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.
metalliqaz an hour ago
I've been running local models on my 9070XT and I have never found ROCm to be faster than Vulkan
patshead 18 hours ago
No, but yes? OmniCoder 9B at Q6 fits on my 9070 XT with 200k+ tokens of context, and it works pretty well with OpenCode. It is for sure the best local model that I've managed to squeeze onto my GPU, and it even works at 120k context at Q3 on an 8GB RX 580 GPU.
I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.
Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!
hrmtst93837 2 hours ago
You can cram absurd context into a card now, but none of that matters once you hit the VRAM wall and the whole thing slows to a crawl. Cloud is cheaper. Local still matters for privacy and weird adapter stuff, but 'usable for work' is a much higher bar than 'looks decent on benchmarks' when the task is chewing through a repo without latency going to hell.
dangus 20 hours ago
Well, this specific solution was only set up on specific hardware, and is Nvidia dependent, as the readme states.
That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.
Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.
dannyw 18 hours ago
Unfortunately AMD is much worse with supporting AI features like FSR4 on older hardware generations, despite the capability and leaked INT8 models being there. Totally unlike NVIDIA.
It’s absurd I have to use open source programs to get INT8 FSR4 support.
sznio 11 hours ago
On that topic, anyone here got a decent local coding AI setup for a 12GB VRAM system? I have a Radeon 6700 XT and would like to run autocomplete on it. I can fit some models in the memory and they run quick but are just a tad too dumb. I have 64GB of system ram so I can run larger models and they are at least coherent, but really slow compared to running from VRAM.
mongrelion 8 hours ago
Not the answer that you are looking for, but I am a fellow AMD GPU owner, so I want to share my experience.
I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.
I tried various AI + VRAM calculators but nothing was as on point as Huggingface's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fit in your card.
From the open source models out there, Qwen3.5 is the best right now. unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.
The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you would expect it to be.
Which model have you tried locally? Also, out of curiosity, what is your host configuration?
[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5
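A rough back-of-the-envelope version of the "does this quant fit in my card" check described above can be sketched in Python. This is only an approximation, not how Huggingface computes it: the function names and the 2 GB overhead for KV cache and buffers are assumptions made for illustration, and real GGUF quants store extra bits for scales, so actual files come out somewhat larger than the nominal bit width suggests.

```python
def quantized_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float,
    overhead_gb: float = 2.0,  # assumed allowance for KV cache and buffers
) -> bool:
    """True if the weights plus the assumed overhead fit in the given VRAM."""
    return quantized_weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    # A 9B model at a nominal 6 bits per weight on a 12 GB card.
    size = quantized_weight_gb(9, 6)
    print(f"weights: {size:.2f} GB, fits in 12 GB: {fits_in_vram(9, 6, 12)}")
```

By this estimate a 9B model at a nominal 6 bits per weight needs about 6.75 GB for weights, which leaves headroom on a 12 GB card, while the same model at 16 bits would not fit.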
kroaton 6 hours ago
For autocomplete, Qwen 3.5 9B should be enough even at Q4_K_M. The upcoming coding/math Omnicoder-2 finetune might be useful (should be released in a few days).
Either that or just load up Qwen3.5-35B-A3B-Q4_K_S. I'm serving it at about 40-50t/s on a 4070RTX Super 12GB + 64GB of RAM. The weights are 20.7GB + KV Cache (which should be lowered soon with the upcoming addition of TurboQuant).
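To put numbers on that setup: 20.7GB of weights cannot fit in 12GB of VRAM, so part of the model has to live in system RAM. The sketch below shows the arithmetic for a simple fill-the-GPU-first split. The `split_weights` helper and the 2 GB reserved for KV cache and buffers are assumptions for illustration, not how any particular runtime decides what to offload.

```python
def split_weights(weights_gb: float, vram_gb: float, reserved_gb: float) -> tuple[float, float]:
    """Return (gb_on_gpu, gb_in_system_ram) for a fill-the-GPU-first split."""
    # VRAM left for weights after setting aside room for KV cache and buffers.
    gpu_budget = max(vram_gb - reserved_gb, 0.0)
    on_gpu = min(weights_gb, gpu_budget)
    return on_gpu, weights_gb - on_gpu


if __name__ == "__main__":
    # 20.7 GB of weights, 12 GB card, 2 GB reserved (the reserve is an assumption).
    on_gpu, in_ram = split_weights(weights_gb=20.7, vram_gb=12.0, reserved_gb=2.0)
    print(f"{on_gpu:.1f} GB on GPU, {in_ram:.1f} GB in system RAM")
```

Under these assumptions roughly half of the weights end up in system RAM. That this still runs at usable speed is plausibly down to the mixture-of-experts design: the "A3B" in the model name suggests only about 3B parameters are active per token.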
mongrelion 2 hours ago
josefritzishere 6 hours ago
The core problem of AI remains unresolved, with no conceivable path to solvency. The issue is that AI isn't very good. It's OK sometimes, under very narrow criteria. But providing AI is, in reality, very costly. Vague promises of it magically becoming better remain very optimistic at best, and still provide no route to solvency.
superkuh 19 hours ago
If anyone else was hoping this was using Q8 internally and that, converted to Q4, it could fit in 12GB VRAM: unfortunately it's already at Q4_K_M (~9GB), and the 16GB requirement comes from other parts, not from the 14B@8bit+KV cache/etc. you might guess.
limoce 18 hours ago
The title should be "Adaptive Test-time Learning and Autonomous Specialization".
felixagentai 17 hours ago
[flagged]
dang 16 hours ago
We've banned this account. Please don't post automated comments to HN.
Razengan 12 hours ago
Claude Code has been bleh or meh at best in my experience. There's so many posts on HN fawning about it lately that it could only be a guerrilla marketing campaign.
maipen 9 hours ago
You still need to give it precise context and instructions when dealing with things that are not web apps or some other software cliche.
The reasoning is great in opus, unbeatable at the moment.
I understand what you mean, it becomes disappointing on more niche or specific work. It’s honestly a good thing to see these models are not really intelligent yet.
Razengan 8 hours ago
I still don't trust any AI enough to generate or edit code, except for some throwaway experiments, because every time I tried it's been inefficient or too verbose or just plain wrong.
I use it for reviewing existing code, specifically for a components-based framework for Godot/GDScript at [0]. You can view the AGENTS.md and see that it's a relatively simple enough project: Just for 2D games and fairly modular so the AI can look at each file/class individually and have to cross-reference maybe 1-3 dependencies/dependents at most at any time during a single pass.
I've been using Codex, and it's helped me catch a lot of bugs that would have taken a long time on my own to even notice at all. Most of my productivity and the commits from the past couple months are thanks to that.
Claude on the other hand, oh man… It just wastes my time. It's had way more gaffes than Codex, on the exact same code and prompts.
dr_kiszonka 2 hours ago
spiderfarmer 10 hours ago
"I don't get it. Everyone else is wrong."
Razengan 9 hours ago
"There's no such thing as astroturfing." ok
I use Codex regularly and Claude is shit in comparison, from its constant "Oops you're right!!" backtracking to its crap Electron app (if their AI is so good why can't they make a fucking native app for each OS?)
Hell right freakin now I asked it to implement something and got a weird "Something went wrong" API error