Performance per dollar is getting faster and cheaper (wafer.ai)
321 points by latchkey 20 hours ago
minraws 18 hours ago
Can you folks add performance per watt as a metric to these comparisons? I honestly want to understand where AMD fits in the stack in terms of actual performance per dollar. I have talked with companies that want to build data centers outside the US and find it hard to source anything Nvidia in sufficient capacity and scale.
It matters whether AMD is competitive on performance per watt and roughly reliable in terms of software support, which is what most folks outside the US prioritize above all else, since outside of China and the US electricity tends to be at a relative premium.
Maybe if they make smaller data centers viable at the right price, AMD could be part of the stack outside the US wherever Nvidia is more limited in supply. Though I have genuinely no idea what sourcing an AMD GPU looks like.
I have never seen a company use AMD outside of wafer and a couple others mostly in US.
Genuinely intriguing or maybe not really (could be this stuff is common knowledge) and I am just stuck in my Nvidia bubble here.
kingstnap 16 hours ago
A DGX B200 costs like ~$0.5 M and uses around 14 kW.
If you plan to run it straight for 8 years at 100% max usage, that's around 1 GWh.
A gigawatt hour is a lot of energy, but it's not that much compared to the price of the actual machine. In Germany, for example, with its expensive energy, that's about €100k worth, which spread over 8 years is pretty minor compared to the upfront half mill.
The real issue with high power consumption is not really the cost of energy but the limited power supply you can get for a datacenter. A more efficient setup is highly desirable because it means you can fit more in the limited power hookup.
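The arithmetic in this comment can be checked with a minimal sketch. The 14 kW draw, 8-year lifetime and 100% usage come from the comment; the flat 0.10 EUR/kWh price is an assumption back-derived from its "1 GWh is about €100k" figure, and the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap days


def lifetime_energy_kwh(power_kw: float, years: float, utilization: float = 1.0) -> float:
    """Energy drawn over the machine's life, in kWh."""
    return power_kw * HOURS_PER_YEAR * years * utilization


def energy_cost(energy_kwh: float, price_per_kwh: float) -> float:
    """Total electricity bill for the given energy at a flat price."""
    return energy_kwh * price_per_kwh


# Figures from the comment above: ~14 kW, 8 years, 100% usage.
energy = lifetime_energy_kwh(power_kw=14, years=8)

# ASSUMPTION: a flat 0.10 EUR/kWh, the rate implied by "1 GWh is about EUR 100k".
cost = energy_cost(energy, price_per_kwh=0.10)

print(f"{energy / 1e6:.2f} GWh, about EUR {cost:,.0f}")
```

Under those assumptions the total comes to just under 1 GWh and roughly €98k, which matches the comment's rounding. Cooling overhead and real tariff structures are not modeled here.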
minraws 3 hours ago
It's not even about the costs, getting enough power for a large datacenter is impractically hard in most of the world at a single location.
If it's efficient, and not just the ongoing power costs but also the upfront setup costs are lower, that makes many different scales of data center practical, especially for inference, which doesn't need massive superclusters.
You can't just fire up gas turbines everywhere like US data centers are doing. I am not even sure if that's legal in the US...
Note that you have to plan for peak usage and a lot of other things; large-scale data centers are insane infrastructure projects.
Nvidia is both supply and price constrained. Sure, if you are willing to pay over $0.5M you might get some, but if you try to balance price against costs by going slightly lower on the pole, you realize just how much more expensive Nvidia truly feels. AMD has a lot of margin to undercut them if they want to.
bayindirh 4 hours ago
> but the limited power supply you can get for a datacenter.
Since many people haven't seen 10MW cabling for a data center or how a big GPU server is cabled, they naturally imagine connecting servers is akin to plugging an appliance to a wall.
When the electricity provider says "I neither have the capacity, nor the required cables in that area", things get real.
willis936 2 hours ago
What they're really asking the authors is "can you not lie about performance cost and do proper accounting?". You can spin any story if you cherry pick your framing sufficiently. Stopping right at the silicon packaging boundary is as meaningless as it seems.
The article is highly qualified but the headline is not. If they are not making general statements then they shouldn't open with them.
dannyw 13 hours ago
It’s more than power supply. Cooling and ventilation become a MUCH bigger deal at rack scale, and that costs electricity too.
heisenbit 8 hours ago
Plus the power needed for cooling, which adds maybe 50%.
jwpapi 9 hours ago
Interesting so it’s supply chain and then you need to calculate how long it can be utilized and for how much you can sell it.
Would love more calculations on that
Twirrim 18 hours ago
> I have never seen a company use AMD outside of wafer and a couple others mostly in US.
There's a few using them, and even more starting to experiment with them. AMD has long been a source of disappointment around this side of things, so I'm hesitant to feel optimistic we'll finally get some competition. The market really needs viable competition to Nvidia, especially performance/watt.
craftkiller 18 hours ago
> I have never seen a company use AMD
Meta is using AMD: https://www.amd.com/en/newsroom/press-releases/2026-2-24-amd...
And OpenAI: https://www.amd.com/en/newsroom/press-releases/2025-10-6-amd...
minraws 3 hours ago
OpenAI maybe, but a few friends in Meta said they don't so dunno man. Seems sus atm.
But it's Meta; they can get a GW of AMD up in a year.
Schiendelman 17 hours ago
It's not clear when this will be - AMD has slipped these dates likely to 2027.
embedding-shape 9 hours ago
> I have never seen a company use AMD outside of wafer and a couple others mostly in US.
Worth remembering AMD has basically "owned" (not literally) the hardware side of things in video game consoles for a good many years now, with no end in sight.
minraws 3 hours ago
I was talking in the data center gpu context, EPYCs are pretty common in data centers these days.
I have a huge EPYC-based data center like 200-300+ km from my house, on the outskirts of the city, a few dozen miles from an IT industry tech park (a place with lots of IT company offices).
ekianjo 8 hours ago
Because they have x86 CPU licenses.
7thpower 12 hours ago
Typically any company that can’t get Nvidia to fill their orders will have at least some AMD.
embedding-shape 8 hours ago
What type of company are you talking about here? Granted, nowadays I mostly interact with ML-adjacent companies, but almost none would go "Hmm, hard to get nvidia hardware today, let's dump all the expertise and knowledge of CUDA et al we have and start using AMD hardware until we can get nvidia". Everyone would just wait or rent in the meantime.
latchkey 16 hours ago
> I have never seen a company use AMD outside of wafer and a couple others mostly in US.
Just because you haven't seen it doesn't mean it doesn't exist.
We've serviced over 700 customers on our MI300x.
technoabsurdist 17 hours ago
AMD MI355X uses 1,400W per GPU and NVIDIA B200 uses 1,200W. So AMD uses about 16% more power.
vlovich123 17 hours ago
That's not how you measure performance per watt, but generally it’s 20-60% worse at tok/s/watt, not 16%. It does have ~50% more memory (~100 GB more), which complicates the comparison.
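To make the distinction in this exchange concrete: rated power alone says nothing about performance per watt until throughput is divided by it. The sketch below uses the two power figures quoted above (1,400 W and 1,200 W); the throughput values are placeholders chosen for illustration, not measurements of either GPU.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """tok/s divided by W is tokens per joule, i.e. performance per watt."""
    return tokens_per_second / watts


def relative_efficiency(tps_a: float, watts_a: float, tps_b: float, watts_b: float) -> float:
    """Efficiency of device A as a fraction of device B's efficiency."""
    return tokens_per_joule(tps_a, watts_a) / tokens_per_joule(tps_b, watts_b)


MI355X_WATTS = 1400.0  # per-GPU figure quoted in the thread
B200_WATTS = 1200.0    # per-GPU figure quoted in the thread

# HYPOTHETICAL throughputs: assume both GPUs deliver the same tok/s.
same_tps = relative_efficiency(1000.0, MI355X_WATTS, 1000.0, B200_WATTS)

# HYPOTHETICAL: assume the MI355X delivers 20% more tok/s than the B200.
faster = relative_efficiency(1200.0, MI355X_WATTS, 1000.0, B200_WATTS)

print(f"equal throughput: {same_tps:.3f}x the B200's tok/s/W")
print(f"20% more throughput: {faster:.3f}x the B200's tok/s/W")
```

With equal throughput the higher power draw costs about 14% in tok/s/W, and with the assumed 20% throughput advantage the comparison flips slightly in AMD's favor. Which case applies depends entirely on measured throughput, which is the point of the reply.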
hassaanr 14 hours ago
While cool, quantization to FP4 is practically never lossless in actual use. A lot of providers are advertising high TPS on Kimi and GLM, but the models are functionally lobotomized and no longer close to frontier quality. Would love to see this not be true.
zozbot234 11 hours ago
Kimi uses INT4 as its native format, there's no such thing as "better than 4-bit precision" for that model. This is in contrast with GLM for which 16-bit precision is native and 8-bit is in common use.
hassaanr 7 hours ago
You’re right, but this poses a separate issue as the providers then do FP4 PTQ, which is quite lossy. Reduces the model size and optimizes for Blackwells at the (imo severe) cost of performance.
unrvl22 11 hours ago
MI355X can perform FP6 operations with the same speed as their FP4 (unique to AMD) - people should be making MXFP6 quants which would be pretty much lossless, and much closer to FP4 performance than FP8
Hugsun 3 hours ago
That can only be true if the workload is compute bound, not memory bandwidth bound.
minraws 2 hours ago
Doesn't Nvidia with their NVFP4 claim that it's lossless?
I haven't tested enough models Nvidia has converted to NVFP4 besides GLM 5.2 but it seemed fine to me.
My own luck has been hit or miss with it.
google234123 13 hours ago
First thing I noticed as well
tw1984 13 hours ago
from memory, it is like 96-98% of the accuracy.
lgessler 12 hours ago
Accuracy isn't a meaningful metric here without reference to a specific task.
EduardoBautista 11 hours ago
And that 2%-4% makes all the difference.
nxtfari 15 hours ago
I think we should make it illegal to not specify the quantization in the headline for these types of posts.
ahmadyan 15 hours ago
It's MXFP4
IshKebab 7 hours ago
And to use the heading "Why this matters".
ozgrakkurt 7 hours ago
A nice filter is checking for the `.ai` at the end. It is very likely slop if you see that. Slop meaning low-effort/clickbait/shallow/useless/scam etc.
48484949 5 hours ago
triggered the grifters
mchusma 2 hours ago
I was hoping they would be discussing some path to improving things faster and cheaper. But in this post it looks like they offer quantized version for the same price as full version, and a fast version at much higher cost.
sometimelurker an hour ago
I like the metric of tok/joule a lot. it really brings to mind a lot of really nice ideas about energy and work and ideas and thought and efficiency
gcanyon an hour ago
Isn't this pretty much a given? Performance per dollar has to be a ratcheting function because how would something more expensive replace something less expensive?
p1esk 17 hours ago
There was noticeable accuracy degradation when they switched from FP8 to MXFP4
greyb 14 hours ago
Wafer discontinued their own "Wafer Pass" flagship coding plan within weeks of launch and had to issue prorated refunds. Now they're bragging about squeezing costs down even further via quantization, even though their implementation is clearly lacking.
[1] https://www.ycombinator.com/launches/Q9i-wafer-pass-flat-rat...
throwdbaaway 16 hours ago
And somehow they claimed that it is "lossless".
ilaksh 3 hours ago
The compute-in-memory and neuromorphic paradigms are likely to push this much, much farther over the next decade as more radical improvements make it out of the lab. Sooner or later it will involve new materials and new nano devices and providing multiple orders of magnitude better efficiency. And just scaling up existing things like MRAM.
tim333 6 hours ago
Not a new phenomenon - performance per dollar has been growing fairly steadily and exponentially since 1900 or so
1900 - 2010 https://www.thekurzweillibrary.com/exponential-growth-of-com...
1939 - 2023 https://medium.com/@timventura/kurzweils-law-for-the-ai-age-...
Schiendelman 17 hours ago
I'm not surprised to see competition with Blackwell. Rubin is 5x faster than Blackwell at inference - Blackwell is the last generation Nvidia didn't optimize specifically for inference.
If I'm missing something, please let me know!
boroboro4 14 hours ago
It's very unclear what's special in Rubin that makes it optimized for inference. I can see the disaggregated bit (having separate prefill and decoding nodes), but what else?
villgax 13 hours ago
Lot more SMs & Tensor Cores for NVFP4 going by the looks of it.
nullc 16 hours ago
how do you get 5x faster at inference when inference is memory bandwidth limited? getting 5x the memory bandwidth of an H100 seems physically difficult.
Schiendelman 16 hours ago
Rubin has 22TB/s of memory bandwidth vs Blackwell's 8TB/s. NVLink 6 doubles interconnect speed. Plus they're moving to 3nm from ~4nm.
(Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)
unrvl22 11 hours ago
inference is only memory bandwidth limited when targeting higher tps / high single stream tps. the weights only need to be moved across once per forward pass, so when you batch, say, 100 streams per forward pass (which is what most inference services do / care about), it's compute bottlenecked.
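The batching argument above can be sketched as a simple roofline-style model: the weights stream from memory once per forward pass regardless of batch size, while compute grows with the number of streams. Every number below is a hypothetical placeholder except the 8 TB/s bandwidth figure quoted earlier in the thread, and the model deliberately ignores KV cache traffic, interconnect and kernel overheads.

```python
def step_time_s(batch: int, weight_bytes: float, flops_per_token: float,
                mem_bw: float, peak_flops: float) -> float:
    """Time for one decode step: limited by the slower of memory and compute."""
    memory_time = weight_bytes / mem_bw                   # weights read once per pass
    compute_time = batch * flops_per_token / peak_flops   # scales with batch size
    return max(memory_time, compute_time)


def crossover_batch(weight_bytes: float, flops_per_token: float,
                    mem_bw: float, peak_flops: float) -> float:
    """Batch size at which compute time equals memory time."""
    return (weight_bytes / mem_bw) * peak_flops / flops_per_token


# HYPOTHETICAL model and hardware figures, for illustration only.
WEIGHT_BYTES = 40e9      # bytes of weights touched per forward pass
FLOPS_PER_TOKEN = 80e9   # roughly 2 FLOPs per weight per token
MEM_BW = 8e12            # 8 TB/s, the Blackwell figure quoted in the thread
PEAK_FLOPS = 1e15        # assumed sustained compute rate

knee = crossover_batch(WEIGHT_BYTES, FLOPS_PER_TOKEN, MEM_BW, PEAK_FLOPS)
print(f"compute-bound above roughly {knee:.1f} streams")

for batch in (1, 16, 100):
    t = step_time_s(batch, WEIGHT_BYTES, FLOPS_PER_TOKEN, MEM_BW, PEAK_FLOPS)
    print(f"batch={batch:3d}  aggregate={batch / t:,.0f} tok/s  per-stream={1 / t:,.0f} tok/s")
```

Under these placeholder numbers a single stream is memory bound and a batch of 100 is compute bound, which is the shape of the claim in the comment. The actual crossover point depends on the hardware, the precision and the model.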
AussieWog93 18 hours ago
The 2600 tok/s is an "aggregate", not the actual throughput.
technoabsurdist 18 hours ago
yes it is 213 tok/s single stream (so per user)
unrvl22 11 hours ago
that 213 wasn't achieved when saturated though. was probably more like 30 tps per stream when doing 2.6k tps.
3836293648 17 hours ago
So per subagent*.
conorcleary 4 hours ago
*especially as many currencies weaken
johanvts 8 hours ago
That sounds literally impossible.
dtgriscom 4 hours ago
Agreed. The writer is pretty loose with their comparisons:
* What does it mean for "performance per dollar" to get faster? Higher, maybe; rise faster than it has in the past, maybe, but just "faster"? Nope.
* The article cites some equipment as being "2x cheaper". I think they mean "half the cost", but if so they should say it.
oDot 18 hours ago
Do these providers have 80+% gross margins or is something eating into them? Maybe utilization?
technoabsurdist 18 hours ago
hi i work at wafer. no the margins are lower averaging at about ~40%. utilization is one of the highest order bits in determining margins here, yes.
adammarples 7 hours ago
Slight criticism of the headline there, you can't get cheaper per dollar.
hahahaa 9 hours ago
What is a knee, in performance talk?
kgwgk 9 hours ago
A place where the slope/derivative/incremental-performance-per-price changes.
nnevatie 9 hours ago
I used to be high-performance like you, then I took an arrow to the knee?
alienbaby 16 hours ago
I'm interested if anyone knows how much legwork the assumed 60% cache hit, plus running a quantised model is doing? Esp. compared to what the headline half implies is a full fat GLM5.2
ilaksh 3 hours ago
Can you actually rent an MI355X per hour anywhere right now?
killingtime74 15 hours ago
No word on what this actually means as a consumer. What's the price. Is it lower than NVIDIA serving?
mixtureoftakes 13 hours ago
They seem to be serving it at 3x the price while also struggling with maintaining uptime on openrouter, while the vercel router advertises even bigger speeds but has no clear uptime stats
I guess you really do have to try it at least for some time to actually know
BurningFrog 3 hours ago
So... the headline is about performance per dollar per dollar?
beffjezos 14 hours ago
This is very interesting and yet not at the same time. This looks to be optimized for single-stream LLM traffic which is not viable to serve in a production setting. It's only interesting to hobbyists that want to run the model locally.
It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this but at the same token (pun-intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.
wmf 14 hours ago
You got it backwards; it's ~200 on single stream so the 2,600 is achieved with ~13 streams.
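The division behind the "~13 streams" estimate is sketched below. The 2,600 and ~200 tok/s figures come from the thread, as do the 213 and 30 tok/s per-stream figures mentioned in other comments; the function name is illustrative, and the model assumes every stream runs at the same rate.

```python
def implied_streams(aggregate_tps: float, per_stream_tps: float) -> float:
    """Number of equal-rate concurrent streams that add up to the aggregate."""
    return aggregate_tps / per_stream_tps


AGGREGATE_TPS = 2600.0  # total node throughput quoted in the thread

# Per-stream rates mentioned by different commenters.
for per_stream in (200.0, 213.0, 30.0):
    n = implied_streams(AGGREGATE_TPS, per_stream)
    print(f"{per_stream:5.0f} tok/s per stream -> about {n:.1f} streams")
```

The stream count is very sensitive to the per-stream rate assumed at saturation, which is why reporting tok/s per user alongside the node total removes the ambiguity.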
beffjezos 14 hours ago
Yeah that makes sense. I'm more familiar with seeing tok/s/user + TTFT rather than the total node throughput.
technoabsurdist 14 hours ago
hi yes it’s not optimized for single stream it’s optimized for total node throughput
beffjezos 14 hours ago
Oh, that's much better then. A good metric to share is the tokens per second per user for the node rather than the total throughput of the node. It disambiguates what's being optimized for much better than your blog post currently does.
gowthamsaiyadav 7 hours ago
world is not limited by Nvidia, AMD can be used
calin2k 10 hours ago
then why is token per dollar getting more expensive?
ilaksh 3 hours ago
There are a limited number of these available in comparison to demand. I think people figured out that LLMs and VLMs can do real work that can replace a lot of humans. And for plenty of jobs, it's good enough to reduce already outsourced staff by 75-90% at a fraction of the cost.
FeepingCreature 7 hours ago
Because lots of people are willing to pay more dollar for smarter token.
AtlasBarfed 10 hours ago
Because they are dumping/subsidizing token processing to try and get companies to fire as many people as possible, so those companies will be dependent on them when they have to jack up the rates
yieldcrv 18 hours ago
Agentic coding drivers for different architectures is a massive unlock for the world
So much compute is underutilized, waiting for a savant or company to prioritize an architecture, and now all the other engineers can tackle this at any time if they get inspired on the right prompts
technoabsurdist 17 hours ago
this is exactly our thesis at wafer :) thank you for the support
yieldcrv 13 hours ago
well done
yogthos 17 hours ago
Personally, I can't wait till something like this starts getting to consumer level. https://www.anuragk.com/blog/posts/Taalas.html
yieldcrv 17 hours ago
That’s pretty fascinating. Apple has some innocuous LLMs and transformers baked into its devices, leveraging their neural chipset
So I could see something like this, where the neural chipset has an LLM baked into it that can't be so easily updated until you get a new device
zuzululu 12 hours ago
yeah but we are still far far away from being able to run the frontier model equivalents locally without significant quantization
even having something like opus 4.8 locally would completely change the landscape
villgax 13 hours ago
They fail to mention non speculative numbers & whether baseline was nvfp4 as well. So much for erosion against an older gen
bitwize 10 hours ago
(in a high-pitched, pathetic regency-era British orphan voice) Please sir, may I have some compute as well?
shevy-java 10 hours ago
But RAM prices skyrocketed!
The AI companies owe us money. As does e.g. NVIDIA for becoming a cartel.