Anonymous request-token comparisons from Opus 4.6 and Opus 4.7 (tokens.billchambers.me)
401 points by anabranch 8 hours ago
andai 6 hours ago
For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.
Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.
I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.
On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)
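A toy cost model makes that tradeoff concrete. All prices (in $ per million tokens) and token counts below are hypothetical, chosen only to illustrate how raising input price while halving reasoning price flips depending on workload mix:

```python
# Toy cost model for the input-vs-reasoning tradeoff. All prices
# ($ per million tokens) and token counts are hypothetical.

def task_cost(input_tok, reasoning_tok, output_tok, price):
    """Dollar cost of one task under a given price table."""
    return (input_tok * price["input"]
            + reasoning_tok * price["reasoning"]
            + output_tok * price["output"]) / 1_000_000

old = {"input": 15, "reasoning": 30, "output": 75}  # hypothetical "4.6"
new = {"input": 20, "reasoning": 15, "output": 75}  # input up, reasoning halved

reasoning_heavy = (50_000, 200_000, 5_000)  # little input, lots of thinking
input_heavy = (500_000, 10_000, 5_000)      # big codebase, little thinking

for name, t in [("reasoning-heavy", reasoning_heavy),
                ("input-heavy", input_heavy)]:
    print(f"{name}: ${task_cost(*t, old):.2f} -> ${task_cost(*t, new):.2f}")
```

Under these made-up numbers the reasoning-heavy task gets cheaper and the input-heavy one gets more expensive; the real answer depends entirely on your workload's actual mix.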
matheusmoreira 3 hours ago
It thinks less and produces fewer output tokens because it has forced adaptive thinking that even API users can't disable. This is the same adaptive thinking that was causing quality issues in Opus 4.6 not even two weeks ago, the one bcherny recommended people disable because it would sometimes allocate zero thinking tokens to the model.
https://news.ycombinator.com/item?id=47668520
People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.
I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.
> What's the difference between this and option 1.(a) presented before?
> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.
> You were right to push back. I was wrong. Let me actually trace it properly this time.
> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.
It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.
Can provide session feedback IDs if needed.
codethief an hour ago
> > Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one) and 2) are somewhat standoff-ish, don't work well at all. You'll just have the model go the other way.
What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.
matheusmoreira an hour ago
rectang 2 hours ago
Are the benchmarks being used to measure these models biased towards completing huge and highly complex tasks, rather than ensuring correctness for less complex tasks?
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
matheusmoreira an hour ago
QuantumGood 3 hours ago
Some have defined "fair" as tests of the same model at different times, since a model's behavior and token usage change even when the version number stays the same. So the timing of tests matters, unfortunately, which means recent tests might not be comparable to future ones.
hgoel 7 hours ago
The bump from 4.6 to 4.7 is not very noticeable to me in improved capabilities so far, but the faster consumption of limits is very noticeable.
I hit my 5-hour limit within 2 hours yesterday. Initially I tried the batched mode for a refactor, but cancelled after seeing it take 30% of the limit within 5 minutes. A serial approach consumed less (took ~50 minutes at xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly consumed much faster than with 4.6.
It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.
For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.
matheusmoreira 2 hours ago
The most frustrating part is the quality loss caused by the forced adaptive thinking. It eats 5-10% of my Max 5x usage and churns for ten minutes, only to come back with totally untrustworthy results. It lazily hand-waves issues away in order to avoid reading my actual code and doing real reasoning work on it. Opus simply cannot be trusted if adaptive thinking is enabled.
_blk 6 hours ago
From what I understand you shouldn't wait more than 5min between prompts without compacting or clearing or you'll pay for reinitializing the cache. With compaction you still pay but it's less input tokens. (Is compaction itself free?)
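As a rough sketch of why letting the cache expire hurts: the write/read multipliers below are the commonly cited ones for Anthropic-style prompt caching (~1.25x base input to write, ~0.1x to read), but treat them and the base price as assumptions rather than current pricing:

```python
# Rough sketch: cost of resending conversation context when the prompt
# cache hits vs. when it has expired and must be rewritten. The base
# price and the ~1.25x write / ~0.1x read multipliers are assumptions.

BASE = 15 / 1_000_000          # hypothetical $ per input token
WRITE_MULT, READ_MULT = 1.25, 0.10

def turn_cost(context_tokens, cache_hit):
    """Input-side cost of one conversational turn."""
    mult = READ_MULT if cache_hit else WRITE_MULT
    return context_tokens * BASE * mult

ctx = 150_000  # accumulated conversation context, in tokens
print(f"cache hit:  ${turn_cost(ctx, True):.3f}")
print(f"cache miss: ${turn_cost(ctx, False):.3f}")  # 12.5x the hit cost
```

The ratio is what matters: with these multipliers a cold turn costs 12.5x a warm one, which is why waiting past the cache TTL between prompts gets expensive.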
krackers 2 hours ago
>pay for reinitializing the cache
Why can't they save the kv cache to disk then later reload it to memory?
stavros 25 minutes ago
gck1 4 hours ago
Cache ttl on max subscriptions is 1h, FYI.
bashtoni 2 hours ago
_blk 4 hours ago
conception 6 hours ago
Yeah, the caching change is probably 90% of the "I run out of usage so fast now!" issues.
hgoel 6 hours ago
Ah, I can see how my phrasing might be misleading, but these prompts were made within 5 minutes of each other; the timings I mentioned were how long Claude spent working.
trueno 4 hours ago
is it 5 mins between prompts during constant work, or 5 mins as in if I step away from the computer for 5 mins and come back and prompt again I'm not subject to reinit?
if it's the latter that's crazy. I don't even know what to do there; compactions already feel like a memory wipe
glerk 6 hours ago
I'd be ok with paying more if results were good, but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.
And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.
These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.
Bridged7756 6 hours ago
Mirrors my sentiment. Those tools seem mostly useful as a Google alternative, for scaffolding tedious things, for code review, and as a fancy search.
It seems that they got a grip on the "coding LLM" market and now they're starting to seek actual profit. I predict we'll keep seeing 40%+ more expensive models for a marginal performance gain from now on.
danny_codes 6 hours ago
I just don’t see how they’ll be able to make a profit. Open models have the same performance on coding tasks now. The incentives are all wrong. Why pay more for a model that’s no better and also isn’t open? It’s nonsense
braebo 3 hours ago
Bridged7756 2 hours ago
holoduke 3 hours ago
You have to guide an AI, not let it roam freely. If you have the skills to guide it, you can make it output high-quality work.
xpe 6 hours ago
> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.
This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.
My prior: it is 10X to 20X more likely that Anthropic has done something other than shift to a short-term squeeze-the-customers strategy (which I put at only around ~5%).
What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems, so model performance is degraded. (2) Another possibility is that they are not as tuned in to what customers want relative to what their engineers want. (3) It is also possible they have slowed their models down due to safety concerns; to be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns around Mythos). Also, these three possibilities are not mutually exclusive.
I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people can probably converge to something close to ±20%. At the very least, can we agree all of these factors are strong contenders: each covers maybe at least 10% to 30% of the probability space?
How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to adopt a "let's make our new models intentionally _worse_ than our previous ones" strategy? Think on this. I'm not necessarily "pro" Anthropic; they could lose standing with me over time, for sure. I'm willing to think it through. What would the world have to look like for this to be the case?
There are other factors that push back against claims of a "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.
Yes, I understand that Opus 4.7 is not impressing people or worse. I feel similarly based on my "feels", but I also know I haven't run benchmarks nor have I used it very long.
I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.
glerk 5 hours ago
> To claim to know a company's strategy as an outsider is messy stuff.
I said "it seems like". Obviously, I have no idea whether this is an intentional strategy or not and it could as well be a side effect of those things that you mentioned.
Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
kalkin 7 hours ago
AFAICT this uses a token-counting API to count the same prompt under both tokenizers, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper (it might still be more expensive), but it does mean this comparison isn't really very useful on its own.
h14h 7 hours ago
For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:
https://artificialanalysis.ai/?intelligence-efficiency=intel...
Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted whether output offsets input will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
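Treating the quoted deltas as approximate, the net works out like this (the numbers are just the ones above; your own input:output mix will shift the result):

```python
# Net change in benchmark-suite cost from the deltas quoted above:
# input cost up ~$800, output cost down ~$1400.
input_delta, output_delta = 800, -1400
net = input_delta + output_delta
print(f"net change: ${net}")  # prints: net change: $-600

# The sign flips once output savings shrink enough relative to the
# input bump: e.g. the same input increase with a third of the savings.
assert input_delta + output_delta / 3 > 0
```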
theptip 5 hours ago
This is the right way of thinking end-to-end.
Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
manmal 7 hours ago
Why is it not useful? Input token pricing is the same for 4.7. The same prompt costs roughly 30% more now, for input.
dktp 7 hours ago
The idea is that smarter models might use fewer turns to accomplish the same task - reducing the overall token usage
Though, from my limited testing, the new model is far more token hungry overall
manmal 7 hours ago
kalkin 7 hours ago
That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".
SkyPuncher 7 hours ago
Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster with fewer steps because thinking corrected itself before it cycled.
I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.
the_gipsy 6 hours ago
With AIs, it seems like there never is a comparison that is useful.
theptip 5 hours ago
You can build evals. Look at Harbor or Inspect. It’s just more work than most are interested in doing right now.
jascha_eng 6 hours ago
Yup, it's all vibes. And Anthropic is still winning on those in my book.
rectang 7 hours ago
For now, I'm planning to stick with Opus 4.5 as a driver in VSCode Copilot.
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.
axpy906 3 hours ago
Why not just use Sonnet?
rectang 2 hours ago
I've used Sonnet a lot. It is not as good as Opus at understanding what I'm asking for. I have to coach Sonnet more closely, taking more care to be precise in my prompts, and often building up Plan steps when I could just YOLO an Agent instruction at Opus and it would get it right.
I find that Opus is really good at discerning what I mean, even when I don't state it very clearly. Sonnet often doesn't quite get where I'm going and it sometimes builds things that don't make sense. Sonnet also occasionally makes outright mistakes, like not catching every location that needs to be changed; Opus makes nearly every code change flawlessly, as if it's thinking through "what could go wrong" like a good engineer would.
Sonnet is still better than older and/or less-capable models like GPT 4.1, Raptor mini (Preview), or GPT-5 mini, which all fail in the same way as Sonnet but more dramatically... but Opus is much better than Sonnet.
Recent full-powered GPTs (including the Codex variants) are competitive with Opus 4.6, but Opus 4.5 in particular is best in class for my workflow. I speculate that Opus 4.5 dedicates the most cycles out of all models to checking its work and ensuring correctness — as opposed to reaching for the skies to chase ambitious, highly complex coding tasks.
trueno 4 hours ago
> 4.7 is going to replace both 4.5 and 4.6
as in 4.5 is no longer going to be avail? F.
ive also been sticking with 4.5 that sucks
rectang 3 hours ago
https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-...
> Over the coming weeks, Opus 4.7 will replace Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+[...]
> This model is launching with a 7.5× premium request multiplier as part of promotional pricing until April 30th.
gsleblanc 6 hours ago
It's looking increasingly naive to assume that scaling LLMs is all you need to get to full white-collar worker replacement. The attention mechanism / Hopfield network fundamentally models only a small subset of the full human brain, and all the sustained hype around bolted-on solutions for "agentic memory" is, in my opinion, glaring evidence that these SOTA transformers alone aren't sufficient even when you limit the space to text. Maybe I'm just parroting Yann LeCun.
mohamedkoubaa 2 hours ago
I think they're as good as they're going to get from scaling. They can still get more efficient, and tooling/harnesses around them will improve.
ACCount37 5 hours ago
You probably are.
The "small subset" argument is profoundly unconvincing, and inconsistent with both neurobiology of the human brain and the actual performance of LLMs.
The transformer architecture is incredibly universal and highly expressive. Transformers power LLMs, video generator models, audio generator models, SLAM models, entire VLAs and more. It's not a 1:1 copy of the human brain, but that doesn't mean it's incapable of reaching functional equivalence. The human brain isn't the only way to implement general intelligence, just the one that was easiest for evolution to put together out of what it had.
LeCun's arguments that "LLMs can't do X" keep being proven wrong empirically. Even on ARC-AGI-3, a benchmark specifically designed to be adversarial to LLMs and to target the weakest capabilities of off-the-shelf LLMs, no other class of AI beats LLMs.
bigyabai 5 hours ago
> Human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.
The human brain is not a pretrained system. It's objectively more flexible than transformers and capable of self-modulation in ways that no ML architecture can replicate (that I'm aware of).
ACCount37 5 hours ago
aerhardt 6 hours ago
> you just limit the space to text
And even then... why can't they write a novel? Or lowering the bar, let's say a novella like Death in Venice, Candide, The Metamorphosis, Breakfast at Tiffany's...?
Every book's in the training corpus...
Is it just a matter of someone not having spent a hundred grand in tokens to do it?
voxl 6 hours ago
I know someone who spends basically every day writing personal fan fiction stories using every model you can find. She doesn't want to share it, and does complain about it a lot; it seems maintaining consistency for something, say, 100 pages long is difficult.
conception 6 hours ago
I don’t understand - there are hundreds/thousands of AI written books available now.
aerhardt 6 hours ago
zozbot234 5 hours ago
Never mind novels, it can't even write a good Reddit-style or HN-style comment. agentalcove.ai has an archive of AI models chatting to one another in "forum" style and even though it's a good show of the models' overall knowledge the AIisms are quite glaring.
mh- 4 hours ago
colechristensen 6 hours ago
Who says they can't? What's your bar that needs to be passed in order for "written a novella" to be achieved?
There's a lot of bad writing out there, I can't imagine nobody has used an LLM to write a bad novella.
aerhardt 6 hours ago
tiffanyh 7 hours ago
I was using Opus 4.7 just yesterday to help implement best practices on a single page website.
After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.
The entire HTML/CSS/JS was less than 300 lines of code.
I was shocked how fast it exhausted my usage limits.
hirako2000 7 hours ago
I haven't used Claude, because I suspected this sort of thing was coming.
With an enterprise subscription the bill gets bigger, but it's not as if a VP can easily send a memo to all staff that a migration is coming.
Individuals may end their subscriptions; that would ease the datacenter usage and turn profits up.
fooster 3 hours ago
Sorry you are missing out. I use claude all day every day with max and what people are reporting here has not been my experience. My current usage is 16% and it resets Thursday.
zaptrem 6 hours ago
What's your reasoning effort set to? Max now uses way more tokens and isn't suggested for most usecases. Even the new default (xhigh) uses more than the old default (medium).
nixpulvis 2 hours ago
That's what I'm wondering. Is it people are defaulting to xhigh now and that's why it feels like it's consuming a lot more tokens? If people manually set it to medium, would it be comparable?
sync 7 hours ago
Which plan are you on? I could see that happening with Pro (which I think defaults to Sonnet?), would be surprised with Max…
templar_snow 7 hours ago
It eats even the Max plan like crazy.
tiffanyh 7 hours ago
Pro. It even gave me $20 free credits, and exhausted free credits nearly instantly.
tomtomistaken 7 hours ago
Are you using Claude subscription? Because that's not how it works there.
someuser54541 8 hours ago
Should the title here be 4.6 to 4.7 instead of the other way around?
freak42 8 hours ago
absolutely!
UltraSane 8 hours ago
Writing Opus 4.6 to 4.7 does make more sense for people who read left to right.
pixelatedindex 7 hours ago
I’m impressed with anyone who can read English right to left.
jlongman 7 hours ago
einpoklum 7 hours ago
embedding-shape 7 hours ago
But the page is not in a language that should be read right to left, doesn't that make that kind of confusing?
usrnm 7 hours ago
bee_rider 7 hours ago
Err, how so?
hereme888 6 hours ago
> Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future.
bertil 5 hours ago
My impression is that the quality of the conversation is unexpectedly better: more self-critical, the suggestions are always critical, the default choices constantly best. I might not have as many harnesses as most people here, so I suspect it’s less obvious but I would expect this to make it far more valuable for people who haven’t invested as much.
After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, while 4.6 was good, but far more likely to be a foot-gun.
npollock 4 hours ago
You can configure the status line to get a feel for token usage:
[Opus 4.6] 3% context | last: 5.2k in / 1.1k out
add this to .claude/settings.json:
  "statusLine": {
    "type": "command",
    "command": "jq -r '\"[\\(.model.display_name)] \\(.context_window.used_percentage // 0)% context | last: \\(((.context_window.current_usage.input_tokens // 0) / 1000 * 10 | floor / 10))k in / \\(((.context_window.current_usage.output_tokens // 0) / 1000 * 10 | floor / 10))k out\"'"
  }
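For anyone who wants to check what that filter produces, here's a Python equivalent; the field names are taken from the jq command itself, but the exact payload shape is an assumption for illustration:

```python
# Mimics the jq status-line filter: [model] N% context | last: Xk in / Yk out.
# Field names come from the jq command; the payload shape is an assumption.

def status_line(payload: dict) -> str:
    model = payload["model"]["display_name"]
    cw = payload.get("context_window", {})
    pct = cw.get("used_percentage", 0)
    usage = cw.get("current_usage", {})

    def k(n):  # thousands with one decimal place, floored like the jq version
        return int(n / 1000 * 10) / 10

    return (f"[{model}] {pct}% context | "
            f"last: {k(usage.get('input_tokens', 0))}k in / "
            f"{k(usage.get('output_tokens', 0))}k out")

sample = {"model": {"display_name": "Opus 4.6"},
          "context_window": {"used_percentage": 3,
                             "current_usage": {"input_tokens": 5234,
                                               "output_tokens": 1100}}}
print(status_line(sample))  # [Opus 4.6] 3% context | last: 5.2k in / 1.1k out
```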
dakiol 7 hours ago
We dropped Claude. It's pretty clear this is a race to the bottom, and we don't want a hard dependency on another multi-billion dollar company just to write software.
We'll be keeping an eye on open models (of which we already make good use). I think that's the way forward. Actually, it would be great if everybody put more focus on open models; perhaps we could come up with something like the "linux/postgres/git/http/etc" of LLMs: something we all benefit from without it being monopolized by a single billionaire's company. Wouldn't it be nice if we didn't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.
ahartmetz 7 hours ago
>we don't want a hard dependency on another multi-billion dollar company just to write software
One of two main reasons why I'm wary of LLMs. The other is fear of skill atrophy. These two problems compound. Skill atrophy is less bad if the replacement for the previous skill does not depend on a potentially less-than-friendly party.
post-it 7 hours ago
I was worried about skill atrophy. I recently started a new job, and from day 1 I've been using Claude. 90+% of the code I've written has been with Claude. One of the earlier tickets I was given was to update the documentation for one of our pipelines. I used Claude entirely, starting with having it generate a very long and thorough document, then opening up new contexts and getting it to fact check until it stopped finding issues, and then having it cut out anything that was granular/one query away. And then I read what it had produced.
It was an experiment to see if I could enter a mature codebase I had zero knowledge of, look at it entirely through an AI, and come to understand it.
And it worked! Even though I've only worked on the codebase through Claude, whenever I pick up a ticket nowadays I know what file I'll be editing and how it relates to the rest of the code. If anything, I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding.
estetlinus 6 hours ago
root_axis 4 hours ago
Ifkaluva 4 hours ago
SpicyLemonZest 6 hours ago
viccis 3 hours ago
ljm 7 hours ago
Not so much atrophy as apathy.
I've worked with people who will look at code they don't understand, say "llm says this", and express zero intention of learning something. Might even push back. Be proud of their ignorance.
It's like, why even review that PR in the first place if you don't even know what you're working with?
psygn89 6 hours ago
oremj 6 hours ago
kilroy123 6 hours ago
RexM 6 hours ago
monkpit 6 hours ago
redanddead 6 hours ago
tossandthrow 7 hours ago
You can argue that you will have skill atrophy by not using LLMs.
We have gone multi cloud disaster recovery on our infrastructure. Something I would not have done yet, had we not had LLMs.
I am learning at an incredible rate with LLMs.
mgambati 7 hours ago
ori_b 7 hours ago
weego 5 hours ago
Wowfunhappy 5 hours ago
bluefirebrand 7 hours ago
jjallen 7 hours ago
i_love_retros 7 hours ago
deadbabe 7 hours ago
solarengineer 6 hours ago
https://hex.ooo/library/power.html
When future humans rediscover mathematics.
IgorPartola 4 hours ago
Yeah I am worried about skill atrophy too. Everyone uses a compiler these days instead of writing assembly. Like who the heck is going to do all the work when people forget how to use the low level tools and a compiler has a bug or something?
And don’t get me started on memory management. Nobody even knows how to use malloc(), let alone brk()/mmap(). Everything is relying on automatic memory management.
I mean when was the last time you actually used your magnetized needle? I know I am pretty rusty with mine.
otabdeveloper4 4 hours ago
techpression 4 hours ago
dgellow 6 hours ago
Another aspect I haven’t seen discussed too much is that if your competitor is 10x more productive with AI, and to stay relevant you also use AI and become 10x more productive. Does the business actually grow enough to justify the extra expense? Or are you pretty much in the same state as you were without AI, but you are both paying an AI tax to stay relevant?
xixixao 6 hours ago
This is the “ad tax” reasoning, but ultimately I think the answer is greater efficiency. So there is a real value, even if all competitors use the tools.
It’s like saying clothing manufacturers are paying a “loom tax” when they could have been weaving by hand…
SlinkyOnStairs 6 hours ago
bigbadfeline 5 hours ago
redanddead 6 hours ago
The alternative is probably also true. If your F500 competitor is also handicapped by AI somehow, then you're all stagnant, maybe at different levels. Meanwhile Anthropic is scooping up software engineers it supposedly made irrelevant with Mythos and moving into literally 2+ new categories per quarter
dakiol 5 hours ago
Where's the evidence of competitors being 10x more productive? So far, everyone is simply bragging about how much code they have shipped last week, but that has zero relevance when it comes to productivity
davidron 2 hours ago
dgellow 4 hours ago
Silhouette 4 hours ago
Lihh27 6 hours ago
It's worse than a tie: 10x-ing everyone just floods the market and tanks per-unit prices. You pay the AI tax and your output is worth less.
JambalayaJimbo 6 hours ago
If the business doesn’t grow then you shed costs like employees
senordevnyc 6 hours ago
Either the business grows, or the market participants shed human headcount to find the optimal profit margin. Isn’t that the great unknown: what professions are going to see headcount reduction because demand can’t grow that fast (like we’ve seen in agriculture), and which will actually see headcount stay the same or even expand, because the market has enough demand to keep up with the productivity gains of AI? Increasingly I think software writ large is the latter, but individual segments in software probably are the former.
otabdeveloper4 4 hours ago
> your competitor is 10x more productive with AI
This doesn't happen. Literally zero evidence of this.
dgellow 4 hours ago
michaelje 6 hours ago
Open models keep closing the eval gap for many tasks, and local inference continues to be increasingly viable. What's missing isn't technical capability, but productized convenience that makes the API path feel like the only realistic option.
Frontier labs are incentivized to keep it that way, and they're investing billions to make AI = API the default. But that's a business model, not a technical inevitability.
trueno 4 hours ago
I'm hoping and praying that local inference finds its way to some sort of baseline for what we're all depending on Claude for here. That would help shape hardware designs on personal devices, probably something in the direction of what Apple has been doing.
I've had to tune out of the LLM scene because it's just a huge mess. It feels impossible to actually get benchmarks, it's insanely hard to get a grasp on what everyone is talking about, bots galore championing whatever model; it's just way too much craze and hype and misinformation. What I do know is we can't keep draining lakes with datacenters and letting companies that are willing to heel-turn on a whim control the output of all companies. That's not going to work; we collectively have to find a way to make local inference the path forward.
Everyone's foot is on the gas: all orgs, all execs, all people working jobs. There's no putting this stuff down, and it's exhausting, but we have to be using Claude like _right now_. Pretty much every company is already completely locked in to OpenAI/Gemini/Claude, and for some unfortunate ones Copilot. This was a utility-style vendor lock-in capture that happened faster than anything I've ever seen in my life, and I'm already desperate for a way to get my org out of it.
hakfoo 3 hours ago
dewarrn1 6 hours ago
I'm hopeful that new efficiencies in training (Deepseek et al.), the impressive performance of smaller models enhanced through distillation, and a glut of past-their-prime-but-functioning GPUs all converge to make good-enough open/libre models cheap, ubiquitous, and less resource-intensive to train and run.
tossandthrow 7 hours ago
The lock-in is incredibly weak. I could switch to whatever provider in minutes.
But it requires that one does not do something stupid.
Eg. For recurring tasks: keep the task specification in the source code and just ask Claude to execute it.
The same with all documentation, etc.
leonidasv 5 hours ago
>perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs
I fear that this may not be feasible in the long term. The open-model free ride is not guaranteed to continue forever; right now some labs offer models for free for publicity after raising millions in VC money, but that's not a sustainable business model. Models cost millions or billions in infrastructure to train. It's not like open-source software where people can volunteer their time for free; here we are talking about spending real money upfront on something that will get obsolete in months.
Current AI model "production" is more akin to an industrial endeavor than open-source arrangements we saw in the past. Until we see some breakthrough, I'm bearish on "open models will eventually save us from reliance on big companies".
falkensmaize 4 hours ago
"get obsolete in months"
If you mean obsolete in the sense of "no longer fit for purpose", I don't think that's true. They may become obsolete in the sense of "can't do the hottest new thing", but that's true of pretty much any technology. A capable local model that can do X will always be able to do X; it just may not be able to do Y. But if X is good enough to solve your problem, why is a newer, better model needed?
I think if we were able to achieve ~Opus 4.6 level quality in a local model that would probably be "good enough" for a vast number of tasks. I think it's debatable whether newer models are always better - 4.7 seems to be somewhat of a regression for example.
aliljet 7 hours ago
What open models are truly competing with both Claude Code and Opus 4.7 (xhigh) at this stage?
parinporecha 6 hours ago
I've had a good experience with GLM-5.1. Sure, it doesn't match xhigh, but it comes close to 4.6 at a third of the cost.
Someone1234 6 hours ago
That's a lame attitude. There are local models that are last year's SOTA, but that's not good enough because this year's SOTA is better still...
I've said it before and I'll say it again: local models are "there" in terms of truly productive usage for complex coding tasks. Like, for real, there.
The issue right now is that buying the compute to run the top-end local models is absurdly unaffordable, both in general and because you're outbidding LLM companies for limited hardware resources.
With a $10K budget, you can legit run last year's SOTA agentic models locally and do hard things well. But most people don't or won't, nor does it make cost-effective sense vs. currently subsidized API costs.
gbro3n 6 hours ago
aliljet 6 hours ago
HWR_14 5 hours ago
wellthisisgreat 5 hours ago
esafak 6 hours ago
GLM 5.1 competes with Sonnet. I'm not confident about Opus, though they claim it matches that too.
ojosilva 5 hours ago
GaryBluto 6 hours ago
> open models
Google just released Gemma 4, perhaps that'd be worth a try?
ben8bit 7 hours ago
Any recommendations on good open ones? What are you using primarily?
culi 6 hours ago
LMArena actually has a nice Pareto frontier of ELO vs. price for this
model elo $/M
---------------------------------------
glm-5.1 1538 2.60
glm-4.7 1440 1.41
minimax-m2.7 1422 0.97
minimax-m2.1-preview 1392 0.78
minimax-m2.5 1386 0.77
deepseek-v3.2-thinking 1369 0.38
mimo-v2-flash (non-thinking) 1337 0.24
https://arena.ai/leaderboard/code?viewBy=plot&license=open-s...
logicprog 5 hours ago
blahblaher 7 hours ago
Qwen 3.5/3.6 (30B) works well, locally, with opencode
zozbot234 7 hours ago
pitched 7 hours ago
equasar 2 hours ago
jherdman 7 hours ago
cpursley 7 hours ago
cmrdporcupine 7 hours ago
GLM 5.1 via an infra provider. Running a competent, coding-capable model yourself isn't viable unless your standards are quite low.
myaccountonhn 6 hours ago
DeathArrow 6 hours ago
I am using GLM 5.1 and MiniMax 2.7.
sourya4 4 hours ago
Yep!! I had similar thoughts on the "linux/postgres/git/http/etc" of the LLMs.
I made an HN post of my X article on the lock-in factor and how we should embrace the modular Unix philosophy as a way out: https://news.ycombinator.com/item?id=47774312
crgk 5 hours ago
Who’s your “we,” if you don’t mind sharing? I’m curious to learn more about companies/organizations with this perspective.
finghin 5 hours ago
I’m imagining a (private/restricted) tracker style system where contributors “seed” compute and users “leech”.
atleastoptimal 4 hours ago
Open models are only near SOTA because of distillation from closed models.
i_love_retros 7 hours ago
> we don't want a hard dependency on another multi-billion dollar company just to write software
My manager doesn't even want us to use Copilot locally. Now we are supposed to use only the GitHub Copilot cloud agent: one shot from prompt to PR. With people like that selling vendor lock-in for them, companies like GitHub, OpenAI, and Anthropic don't even need sales and marketing departments!
tossandthrow 7 hours ago
You are aware that using e.g. GitHub Copilot is not one-shot? It will start an agentic loop.
dgellow 6 hours ago
giancarlostoro 5 hours ago
> I think that's the way forward. Actually it would be great if everybody would put more focus on open models,
I'm still surprised top CS schools are not investing in having their students build models. I know some are, but when's the last time we talked about a model made not by some company but by a college or university, maintained by the university and useful to all?
It's disgusting that OpenAI still calls itself "Open AI" when they aren't truly open.
Frannky 6 hours ago
Opencode with open models is pretty good.
sergiotapia 6 hours ago
I can recommend this stack. It works well with the existing Claude skills I had in my code repos:
1. Opencode
2. Fireworks AI: GLM 5.1
And it is SIGNIFICANTLY cheaper than Claude. I'm waiting eagerly for something new from Deepseek. They are going to really show us magic.
dirasieb 6 hours ago
it is also significantly less capable than claude
dakiol 5 hours ago
OrvalWintermute 5 hours ago
I'm increasingly thinking the same as our spend on tokens goes up.
If you have HPC or supercompute already, you have much of the expertise on staff to run models locally, and between Apple Silicon and Exo there are some amazing solutions out there.
Now, if only the rumors about Exo expanding to Nvidia were true...
somewhereoutth 5 hours ago
My understanding is that the major part of the cost of a given model is the training, so open models depend on the training that was done for frontier models? I'm finding it hard to imagine (e.g.) RLHF being fundable through a free-software-type arrangement.
zozbot234 5 hours ago
No, the training between proprietary and open models is completely different. The speculation that open models might be "distilled" from proprietary ones is just that, speculation, and a large portion of it is outright nonsense. It's physically possible to train on chat logs from another model but that's not "distilling" anything, and it's not even eliciting any real fraction of the other model's overall knowledge.
tehjoker 4 hours ago
DeathArrow 6 hours ago
>perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs: something we all can benefit from while it not being monopolized by a single billionarie company
Training and inference costs so we would have to pay for them.
groundzeros2015 6 hours ago
Developing linux/postgres/git also costs, and so do the computers and electricity they use.
SilverElfin 6 hours ago
Is that why they are racing to release so many products? It feels to me like they want to suck up the profits from every software vertical.
Bridged7756 6 hours ago
Yeah, it seems so. Anthropic has entered the enshittification phase: they got people hooked on their SOTAs, so it's now time to keep releasing marginal-improvement models at a 40% higher token price. The problem is that both Anthropic and OpenAI have no income other than AI. Can't Google just drown them out with cheaper prices over the long run? It seems like a battle of attrition to me.
wahnfrieden 4 hours ago
or just use codex
sky2224 5 hours ago
This is part of the reason why I'm really worried that this is all going to result in a greater economic collapse than I think people are realizing.
I think companies that are shelling out the money for these enterprise accounts could honestly just buy some H100 GPUs and host the models themselves on premises. GitHub Copilot enterprise charges $40 per user per month (this can vary depending on your plan, of course), but at that price, 1000 users comes out to $480,000 a year. Maybe I'm missing something, but that's roughly what you'd spend to get a full-fledged LLM hosting setup.
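As a quick sanity check on that arithmetic, a tiny sketch (the $40/seat figure is from above; the helper function is purely illustrative):

```python
# Back-of-envelope seat cost: $40/user/month (the Copilot enterprise figure
# cited above) across 1000 users for a year.

def annual_seat_cost(per_user_month: float, users: int, months: int = 12) -> float:
    """Total yearly spend for per-seat pricing."""
    return per_user_month * users * months

total = annual_seat_cost(40, 1000)
print(total)  # 480000
```

Whether $480k actually buys comparable self-hosted capacity depends on concurrency and utilization, which the replies below dig into.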
subarctic 4 hours ago
Out of curiosity, how many concurrent users could you get with a hosting setup at that price? If let's say 10% of those 1000 users were using it at the same time would it handle it? What about 30% or 100%?
merlinoa 4 hours ago
Most companies don't want to host it themselves. They want someone to do it for them, and they are happy to pay for it. If it makes their lives easier and does not add complexity, then it has a lot of value.
couchdb_ouchdb 6 hours ago
Comments here overall do not reflect my experience. I'm puzzled by how the vast majority are using this technology day to day; 4.7 is absolute fire and an upgrade over 4.6.
Gareth321 3 hours ago
I suspect the distinction is API vs subscription. The app has some kind of very restrictive system prompt which appears to heavily restrict compute without some creative coaxing. API remains solid. So if you're using OpenCode or some other harness with an API key, that's why you're still having a good time.
autoconfig 7 hours ago
My initial experience with Opus 4.7 has been pretty bad and I'm sticking with Codex. But these results are meaningless without comparing outcomes: whether the extra token burn is bad or not depends on whether it improves some quality / task-completion metric. Am I missing something?
zuzululu 6 hours ago
Same, I was excited about 4.7, but I'm seeing more anecdotes suggesting it's not a big enough boost to justify the extra tokens.
flatino
Sticking with Codex. Also, GPT 5.5 is set to come next week.
templar_snow 7 hours ago
Brutal. I've been noticing that 4.7 eats my Max subscription like crazy, even when I do my best to juggle tasks between (or tell 4.7 to use subagents with) Sonnet 4.6 Medium and Haiku. Would love to know if anybody's found ideal token-saving approaches.
copperx 6 hours ago
I haven't seen a noticeable difference, BUT I've always been using the context mode plugin.
templar_snow 5 hours ago
You mean this? https://github.com/mksglu/context-mode Is it actually good or is this an ad?
FireBeyond 6 hours ago
What plugin is this?
vidarh 6 hours ago
tailscaler2026 7 hours ago
Subsidies don't last forever.
pitched 7 hours ago
Running an open model like Kimi constantly for an entire month will cost around $100-200, roughly equal to a pro-tier subscription. This is not my estimate, so I'm more than open to hearing refutations. Kimi isn't at all Opus-level intelligent, but the models are roughly evenly sized from the guesses I've seen. So I don't think it's the infra being subsidized as much as the training.
nothinkjustai 7 hours ago
Kimi costs $0.30/$1.72 on OpenRouter; $200 at those rates gives you way more than you would get out of a $200 Claude subscription. There are also various subscription plans you can use to spend even less.
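For scale, a rough cost sketch assuming those OpenRouter figures are per-million-token input/output rates; the monthly token volumes below are made-up assumptions for illustration:

```python
# Monthly API spend at assumed rates of $0.30/M input and $1.72/M output tokens.
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_rate: float = 0.30, out_rate: float = 1.72) -> float:
    """Dollar cost given millions of input/output tokens used in a month."""
    return input_mtok * in_rate + output_mtok * out_rate

# e.g. a hypothetical heavy agentic month: 500M input + 50M output tokens
print(round(monthly_cost(500, 50), 2))  # 236.0
```

Even heavy input-dominated agent workloads stay near the low hundreds of dollars at these rates, which is the comparison being made against a $200 subscription.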
varispeed 6 hours ago
How do you get anything sensible out of Kimi?
senordevnyc 6 hours ago
I’m using Composer 2, Cursor’s model they built on top of Kimi, and it’s great. Not Opus level, but I’m finding many things don’t need Opus level.
RevEng 34 minutes ago
smt88 7 hours ago
Tell that to oil and defense companies.
If tech companies convince Congress that AI is an existential issue (in defense or even just productivity), then these companies will get subsidies forever.
andai 7 hours ago
Yeah, USA winning on AI is a national security issue. The bubble is unpoppable.
And shafting your customers too hard is bad for business, so I expect only moderate shafting. (Kind of surprised at what I've been seeing lately.)
danny_codes 5 hours ago
gadflyinyoureye 7 hours ago
I've been assuming this for a while. If I have a complex feature, I use Opus 4.6 in Copilot to plan (3 units of my monthly limit), then have Grok or Gemini (0.25-0.33 of my monthly units) implement and verify the work. 80% of the time, it works every time. Leaves me plenty of usage over the month.
sgc 4 hours ago
I have a very newcomer-type question. What is the output format of your plan such that you can break context and get the other LLM to produce satisfactory results? What level of detail is in the plan: bullet points, pseudo-code, or somewhere in the middle?
andai 7 hours ago
Yeah I've been arriving at the same thing. The other models give me way more usage but they don't seem to have enough common sense to be worth using as the main driver.
If I can have Claude write up the plan, and the other models actually execute it, I'd get the best of both worlds.
(Amusingly, I think Codex tolerates being invoked by Claude (de facto tolerated ToS violation), but not the other way around.)
zozbot234 5 hours ago
anabranch 8 hours ago
I wanted to better understand the potential impact of the tokenizer change from 4.6 to 4.7.
I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.
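A tiny sketch of the comparison being made here. The token counts are hypothetical placeholders standing in for what `/v1/messages/count_tokens` would return for each model on the same text:

```python
# Percentage inflation between two tokenizers' counts for the same text.
def inflation_pct(old_count: int, new_count: int) -> float:
    """How much larger the new count is, as a percentage of the old count."""
    return (new_count - old_count) / old_count * 100

# Hypothetical counts for one prompt under Opus 4.6 vs. Opus 4.7:
print(round(inflation_pct(1000, 1450), 2))  # 45.0
```

The documented ~1.35x ceiling corresponds to 35%; the 45% figure here is for small prompts, where the ratio can run higher.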
pawelduda 7 hours ago
Not very encouraging for longer use, especially since the longer the conversation, the higher the chance the agent goes off the rails.
throwatdem12311 6 hours ago
Price is now getting more in line with the actual cost. The models are dumber, slower, and more expensive than what we've been paying for up until now. OpenAI will do it too, maybe a bit less, to avoid pissing people off after seeing the backlash to Anthropic's move here. Or maybe they won't make it dumber, but they'll increase the price while making a dumber mode the baseline so you're encouraged to pay more. The free ride is over. Hope you have $30k burning a hole in your pocket to buy a beefy machine to run your own model. I hear Mac Studios are good for local inference.
KellyCriterion 7 hours ago
Yesterday, I killed my weekly limit with just three prompts and went into extra usage for ~18USD on top
atleastoptimal 4 hours ago
The whole version naming for models is very misleading. 4 and 4.1 seem to come from a different "line" than 4.5 and 4.6, and likewise 4.7 seems like a new shape of model altogether. They aren't linear stepwise improvements, but I think overall 4.7 is generally "smarter" just based on conversational ability.
fathermarz 6 hours ago
I have been seeing this messaging everywhere and I have not noticed it myself; I've had the inverse experience, with 4.7 over 4.6.
I think people aren't reading the system cards when they come out. They explicitly explain that your workflow needs to change. They added more levels of effort, and I see no mention of that in this post.
Did y'all forget Opus 4? That was not that long ago, and Claude was essentially unusable then. We are at peak wizardry right now and no one is talking positively. It's all doom and gloom around here these days.
gck1 15 minutes ago
> They explicitly explain your workflow needs to change
How about - don't break my workflow unless the change is meaningful?
While we're at it, either make y in x.y mean "groundbreaking", or "essentially same, but slightly better under some conditions". The former justifies workflow adjustments, the latter doesn't.
RevEng 30 minutes ago
I have used nothing but Sonnet and composer for a year and they work fine. LLMs were certainly not unusable before and Opus is certainly not necessary, especially considering the cost. People get excited by new records on benchmarks but for most day to day work the existing models are sufficient and far more efficient.
jimkleiber 6 hours ago
I wonder if this is like when a restaurant introduces a new menu to increase prices.
Is Opus 4.7 so significantly different in quality that it should use that many more tokens?
I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.
hopfenspergerj 6 hours ago
You can't accidentally retrain a model to use a different tokenizer. It changes the input vectors to the model.
napolux 7 hours ago
Token consumption is huge compared to 4.6, even for smaller tasks. Just by "reasoning" after my first prompt this morning, I burned over 50% of the 5-hour quota.
gck1 4 hours ago
Anthropic is playing a strange game. It's almost like they want you to cancel the subscription if you're an active user, and to subscribe only if you use it once per month to ask what the weather in Berlin is.
First they introduce a policy banning third-party clients, but the way it's written, it affects `claude -p` too, and 3 months later it's still confusing, with no clarification.
Then they hide model's thinking, introduce a new flag which will still show summaries of thinking, which they break again in the next release, with a new flag.
Then they silently cut the usage limits to the point where the exact same usage you're used to consumes 40% of your weekly quota in 5 hours. And not only did they stay silent for two entire weeks, they actively gaslit users by saying they didn't change anything, only to announce later that they did, indeed, change the limits.
Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.
And then this.
Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services under a price point, just increase the price or don't provide them.
EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".
bobjordan 7 hours ago
I've spent the past 4+ months building an internal multi-agent orchestrator for coding teams. Agents communicate through a coordination protocol we built, and all inter-agent messages plus runtime metrics are logged to a database.
Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.
I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.
Opus was launched with:
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35 claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized`
Codex was launched with:
`codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high`
What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.
I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.
So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.
pitched 6 hours ago
I just switched fully into Codex today, off of Claude. The higher usage limits were one factor but I’m also working towards a custom harness that better integrates into the orchestrator. So the Claude TOS was also getting in the way.
ausbah 7 hours ago
Is it really unthinkable that another OSS/local model will be released by DeepSeek, Alibaba, or even Meta that once again gives these companies a run for their money?
zozbot234 7 hours ago
> Is it really unthinkable that another OSS/local model will be released by DeepSeek, Alibaba, or even Meta that once again gives these companies a run for their money?
Plenty of OSS models have been released of late, with GLM and Kimi arguably the most interesting for the near-SOTA case ("give these companies a run for their money"). Of course, actually running them locally for anything other than very slow Q&A is hard.
rectang 7 hours ago
For my working style (fine-grained instructions to the agent), Opus 4.5 is basically ideal. Opus 4.6 and 4.7 seem optimized for more long-running tasks with less back and forth between human and agent; but for me Opus 4.6 was a regression, and it seems like Opus 4.7 will be another.
This gives me hope that even if future versions of Opus continue to target long-running tasks and get more and more expensive while being less and less appropriate for my style, a competitor can build a model akin to Opus 4.5 which is suitable for my workflow, optimizing for other factors like cost.
DeathArrow 4 hours ago
Have you tried GLM 5.1?
amelius 7 hours ago
I'm betting on a company like Taalas making a model that is perhaps less capable but 100x as fast, where you could have dozens of agents looking at your problem from all different angles simultaneously, and so still get better results, faster.
andai 7 hours ago
Yeah, it's a search problem. When verification is cheap, reducing success rate in exchange for massively reducing cost and runtime is the right approach.
never_inline 6 hours ago
100ms 4 hours ago
I'm excited for Taalas, but the worry with that suggestion is that it would blow out energy per net unit of work, which kills a lot of Taalas's buzz. Still, it's inevitable that if you make something an order of magnitude faster, folks will come along and feed it an order of magnitude more work. I hope the middle ground with Taalas is a cottage industry of LLM hosts with small-to-mid-sized budgets hosting last-gen models quite cheaply. Although if they're packed to max utilisation with all the new workloads they enable, latency might not be much better than what we already have today.
embedding-shape 7 hours ago
Nothing is unthinkable. I could imagine a Transformers v2 that looks completely different, maybe iterations on Mamba turn out fruitful, or countless other scenarios.
pitched 7 hours ago
Now that Anthropic have started hiding the chain of thought tokens, it will be a lot harder for them
zozbot234 7 hours ago
Anthropic and OpenAI never showed the true chain of thought tokens. Ironically, that's something you only get from local models.
slowmovintarget 7 hours ago
Qwen released a new model the same day (3.6). The headline was kind of buried by Anthropic's release, though.
casey2 5 hours ago
This regression put Anthropic behind Chinese models actually.
ianberdin 5 hours ago
Opus 4.6 is the main model on https://playcode.io.
It's not a secret that the model is the best in the world. Yet it is crazy expensive, and this 35% is huge for us: $10,000 becomes $13,500. Don't forget, Anthropic's tokenizer also counts way more tokens than other providers'.
We have experimented a lot with GLM 5.1. It is kinda close, but with downsides: no images, adequate context maxes out around 100K, and poor text writing. It is, however, a great designer. So there is no replacement. We pray.
sneak 5 hours ago
How much human developer time can you buy for that $13.5k?
They’ve got us by the balls and they know it.
razodactyl 7 hours ago
If anyone's had 4.7 update any documents so far, notice how concise it is at getting straight to the point. It rewrote some of my existing documentation (using Windsurf as the harness); I'm not sure I liked the decrease in verbosity (it removed columns and combined/compressed concepts), but it makes sense with respect to the model outputting less to save cost.
To me this seems more that it's trained to be concise by default which I guess can be countered with preference instructions if required.
What's interesting to me is that they're using a new tokeniser. Does it mean they trained a new model from scratch, or took an existing model and further trained it with a swapped-out tokeniser?
The looped model research / speculation is also quite interesting - if done right there's significant speed up / resource savings.
andai 7 hours ago
Interesting. In conversational use, it's noticeably more verbose.
BrianneLee011 2 hours ago
We should clarify 'Scaling up' here. Does higher token consumption actually correlate with better accuracy, or are we just increasing overhead?
coldtea 7 hours ago
This, the push towards per-token API charging, and the rest are just a sign of things to come when they finally establish a moat and a full monopoly/duopoly, which is also what all the specialized tools like Designer and the integrations are about.
It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be as if we reversed the democratization of compilers and coding tooling done in the 90s and 00s, and the polished, more capable tools were again all proprietary.
danny_codes 5 hours ago
I doubt that's the case. My guess is we'll hit asymptotic returns from transformers, but price-to-train will fall at Moore's-law rates.
So over time older models will be less valuable, but new models will only be slightly better. Frontier players, therefore, are in a losing business: they need to charge high margins to recoup their high training costs, but latecomers can simply train for a fraction of the cost.
Since performance is asymptotic, eventually the first-mover advantage becomes entirely negligible and LLMs become a simple commodity.
The only moat I can see is data, but distillation proves that this is easy to subvert.
There will probably be a window, though, where insiders get very wealthy by offloading onto retail investors, who will be left holding the bag.
coldtea 3 hours ago
>I doubt that's the case. My guess is we'll hit asymptotic returns from transformers, but price-to-train will fall at Moore's-law rates.
There hasn't been a real Moore's law for a good while, even from before LLMs.
And memory isn't getting less expensive either...
quux 7 hours ago
If only there were an Open AI company whose mandate, built into the structure of the company, were to make frontier models available to everyone for the good of humanity.
Oh well
slowmovintarget 7 hours ago
Things used to be better... really.
OpenAI was built as you say. Google had a corporate motto of "Don't be evil", which they removed so they could, um, do evil stuff without cognitive dissonance, I guess.
This is the other kind of enshittification, where the businesses turn into power accumulators.
throwaway041207 7 hours ago
Yep, between this, the pricing for the code review tool that was released a couple weeks ago ($15-25 a review), the usage pricing, and the very expensive cost of Claude Design, I do wonder if Anthropic is making a conscious, incremental effort to raise the baseline price for AI engineering tasks, especially for enterprise customers.
You could call it a rug pull, but they may just be doing the math and realizing this is where pricing needs to shift before going public.
zozbot234 7 hours ago
There's been speculation that the code review might actually be Mythos. It would seem to explain the cost.
monkpit 6 hours ago
Does this have anything to do with the default xhigh effort?
QuadrupleA 6 hours ago
One thing I don't see mentioned often: the OpenAI API's automatic token caching results in MASSIVE cost savings on agent stuff. Anthropic's deliberate caching is a pain in comparison. Wish they'd just keep the KV cache hot for 60 seconds or so, so we don't have to pay the input costs over and over again for each growing conversation turn.
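For contrast, here is a sketch of what Anthropic's opt-in caching looks like at the request level. This only builds the payload shape (per Anthropic's Messages API prompt-caching docs as I understand them; the model id and text are placeholders) and makes no API call:

```python
# Anthropic caching is explicit: you mark a stable prefix (e.g. the system
# prompt or tool definitions) with cache_control, and subsequent reads of
# that prefix are billed at a reduced rate. OpenAI instead caches long
# shared prefixes automatically, with no payload changes.

def cached_request(system_text: str, user_text: str) -> dict:
    return {
        "model": "claude-opus-4-7",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                "cache_control": {"type": "ephemeral"},  # opt-in cache marker
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }

req = cached_request("You are a coding agent.", "Refactor this module.")
print(req["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

The pain point in the comment above is exactly this: the caller has to decide what to mark and keep the prefix byte-stable across turns, rather than getting prefix reuse for free.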
aray07 7 hours ago
Came to a similar conclusion after running a bunch of tests on the new tokenizer
It was on the higher end of Anthropic's range: closer to 30-40% more tokens.
https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...
alphabettsy 6 hours ago
I'm trying to understand how this is useful information on its own.
Maybe I missed it, but it doesn't tell you whether it's more successful for less overall cost.
I can easily make Sonnet 4.6 cost way more than any Opus model, because while it's cheaper per prompt, it might take 10x more rounds to solve a problem (or never solve it).
senordevnyc 6 hours ago
Everything in AI moves super quickly, including the hivemind. Anthropic was the darling a few weeks ago after the confrontation with the DoD, but now we hate them because they raised their prices a little. Join us!
ivanfioravanti 6 hours ago
Probably due to the new tokenizer: https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...
nmeofthestate 6 hours ago
Is this a weird way of saying Opus got "cheaper" somehow from 4.6 to 4.7?
ben8bit 7 hours ago
Makes me think the model might not actually be smarter, necessarily, just more token-dependent.
hirako2000 7 hours ago
Asking a seller to sell less.
That's an incentive difficult to reconcile with the user's benefit.
To keep this business running they do need to invest to make the best model, period.
It happens to be exactly what Anthropic's strategy is. That and great tooling.
subscribed 4 hours ago
But they're clearly oversubscribed, massively.
And they're selling less and less (suddenly the 5-hour window lasts 1 hour on tasks similar to those it lasted 5 hours on a week ago), so IMO they're scamming.
I hope many people are making notes and will raise heat soon.
l5870uoo9y 7 hours ago
My impression is that the reverse is true when upgrading from GPT-5 to GPT-5.4; it uses fewer tokens(?).
andai 7 hours ago
But with the same tokenizer, right?
The difference here is Opus 4.7 has a new tokenizer which converts the same input text to a higher number of tokens. (But it costs the same per token?)
> Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to ~35% more, varying by content), and /v1/messages/count_tokens will return a different number of tokens for Claude Opus 4.7 than it did for Claude Opus 4.6.
> Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens.
ArtificialAnalysis reports 4.7 significantly reduced output tokens, though, and it was overall ~10% cheaper to run their evals.
I don't know how well that translates to Claude Code usage though, which I think is extremely input heavy.
silverwind 7 hours ago
Still worth it IMHO for important code, but it shows that they are hitting a ceiling in trying to improve the model, which they try to work around by making it more token-inefficient.
eezing 5 hours ago
Not sure if this equates to more spend. Smarter models make fewer mistakes and thus fewer round trips.
blahblaher 7 hours ago
Conspiracy time: they released a new version just so they could increase the price without people complaining so much ("see, this is a new version of the model, so we NEED to increase the price"), similar to how SaaS companies tack some junk onto the product so that they can raise prices.
willis936 7 hours ago
The result is the same: they lose their brand of producing quality output. And the more clever the maneuver they try to pull off, the clearer it is to their customers that they are not earning trust. That's what will matter at the end of this. Poor leadership at Claude.
operatingthetan 5 hours ago
They are trying to pull a rabbit out of a hat. Not surprising that is their SOP given that AI in concept is an attempt to do the very same thing.
cooldk 5 hours ago
Anthropic may have its biases, but its product is undeniably excellent.
erelong 3 hours ago
I was shocked to see phone verification roll out last month as well... yikes.
axeldunkel 7 hours ago
The better the tokenizer maps text to its internal representation, the better the model understands what you are saying, or coding! But 4.7 is much more verbose in my experience, and this probably drives cost/limits a lot.
Shailendra_S 7 hours ago
45% is brutal if you're building on top of these models as a bootstrapped founder. The unit economics just don't work anymore at that price point for most indie products.
What I've been doing is running a dual-model setup — use the cheaper/faster model for the heavy lifting where quality variance doesn't matter much, and only route to the expensive one when the output is customer-facing and quality is non-negotiable. Cuts costs significantly without the user noticing any difference.
The real risk is that pricing like this pushes smaller builders toward open models or Chinese labs like Qwen, which I suspect isn't what Anthropic wants long term.
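The dual-model routing described above can be sketched roughly like this. Everything here is a hypothetical illustration: the model names, prices, and task fields are assumptions, not real API identifiers.

```python
# Sketch of a dual-model routing policy: send quality-critical,
# customer-facing work to an expensive frontier model and everything
# else to a cheap workhorse. Model names and task fields are made up.

CHEAP_MODEL = "small-fast-model"        # hypothetical cheap workhorse
PREMIUM_MODEL = "large-frontier-model"  # hypothetical high-quality model

def pick_model(task: dict) -> str:
    """Route customer-facing or explicitly quality-critical tasks to
    the premium model; default everything else to the cheap one."""
    if task.get("customer_facing") or task.get("quality") == "critical":
        return PREMIUM_MODEL
    return CHEAP_MODEL

# Internal batch job stays cheap; user-visible output gets the big model.
print(pick_model({"kind": "log-summarization"}))        # small-fast-model
print(pick_model({"customer_facing": True}))            # large-frontier-model
```

The savings depend entirely on what fraction of traffic is genuinely quality-critical; if most requests end up routed to the premium model, the split buys little.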
OptionOfT 7 hours ago
That's the risk you take on.
There are 2 things to consider:
* Time to market.
* Building a house on someone else's land.
You're balancing the 2, hoping that you win the time to market, making the second point obsolete from a cost perspective, or that you have money to pivot to DIY.
c0balt 7 hours ago
One could reconsider whether building your business on top of a model without owning the core skills to make your product is viable regardless.
A smaller builder might reconsider (re)acquiring relevant skills and applying them. We don't suddenly lose the ability to program (or hire someone to do it) just because an inference provider is available.
duped 7 hours ago
> if you're building on top of these models as a bootstrapped founder
This is going to be blunt, but this business model is fundamentally unsustainable and "founders" don't get to complain their prospecting costs went up. These businesses are setting themselves up to get Sherlocked.
The only realistic exit for these kinds of businesses is to score a couple gold nuggets, sell them to the highest bidder, and leave.
dackdel 7 hours ago
releases 4.8 and deletes everything else. and now 4.8 costs 500% more than 4.7. i wonder what it would take for people to start using kimi or qwen or other such.
justindotdev 7 hours ago
i think it is quite clear that staying with opus 4.6 is the way to go, on top of the inflation, 4.7 is quite... dumb. i think they have lobotomized this model while they were prioritizing cybersecurity and blocking people from performing potentially harmful security related tasks.
bcherny 7 hours ago
Hey, Boris from the Claude Code team here. People were getting extra cyber warnings when using old versions of Claude Code with Opus 4.7. To fix it, just run claude update to make sure you're on the latest.
Under the hood, what was happening is that older models needed reminders, while 4.7 no longer needs it. When we showed these reminders to 4.7 it tended to over-fixate on them. The fix was to stop adding cyber reminders.
More here: https://x.com/ClaudeDevs/status/2045238786339299431
bakugo 7 hours ago
How do you justify the API and web UI versions of 4.7 refusing to solve NYT Connections puzzles due to "safety"?
matheusmoreira 3 hours ago
What is your response to:
> 4.7 is quite... dumb. i think they have lobotomized this model
Is adaptive thinking still broken? Why was the option to disable it taken away?
vessenes 7 hours ago
4.7 is super variable in my one day experience - it occasionally just nails a task. Then I'm back to arguing with it like it's 2023.
aenis 7 hours ago
My experience as well, unfortunately. I am really looking forward to reading, in a few years, a proper history of the wild west years of AI scaling. What is happening in those companies at the moment must be truly fascinating. How is it possible, for instance, that I never, ever, had an instance of not being able to use Claude despite its runaway success and - I'd guess - exponential increase in infra needs? When I run production workloads on Vertex or Bedrock I am routinely confronted with quotas; here, it always works.
dgellow 7 hours ago
That has been my Friday experience as well… very frustrating to go back to the arguing, I forgot how tense that makes me feel
ai_slop_hater 7 hours ago
Does anyone know what changed in the tokenizer? Does it output multiple tokens for things that were previously one token?
quux 7 hours ago
It must, if it now outputs more tokens than 4.6's tokenizer for the same input. I think the announcement and model cards provide a little more detail as to what exactly is different.
gverrilla 5 hours ago
Yeah I'm seriously considering dropping my Max subscription, unless they do something in the next few days - something like dropping Sonnet 4.7 cheap and powerful.
varispeed 6 hours ago
I spent one day with Opus 4.7 to fix a bug. It just ran in circles despite having the problem "in front of its eyes" with all supporting data, a thorough description of the system, a test harness that reproduces the bug, etc. While I still believe 4.7 is much "smarter" than GPT-5.4, I decided to give it a go anyway. It was giving me dumb answers and going off the rails. After accusing it many times of being a fraud and doing it on purpose so that I'd spend more money, it fixed the bug in one shot.
Having had a taste of unnerfed Opus 4.6, I think they have a conflict of interest: if they let models give the right answer the first time, a person will spend less time with it and spend less money, but if they make the model artificially dumber ("progressive reasoning," if you will), people get frustrated but spend more money.
It is likely happening because the economics don't work. Running a comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users - something's gotta give.
fmckdkxkc 3 hours ago
I enjoy using Claude but I find the vibing stuff starts to cause source-code amnesia. Even if I design something and put forth a thoughtful plan, the more I increase my output the less I feel the “vibes”.
It’s funny everyone says “the cost will just go down” with AI but I don’t know.
We need to keep the open source models alive and thriving. Oh, but wait the AI companies are buying all the hardware.
DeathArrow 6 hours ago
We (my wallet and I) are pretty happy with GLM 5.1 and MiniMax 2.7.
QuadrupleA 6 hours ago
Definitely seems like AI money got tight the last month or two - that the free beer is running out and enshittification has begun.
micromacrofoot 7 hours ago
The latest qwen actually performs a little better for some tasks, in my experience
latest claude still fails the car wash test
reddit_clone 4 hours ago
Not just _wrong_. It is confused! It is actually right in the second sentence. This was Friday, Opus 4.6.
>I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
> Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.
zozbot234 4 hours ago
This is actually a good diagnostic of whether the model is skimping on the thinking loop. Try raising thinking effort and it should get it right. Of course, if you're running this in a coding harness with a whole lot of extraneous context, the model will be awfully confused as to what it should be thinking about.
fny 7 hours ago
I'm going to suggest what's going on here is Hanlon's Razor for models: "Never attribute to malice that which is adequately explained by a model's stupidity."
In my opinion, we've reached some ceiling where more tokens lead only to incremental improvements. A conspiracy seems unlikely given that all providers are still competing for customers, and a 50% token increase drives infra costs up dramatically too.
willis936 7 hours ago
Never attribute to incompetence what is sufficiently explained by greed.
rvz 42 minutes ago
Correct.
mvkel 7 hours ago
The cope is real with this model. Needing an instruction manual to learn how to prompt it "properly" is a glaring regression.
The whole magic of (pre-nerfed) 4.6 was how it magically seemed to understand what I wanted, regardless of how imperfectly I articulated it.
Now, Anthropic says that needing to explicitly define instructions is a "feature"?!
bparsons 7 hours ago
Had a pretty heavy workload yesterday, and never hit the limit on Claude Code. Perhaps they allowed for more tokens for the launch?
Claude design on the other hand seemed to eat through (its own separate usage limit) very fast. Hit the limit this morning in about 45 mins on a max plan. I assume they are going to end up spinning that product off as a separate service.
therobots927 7 hours ago
Wow, this is pretty spectacular. And with the losses Anthropic and OAI are running, don't expect this trend to change. You will get incremental output improvements for a dramatically more expensive subscription plan.
falcor84 7 hours ago
Indeed, and if we accept the argument of this tech approaching AGI, we should expect that within x years, the subscription cost may exceed the salary cost of a junior dev.
To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.
dgellow 7 hours ago
If LLMs do reach AGI (assuming we have an actual agreed upon definition), it would make sense to pay way more than a junior salary. But also, LLMs won’t give us AGI (again, assuming we have an actual, meaningful definition)
therobots927 6 hours ago
I absolutely do not accept that argument. It’s clear models hit a plateau roughly a year ago and all incremental improvements come at an increasingly higher cost.
And junior devs have never added much value. The first two years of any engineer's career are essentially an apprenticeship. There's no value add from having a perpetually junior "employee".
alekseyrozh 6 hours ago
Is it just me? I don't feel difference between 4.6 and 4.7