Hacker News

by Ryan Harman

Claude Sonnet 5 (anthropic.com)

1223 points by marinesebastian a day ago

Jcampuzano2 a day ago

I'm struggling to understand why I'd ever use this instead of just running Opus at a lower effort level, given that on many of the listed benchmarks its cost per task rises above Opus's at anything higher than medium effort.

Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.

itopaloglu83 a day ago

More and more I find myself trying to stop Opus from doing something stupid, and at every turn I need to tell it to stop overcomplicating things.

I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.

I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.

scrollop 12 hours ago

"I think the models are being optimized for wealth extraction from users and companies, instead of solving problems."

YES! They introduced the new tokenizer to increase token generation by up to 33%.

On top of this, Anthropic are generating almost twice as much revenue per paid user as OpenAI - whilst their subscriptions have lower usage limits than OpenAI's:

https://youtu.be/gK-7TKC7kvY?si=kx0qPE1rw-UCI-Jn&t=650

ngruhn 8 hours ago

> I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.

I don't think so. I'd expect that in a market with high vendor lock-in, but that's not the case here. The market is extremely competitive and switching costs are near zero. Anthropic can't afford to pull shit like this and sacrifice quality.

natty 20 hours ago

> More and more I find myself trying to stop Opus from doing something stupid, and at every turn I need to tell it to stop overcomplicating things

Yeah, those are my thoughts as well. I feel it’s great for benchmarks and some tasks, while in others it tries to spend as many tokens as possible, overcomplicates the task, and needs a second or third round of steering, which costs money. At the scale Anthropic operates, I bet that’s a huge amount of extra money just to make sure their model works.

indoordin0saur 17 hours ago

Yeah. Mine really likes to read excess code. I'll ask it questions like "If I move all these three ETL jobs into a subfolder will it break anything?" It'll start with giving me the simple answer but then continue on to consider another question and realize it requires reading my entire other repo that handles all of my cloud's infrastructure. And it'll proceed to read through tens of thousands of lines of terraform.

post-it a day ago

> I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.

Because it reasons in one direction. First it encounters some kind of issue with 2-3 lines of Python that might make it not work, and then it goes onto plan B, which is making a library, but it doesn't circle back and compare the effort of making the library to working around whatever might make the 2-3 lines not work. Except sometimes it does, because it's inscrutable.

benny_s 5 hours ago

My experience with Opus in the last weeks is the opposite. I have the feeling Opus got smarter since they released and blocked Fable. Maybe they got more compute available since a) they finished Training Mythos/Fable and b) couldn't provide inference for it?

Traubenfuchs 9 hours ago

It's really bad when you let opus do investigations on broken java or infrastructure stuff. It starts decompiling .jar, sometimes multiple versions of the same dependency, reading every single kubernetes/terraform file and loading all the logs and info kubectl offers.

3ffs 17 hours ago

There were many of us who predicted and saw this months ago.

Should I refer to those who are only realising this now as stupid? I believe so.

It's not wealth extraction btw - the correct economic term is capturing/extracting surplus. They have a wide range of schemes - quality discrimination being one very obvious one.

Swear most of you on here pretend to be soooo smart when you def are not.

nicce a day ago

Older Opus models will likely get deprecated and then over time this is the cheapest model. That is how prices are currently increased.

ChrisLTD 20 hours ago

Yeah... Sonnet becomes the new cheap model, and some Fable class model becomes the more expensive/better one.

theptip 15 hours ago

Wat. Price/perf has been going down massively over the last few years.

phainopepla2 21 hours ago

Looking at some of the agentic coding benchmarks on the system card[0], pages 117-118, it seems that running it at low outperforms Sonnet 4.6 at any level, and is a good deal cheaper as well. So on low it could be a good workhorse for an Opus-planned task.

[0] https://www.anthropic.com/claude-sonnet-5-system-card

port11 2 hours ago

That is certainly an improvement then. Sonnet 4.6 is a great everyday agent for the limited Pro plan, but it’s not much better than M3 or Kimi 2.7, both significantly cheaper models.

c0m47053 19 hours ago

Specific task based benchmarks don't reflect a lot of day to day agentic use cases in my experience. If you are working on a series of discrete tasks and can clear context after each one and move to the next, you might get that sort of efficiency from Opus low effort. I often find that when working through a real problem, iterating and discovering, context length can creep up, and that is where opus tends to get expensive.

licjon 6 hours ago

Is there a router or wrapper that provides a real-time cost estimation for alternative settings? Obviously, you can't predict exact output tokens without running the inference, but a tool that calculates the exact input cost across models and applies a historical average for the output tokens could be useful. Like, you run a task on Sonnet, and it estimates: "Based on your input tokens and a 1:1 output ratio, this would have cost $X on Opus at a low effort level."
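A rough sketch of the estimate described above, assuming you already have an input token count from somewhere. The model names and per-million-token prices are placeholders, not real list prices, and `estimate_cost` and `compare` are hypothetical helper names. It prices the input exactly and projects output tokens from a historical output-to-input ratio:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Values used below are placeholders, not real prices."""
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical price table; swap in the provider's published rates.
PRICES = {
    "model-small": Pricing(input_per_mtok=3.0, output_per_mtok=15.0),
    "model-large": Pricing(input_per_mtok=5.0, output_per_mtok=25.0),
}


def estimate_cost(model: str, input_tokens: int, output_ratio: float = 1.0) -> float:
    """Exact input cost plus output cost projected from an output/input token ratio."""
    price = PRICES[model]
    projected_output_tokens = input_tokens * output_ratio
    total = (input_tokens * price.input_per_mtok
             + projected_output_tokens * price.output_per_mtok)
    return total / 1_000_000


def compare(input_tokens: int, output_ratio: float = 1.0) -> dict:
    """Estimated cost of the same request on every model in the price table."""
    return {name: estimate_cost(name, input_tokens, output_ratio) for name in PRICES}


if __name__ == "__main__":
    for name, cost in compare(input_tokens=40_000, output_ratio=1.0).items():
        print(f"{name}: ${cost:.4f}")
```

The output ratio is the weak point: it would have to come from your own usage history per model and effort level, since a different model may produce a very different number of output tokens for the same task.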

annjose 3 hours ago

Not sure of any out-of-the-box tool. But Anthropic has a token count API which gives a close estimate of the input tokens for messages [1].

So this API can be called from a UserPromptSubmit hook [2] in the harness to get the token count for any model, calculate the cost, and compare.

[1] https://platform.claude.com/docs/en/build-with-claude/token-...

[2] https://code.claude.com/docs/en/hooks

hollownobody 5 hours ago

I would use Sonnet instead of Opus because it's faster. Isn't it? It's a smaller model

SirMaster a day ago

Maybe it's not for you? I don't pay, so I can't even use Opus... So this is an upgrade over Sonnet 4.6 for me.

theptip 15 hours ago

Are we reading the same chart? They have Sonnet <= high as Pareto dominant on $/perf.

You have to test each task obviously but it is not a bad model on its face.

frozeus 8 hours ago

They have updated it

southforgeai 6 hours ago

I concur. I already use Opus 4.8 for almost all my tasks and this gives me almost no reason to try Sonnet 5.

fluidcruft 6 hours ago

If you are out of Opus credits, you are out of all model credits.

enraged_camel a day ago

Speed is a huge reason. Sometimes you just need to get some simple tasks done fast, and waiting 30-60 seconds for Opus to even start thinking can really slow things down.

humanymous a day ago

Opus with low reasoning effort would be faster than Sonnet with high reasoning. So that won't exactly help. I think it just comes down to what each model is optimized for.

conradkay a day ago

Wow, seems worse even on price/performance than GLM 5.2, which is only 744b parameters.

From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5

As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"

sixtyj a day ago

I have tried to rewrite an article with GLM-5.2 and with Sonnet 4.6. Completely different results as LLM is non-deterministic. But GLM-5.2 made a lot of subtle mistakes that needed to be corrected by hand. In contrast, Sonnet found and corrected all mistakes in the second round.

The situation was similar with planning and coding. GLM-5.2 seems to be good “on paper” but the real-usage results were different.

And I am not an attorney for Claude or GLM-5.2… :)

But as I’ve been using LLM models daily since Nov 2022 I have realized that all common tests have to be confirmed in your project - there is no “one model rules them all” - you need to dig out a specific model from that LLM haystack with thousands of models.

Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)

springtimesun 39 minutes ago

I’m just dipping my feet in the water of local models and I really feel this. I had a simple alignment task (align known quality transcript without timestamps with timestamped but lower quality whisper transcriptions) and I went through 12 rounds of testing across 4 generations of 3 models. The results were all over the map even across versions of the same model. Spin out the task to something as big as coding and wow.

If you have any advice/blogs on doing project specific benchmarks I’d love to hear it. I’m trying, but it’s haphazard at the moment

lawrjone 5 hours ago

We have found something similar when plugging GLM 5.2 into actual benchmarks in our product. The open-source models are really dialled into the public benchmarks; until you try them in context you won't have a solid idea of how they perform (Sonnet is a higher quality model than 5.2 in prose, reasoning, and alignment).

Traubenfuchs 8 hours ago

> Completely different results as LLM is non-deterministic.

You'd need to produce this like 20 times by each model and then do 2x20x20 cross comparisons by both models and ultimately distill the 2x20x20 comparison results into two reports of how they differ.

In this non deterministic computing future, everything else is voodoo, feelings and "vibes".
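A minimal sketch of what that comparison plan looks like, assuming each model both generates 20 samples and acts as a judge over every cross-model pair. The labels "A" and "B" and the name `comparison_plan` are placeholders; the actual generation and judging calls are left out:

```python
from itertools import product


def comparison_plan(n_samples=20, models=("A", "B"), judges=("A", "B")):
    """Yield (judge, sample_from_first_model, sample_from_second_model) tuples.

    Each judge sees every pairing of one sample from the first model with
    one sample from the second, so the total is judges * n_samples * n_samples.
    """
    first, second = models
    for judge, i, j in product(judges, range(n_samples), range(n_samples)):
        yield judge, (first, i), (second, j)


plan = list(comparison_plan())
print(len(plan))  # 2 * 20 * 20 = 800
```

That is 800 judging calls on top of the 40 generations, before distilling anything into the two reports.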

Retr0id a day ago

Finally, a viable business strategy - sell security-oblivious code monkeys for cheap, then charge premium rates for agents capable of cleaning up the mess.

JacobAsmuth a day ago

I think instead they should sell super hackers and get their product banned instantly and go bankrupt

usef- 19 hours ago

Judging by the events of recent weeks, I'm guessing the low cyber results are why they were allowed to release it

loufe a day ago

Not to single you out, parent commenter, but I really hope the quality of discourse on HN will move past these basic comparisons eventually. It seems like every thread on every model release has the exact same comments.

"Wow, X models is Y% better or worse than Claude Z model on T benchmark"

"That's irrelevant, they're just benchmaxing."

"Not useable for daily coding or agentic workloads, the vibes are totally wrong."

"It's almost as good, and costs a lot less, so I will absolutely use it."

"I cannot imagine justifying using these, as the step change means open models lower costs do not make up for the productivity loss"

I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigamarole? I do not know what the alternative would be. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?

tripleee a day ago

I'm not sure what else can be said? I've found benchmarks to be a very weak signal for how good/bad the model is, but it's the #1 thing the companies highlight.

20 minutes after the announcement there's no real useful statement that can be made about it.

sejje 5 hours ago

I feel the same way sometimes.

I read a comment earlier that said "I think it's likely that they've scraped all the code regardless of license and trained on it, given how much they scrape the web."

That's what every other comment said like 3 years ago. Where has this guy been?

The trends in discussion about LLMs get very, very tired--there's little added but personal opinions.

conradkay 20 hours ago

Yeah you definitely have to be skeptical regarding sentiment for open/local model capabilities, since there's bias from what people want to be true.

I generally agree with this in spirit https://www.seangoedecke.com/are-new-models-good/ , but I think you can read Anthropic's results showing Sonnet 5 as almost strictly worse than Opus 4.8 as very credible/meaningful, and then draw comparisons from that

tiahura a day ago

"It's totally obvious they quantized Claude Z"

microtonal a day ago

> Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.

I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me optimistic: I have found that the more models are optimized for fully agentic development, the worse they get at assisted development, and they often start doing too much despite very strict/specific instructions.

I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

Brendinooo a day ago

Yeah, there's a real opportunity for one of these companies to invest time in a model that's tuned for, to use your term, agent-assisted development.

Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.

everforward a day ago

There’s no way to justify their valuations if they get downgraded to a pair programming tool. They need fully agentic stuff to work and replace human engineers to even come close.

Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.

It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.

pkulak a day ago

And every benchmark is "build GTA-6 from nothing, as a single-page web app".

ricardobayes a day ago

They have to, but also everyone working at 3D printing companies thought "industry 4.0" was going to completely override everything, that we were going to print housing and print a mug at home and drink coffee out of it.

Today's news is that Amazon is hiring 11k interns. I think part of the AI story was used as a convenient excuse to get rid of some "fat" and some covid overhiring and gave companies an out to change course.

rconti 21 hours ago

I wonder how portable the existing models are for different use cases. As good as they are for greenfield development or working in a single or across a few tightly coupled repos, they're absolutely terrible at debugging distributed systems and make incredibly wrong yet extremely confident assertions all the time.

I don't know if it's a matter of just requiring a tiny amount of optimization or wholesale redesign.

popalchemist a day ago

Whether they believe it or not is immaterial. It is the end-goal they want to achieve, because then they own the means of production entirely.

jambalaya8 a day ago

As I said, working ourselves out of our jobs within the span of a few years.

jerf a day ago

I've been using Kimi K2.6 lately (don't have 2.7 available through blessed work channels yet) for tasks where I already know what it is I want to do and I want to just step through the process in pieces, and it's fine. Do I have to correct it maybe a bit more than Opus? Yeah, but the real cutoff would be between "I have to read every line" and "I can just trust it without reading every line" and for me neither model hits that mark, and I expect it to be a while yet for that. Is it as good as Opus if I want to spit ball about architecture and then convert that to code? No, but I don't have that problem all the time, and it's there if I do need it.

And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.

That said if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.

nozzlegear a day ago

> I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.

I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.

sparkling 12 hours ago

What is your motivation? Privacy and/or data protection?

I currently don't see a world where it makes sense to run a local model that eats up 60% of my RAM and 20-30% of my disk space while providing worse quality output than a $20/month subscription.

plasticsoprano a day ago

Which quant do you use? I have a similar setup and the speed is atrocious at 4-bit.

kamranjon a day ago

This is the way

m3h a day ago

I think you should try an OpenAI model like GPT 5.5. It is better at following instructions and boundaries set during prompt. It feels like a more capable "agent assistant" than Claude models but without loss of intelligence.

Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review and ask a lot more questions from the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.

ifwinterco 21 hours ago

I prefer GPT 5.5 to Opus but both are absurdly expensive token hogs, I can't afford to use either as my main model at $work with the monthly spend cap we have.

I use Composer (since we use Cursor) or GPT 5.3-codex as my workhorse models and only break out the big guns when I have a genuinely difficult problem to solve.

IMO somewhat weirdly 5.3-codex might be the best overall coding model OpenAI have ever released. It's 90% as good as 5.5 and costs about 20% as much, since it's both cheaper per token and uses fewer tokens for the same task.

I'll miss it when they inevitably deprecate it, but hopefully I can use Kimi K2.7 by then

jklmnopqrstuvw a day ago

From my own experience, GLM-5.2 generally costs more tokens and is much slower.

pimeys a day ago

I use GLM 5.2 Fast from Fireworks and it's very fast. Where are you using it from?

microtonal a day ago

Which inference provider do you use? (Admittedly, I use K2.7 a lot more currently.)

james2doyle a day ago

Tokens and speed are a factor but does it require less back and forth to get things right? Being "fast and cheap but wrong" still has a cost that an otherwise "expensive and slow" exchange does not

mohamedkoubaa a day ago

I've been moving more to Composer 2.5 for the same reason. KISS principle.

everfrustrated 20 hours ago

Composer 2.5 fast (via Grok) is honestly amazing. It's been implementing everything I've asked and getting it right first time. Been impressed with its front-end ability.

If this was the last model I could ever use I think I would be happy.

AdminAdmim a day ago

Same for me, I downgraded my Cursor subscription because when I use Cursor, 90% of the time I use Composer 2.5 fast

indoordin0saur 17 hours ago

Yeah. Opus is nice for tasks that require significant planning and considering broader effects on other parts of the code. But it likes to go off the rails and do too much. Often it gives good-sounding ideas but it has a tendency to distract me by giving me a huge to-do list.

mattmatheus 21 hours ago

I've been working to use the best model for the task for about 6 months and have found great success planning with the 'frontier' model but punting implementation down to a 'lesser' model. I'm using Beads-Rust (a Rust fork of GasTown's beads) as my issue tracker. So far, so good.

whateveracct a day ago

agent-assisted development uses orders of magnitude fewer tokens than agent-driven development

the incentives aren't there sadly

sanderjd a day ago

Not for a business model that scales revenue by token usage. But other business models are available.

nsoonhui 16 hours ago

Sorry, exactly what is the distinction between agent-assist and agent-driven?

I give AI an image and just tell it what's wrong, and then it goes on to fix the bug in the codebase for me (and write the tests), is this agent-assist or agent-driven?

Sometimes I just give the AI my description, and mockup, and it creates a plan and implements the details for me, and I verify visually (this is the weak spot of AI), is this agent-assist or agent-driven?

xpct a day ago

I've been largely disappointed by how much the Claude models ignore custom instructions, and sometimes even prompts on the chat interface. It sometimes feels like talking to a wall, or as if there was a third person in the chatroom whose messages I can't see.

I can't help but feel this is intentional towards the 'Agentic' workflow.

spacephysics a day ago

I think this seems purposeful, as there are 2 opposing forces at play:

    - Have a model that follows the user's instructions
    - Have a model that follows the system prompt instructions more

For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.

Feels like optimizing for either precision or recall, but can't have both

manveerc a day ago

Totally agreed. I sometimes wonder if they are making the model "lazy" with each iteration, it keeps getting better at avoiding work.

marcindulak 20 hours ago

I keep adding selected cases of CLAUDE.md instructions non-compliance reported on claude-code github to that issue https://github.com/anthropics/claude-code/issues/13689. Subjectively the amount of such cases seems lower during the past month. It may be that claude-opus-4-8 (default thinking) is a bit better at instructions following than past models.

gs17 a day ago

> or as if there was a third person in the chatroom whose messages I can't see.

If you set off a classifier, that's how it looks to Claude.

storus a day ago

Try to run your prompts through Claude to pinpoint any ambiguous parts that can be interpreted in multiple ways, or self-contradictory sections. I typically resolve any prompt-ignoring issues with that.

mark_l_watson 21 hours ago

Good point, I also like to do the work myself, with an assistant under my control. I am usually really happy with DeepSeek v4 Flash that I feel just mostly does what I tell it to do, but I do switch to Pro for harder tasks.

There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models.

People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.)

epolanski a day ago

I've been saying for ages that since Opus 4.6, models have been getting smarter but less helpful as assistants.

Fable was amazing as a vibecoder but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon.

It's really grim if you're looking for assistance instead of an implementor.

GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.

I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.

thewebguyd a day ago

By design, unfortunately. If they are just assistants, they can't sell the dream of "we're going to replace human labor completely" to the C-suite.

whstl a day ago

Yep, this is why experiences and ratings of models vary so wildly.

I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became.

I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention.

mullingitover a day ago

> I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.

I've taken to instructing the agent to manage the subagent, and the principal agent's sole job is ensuring the subagent follows instructions to the letter.

epolanski 20 hours ago

Just to follow up on what I mean, this was my first interaction with Sonnet 5:

"I just cloned this repo, investigate how to set it up, don't install anything, just collect information"

_spews information_

I proceed with the setup, but get a Linux specific dependency in a bash script, so I want to evaluate whether it can be rewritten...

"There's this error on MacOS, I think it's because we need linux-utils from brew, verify whether the script can be written in bare posix"

_proceeds installing linux-utils and all the rest_

"Didn't I tell you to not install anything?"

_you're absolutely right_

F*k me..

a_c a day ago

I actually use Sonnet 4.6 for my day to day coding too. It consumes far fewer tokens and is good enough. Opus is just too token-hungry to be useful to me.

bazhand a day ago

Have you tried '/model opusplan'? I've had strong results mixing Opus for planning with Sonnet implementing.

ricardonunez 17 hours ago

I am in the same position. Do you think they are going to remove it and deprecate it at some point?

duxup 21 hours ago

“Hey I saw some messed up function commented out that at face value is a bad idea… so here it is again with some nonsense assumptions ….”

I ask “where did you get that?” … too often if I’m not constantly guiding it, and even then it still goes off the rails.

arikrahman 21 hours ago

I have also started shifting to models more reasonable for my workflow. I've been using the Reasonix harness for Deepseek, and cache hits make the token use basically free. This is with unsubsidized models as well, using American providers.

addozhang 17 hours ago

I feel pretty much the same way, and the scenarios are similar too. Using Sonnet has a bigger advantage when it comes to response time.

bckr 21 hours ago

I suggest encoding your invariants in the harness. Architectural invariants that can be mechanically checked, including which modules are approved, which dependencies, etc.
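As one illustration of a mechanically checkable invariant, here is a minimal sketch that flags imports outside an approved list, assuming the code under review is Python. The allowlist contents and the name `unapproved_imports` are made up for the example; a harness could run something like this after every agent edit and reject the change when the result is non-empty:

```python
import ast

# Hypothetical allowlist of top-level modules the agent may import.
APPROVED_MODULES = {"json", "pathlib", "requests"}


def unapproved_imports(source: str, approved=APPROVED_MODULES) -> set:
    """Return top-level module names imported by `source` that are not approved."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` counts as a dependency on top-level package `a`.
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) stay inside the project, so skip them.
            found.add(node.module.split(".")[0])
    return found - set(approved)


if __name__ == "__main__":
    sample = "import os\nimport json\nfrom pathlib import Path\n"
    print(unapproved_imports(sample))
```

The same idea extends to other invariants (which packages may import which, which directories may be touched), as long as the check is deterministic and does not depend on the model agreeing to follow it.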

lacoolj 20 hours ago

gemma-4-e4b is very good at assistance too, and is local and fast and small (and "free")

trollbridge a day ago

No kidding. I expect to have models to use which are optimised for different use cases.

Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus.

spullara 21 hours ago

if you like that, use gpt models instead.

XCSme 20 hours ago

I just tested it on my benchmarks[0], it's GLM-5.2 level, at 2x cost, but also 2x faster.

Weak spots (categories it fails):

    - Trivia — 0/3 - basically not much built-in knowledge
    - Combined tool-calling tasks — score 45/100, sometimes makes invalid tool calls
    - Puzzle Solving — score 77, flubs carwash-like tests

[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

nsoonhui 19 hours ago

Your benchmark has Gemini 3.5 Flash as the best model, which doesn't compute for me

XCSme 19 hours ago

It is on top for many benchmarks, only not the coding/agentic ones.

Still one of the most intelligent models overall, most likely to get any question you ask correctly (without tools).

BoorishBears 19 hours ago

This guy had a terrible broken benchmark that gets hawked every release, and I wish HN would ban accounts that essentially exist to hawk a personally owned site, especially such a bad one.

XCSme 20 hours ago

As always, note: faster than GLM-5.2 doesn't mean too much, as GLM-5.2 is served by different providers, so the inference speed can vary drastically between providers or over time.

2muchtime 17 hours ago

Opencode Go/Zen claim to use infrastructure based in the EU, USA and Singapore that have a 0 retention policy.

yieldcrv 20 hours ago

What’s everyone's favorite GLM provider?

z.ai doesn't always have the most reliable AI

but I don’t mind the party seeing my trade secrets and thoughts compared to an American corporation + the party seeing my trade secrets and thoughts. So that's not a functional difference to me, and the Chinese one won’t reply to subpoenas so that's a value add tbh

So I’ll consider all, fastest tokens/sec wins

WorldPeas 19 hours ago

the (imperfect) comparison having used both for planning and execution is that GLM5.2 is too jumpy and eager to do things, often to a fault (e.g. deploying/using git when it shouldn't) while sonnet 5 was much lazier than any Claude model I have used has been, not adding an addendum to a plan that I asked for, then lying that it did when asked. Looking at the analysis[0] I don't think it's worth it for me. Maybe for others. Fable was certainly much better.

[0]: https://artificialanalysis.ai/models/claude-sonnet-5

simonw 20 hours ago

Claude Sonnet 5 itself described its pelican as looking like a goose:

> Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.

https://simonwillison.net/2026/Jun/30/claude-sonnet-5/

bel8 18 hours ago

That's possibly the worst pelican I saw from all recent LLMs.

Meanwhile GLM 5.2 drew a cool self-contained fully animated SVG pelican:

https://simonwillison.net/2026/Jun/17/glm-52

simonw 17 hours ago

Yeah, GLM have been beating Anthropic on the pelicans for a while now.

(I suspect that's more of an indication that Anthropic have chosen not to waste resources training on animals riding vehicles, personally.)

kamranjon 12 hours ago

bel8 16 hours ago

philipwhiuk 6 hours ago

Just need the legs to interact with the pedals now :D

bel8 2 hours ago

user3939382 6 hours ago

Mine is to ask it to write a parallel parking simulator and animation. The math there is surprisingly complex including differential equations. Fable 5 can almost one shot it with all tunable params.

Sol- a day ago

Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code. After all, if it has the ability to generate safe code, it would imply that it knows something about cybersecurity, which could surely be used to hack all the banks in the world.

pennomi a day ago

Trying to censor nudity in image generation models caused all kinds of problems with anatomy in image models. I’m sure these models will have similar issues with security.

raincole 19 hours ago

Censorship on image generation models works on another level. The models can generate NSFW, but there are extra computer vision models checking if the images can be shown to the users. It's especially obvious for Grok and ChatGPT.

nodja 7 hours ago

BoorishBears 18 hours ago

NonHyloMorph 20 hours ago

Interesting, you find that in medieval painting, due to the authority of the Catholic Church.

Traubenfuchs 8 hours ago

I think the cool kids call this "staying away from the vector space of highly skilled security engineers".

deaux a day ago

> Wonder if the whole cyber paranoia leads to their models ultimately generating less secure code.

This may be the goal.

m3h a day ago

Important to note: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."

ComplexSystems 20 hours ago

So the post-introductory price is set such that Sonnet 5 will cost 100%-135% as much?

m3h 19 hours ago

Correct. Albeit the nuance here is that a more capable model might solve problems more efficiently and faster, possibly saving you tokens.

As with any new model, you won't know the real impact until you start using it for your workload.
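
A rough sketch of the arithmetic in this subthread, assuming the per-token price stays fixed and only the token count changes. The function name, the 2M-token workload, and the $15-per-million price are illustrative assumptions, not published pricing; only the 1.0–1.35× range comes from the quoted announcement.

```python
def effective_cost(old_tokens: int, price_per_mtok: float, token_multiplier: float) -> float:
    """Dollar cost of text that took `old_tokens` under the old tokenizer."""
    # The same text maps to more tokens under the new tokenizer.
    new_tokens = old_tokens * token_multiplier
    return new_tokens / 1_000_000 * price_per_mtok


# Hypothetical workload: 2M tokens under the old tokenizer, at an assumed $15 per million.
baseline = effective_cost(2_000_000, 15.0, 1.0)     # multiplier at the low end of the range
worst_case = effective_cost(2_000_000, 15.0, 1.35)  # multiplier at the high end of the range

print(f"baseline ${baseline:.2f}, worst case ${worst_case:.2f}, "
      f"ratio {worst_case / baseline:.2f}x")
```

Under these assumptions the bill scales linearly with the multiplier, so what matters in practice is whether the model finishes the same task in fewer tokens.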

mattas a day ago

"We can raise prices in two ways: (1) raise the price per token and (2) increase the number of tokens we generate on your behalf. We promise not to do (2) maliciously. Promise."

conradkay a day ago

I think the incentives are less bad since a good chunk of usage comes from subscription plans.

There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, there were a lot of complaints when they changed the default effort from high to medium.

squeegmeister a day ago

Wouldn't it be more malicious for them not to mention this at all?

Alifatisk a day ago

mdrzn 7 hours ago

Edit June 30, 2026: In the original version of this post, we included a cost-performance chart for the BrowseComp evaluation that was based on data from a simpler methodology that did not reflect the standard methodology we use for agentic search evaluations. This had the result of underestimating Sonnet 5's performance on the evaluation.

They changed the Sonnet 5 'Agentic search' benchmark graph overnight

phillipcarter a day ago

Seems to be another great incremental update to the workhorse, nice!

I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.

SeanAnderson 21 hours ago

Crazy. I just changed the default for our entire org to Opus because people were continually unimpressed with Sonnet's abilities. It's fascinating to think how varied people's experiences are when interacting with LLMs and how much the outcomes depend on how people approach interacting with the models.

thewebguyd a day ago

Yeah I think people are sleeping on the smaller/faster models like Sonnet. As long as you have a detailed plan or small, well scoped individual tasks Sonnet can implement just fine. Opus will still do better at more open ended tasks or completely "vibe coding." Or spec/plan with Opus, and have Sonnet implement.

conradkay a day ago

I was surprised to learn that Sonnet generally has the same tokens per second as Opus

Computer0 21 hours ago

philipwhiuk 6 hours ago

It's a 30% price increase once the discount rate vanishes.

doctoboggan a day ago

The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.

jimbo808 16 hours ago

They're actively trying to use lobbying power to make open weight models illegal. So I'm just not going to use their services at all anymore. I don't think they're a net gain if you're a skilled senior, and the hidden cost in terms of technical debt and skill atrophy is just being swept under the rug. I'll be okay without their bullshit generator.

pmarreck 14 hours ago

> I don't think they're a net gain if you're a skilled senior

I'm a skilled senior (I'm 54 and been coding since I was about 8; I've been 100% AI-generated code for at least 6 months now and have produced a combination of speed and quality that has astonished me; my velocity is apparent at https://github.com/pmarreck/) and this has been a massive net gain, so your claim is now officially in sheer defiance of reality.

In a skilled senior's hands, this is like an expert power tool. In the hands of someone less-skilled, it is likely also... less-skilled. It's a magnifier.

> and the hidden cost in terms of technical debt and skill atrophy is just being swept under the rug.

Nope, no it's not. It's being reviewed, measured, and controlled against. Because... you WILL need more controls to take full advantage. Look, I even invented a whole new control methodology around it called MFIC: https://gist.github.com/pmarreck/b30aa3ca69cb70a5526f8a63ab8...

jimbo808 12 hours ago

Thanemate 10 hours ago

integricho 12 hours ago

throwaway27448 13 hours ago

kerabatsos 12 hours ago

byzantinegene 14 hours ago

philipwhiuk 6 hours ago

dimitrios1 14 hours ago

sensanaty 9 hours ago

kakacik 9 hours ago

vidarh 11 hours ago

> I don't think they're a net gain if you're a skilled senior

I've had Claude Code running a /loop for the last week driving down complex crashing bugs in a prototype compiler entirely unilaterally. I occasionally glance over.

A few of those crashing test cases were ones I've spent more than a week trying to track down myself. I have 30 years of experience of doing this.

It's worked 24/7.

So far it has fixed over 500 of them.

Will there be technical debt? Yes. But nothing that remotely compares to the cost I'd have incurred of fixing all of those myself.

It is hard to reconcile those gains without thinking that if people are saying these are not a net gain, they haven't really tried learning how to get the full benefit. If you sit and watch a model work and keep intervening all the time, then sure, they're not going to be a net gain.

vidarh 2 hours ago

kelnos 14 hours ago

Why even bother posting, especially as a reply to a completely unrelated comment? This is just not substantive or useful to the conversation.

(And I say this as someone who agrees with you that it's garbage that these companies are trying to legislate their way into an oligopoly.)

jimbo808 12 hours ago

casey2 13 hours ago

lemonteaau 12 hours ago

The irony is that an authoritarian country is leading the world in open models

pjerem 11 hours ago

throwaway27448 13 hours ago

Making software illegal is not an easy task.

anonzzzies 15 hours ago

Sure about Dario (and all billionaire) weirdness, but no gains if you are a skilled senior is, well, very far out in our experience (our company is 30 years old with mostly the original employees and founders): what we deliver now, at the speed and quality we deliver it, would have been impossible 10 years ago with our team size of skilled seniors. We replaced all the commercial products our clients and ourselves used with our own, giving us millions more revenue and profit with the upselling and efficiency benefits. We work for regulated clients: our code is reviewed, pentested and audited regularly by us and 3rd parties so it's not slop either. You are definitely leaving money on the table. We do mostly use Chinese models on our own hardware (we colocate cages of racks) so this is not about Anthropic but about AI in general.

Skill atrophy is a real thing though; we try to prevent this by having hackathons (for lack of a better word) without AI where I pick something extremely non-trivial and we implement it for fun and profit without AI (which would not matter much as they are currently bad at these things); the last one was flex paxos for our in-house db with obvious metrics for the end result: data integrity (duh) under failure and performance better than or at least the same as our raft production version.

andyroid 14 hours ago

mastazi 15 hours ago

good luck actually enforcing that.

andsoitis 16 hours ago

> They're actively trying to use lobbying power to make open weight models illegal.

What is your evidence?

Robdel12 16 hours ago

jimbo808 16 hours ago

AquinasCoder a day ago

While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.

In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.

matheusmoreira 19 hours ago

I always use Opus 4.8 at max effort for everything. The $20 subscription didn't have enough tokens, but the $100 one had too many of them. So now I just max out Opus in order to maintain 100% weekly utilization.

Abishek_Muthian 13 hours ago

easygenes 18 hours ago

ATMLOTTOBEER 19 hours ago

chewz 14 hours ago

m-dot-reviews 17 hours ago

I've been plugging this perhaps too many times now, but I am trying to bootstrap a user-sourced corpus of exactly "what model is good at task X". So, not benchmarks, but high-level tasks. There's a bit of an ordering problem in that nobody wants to bother commenting on a site that has few comments - so PTAL and contribute if you can. https://model.reviews

nolok 20 hours ago

Same boat as you, and my answer is "... Except when I ask an overall or checkup task that is specifically heavy or overseeing in which case I use the maximum level" which lately meant ultracode.

I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.

sanderjd a day ago

What I want is a harness that knows how to optimize this kind of thing for me.

nl 18 hours ago

cunningfatalist a day ago

manojlds a day ago

brobdingnagians 20 hours ago

I tend to run it on High and then step it up for problems where I'm noticing it struggles, bump it back down after. Sometimes I accidentally leave a session in Ultracode for a day and wonder why things are taking so long, but generally happy with the results.

jbvlkt 21 hours ago

Exactly this is my problem with all AI tools. I want someone else to create working tools for me so I can focus on my product. It is the same with other tools. I do not want to spend huge amounts of energy and time to set up my IDE, operating system or desk layout. I guess it is too early to have that now.

jerojero 20 hours ago

jimbo808 16 hours ago

It's really not that much. It's a bit hard to make sense of it not because it's hard to keep track of, but because they are being deceptive and opaque about what you're actually buying, and the thing you're paying for is different from one day to the next, as they fuck around with the parameters to boost subjective performance during a launch, then quietly degrade the service to cut costs.

tash_2s 16 hours ago

I also ended up using max effort/reasoning for both coding and general chat. They don't spend too much extra time on simple tasks these days.

throwaway219450 18 hours ago

Same advice as ever? We call it context engineering now, but prompt engineering still matters a lot. Most of the failures I run into are unspecified assumptions made by the model that derail the conversation, but usually updating the first prompt fixes it. Opus in my experience is a bit better about checking assumptions, while Sonnet will plow on ahead. An example is mentioning a file that doesn't exist: Sonnet will go ahead and try to grep your entire hard drive for it. Opus will say it's not local and request the path.

I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.

deadbabe 18 hours ago

There are token optimization consultants that can help organizations find the right balance of models for their employees to minimize costs.

j45 19 hours ago

Just because it’s hard to keep track of doesn’t mean it’s not relevant.

Playing around with learning the differences is incredibly helpful to schedule on one's calendar weekly for an hour or two, while saving links throughout the week to try out.

paulddraper 20 hours ago

It's almost like you want an automatically intelligent choice of your artificial intelligence.

Understandable frankly.

jacooper a day ago

Just use deepswe as a reference point.

2001zhaozhao a day ago

There are two wrinkles to this:

- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.

- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them

timcobb a day ago

> This is why there was a "Sonnet only" usage bar for Max tier for the longest time.

it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?

i000 21 hours ago

laughingcurve 21 hours ago

Torkel a day ago

Yeah, I was looking at the same chart and was very surprised at where the curve is relative to opus... Feels like sonnet 5 is "what if opus had an extra-low effort level"?

energy123 a day ago

The arguable caveat is Sonnet may run faster (although this isn't known for sure, due to more tokens being used for the same task), so you can potentially get more done in a synchronous iterative workflow

I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.

kolinko 20 hours ago

From my benchmarks, sadly, it doesn't seem to be the case much. Surprisingly. I found Sonnet comparable in speed to Opus (sic), but perhaps I was testing it wrong?

riverbirch 19 hours ago

XCSme 20 hours ago

Well, it is a Sonnet model, it is indeed better[0] than Sonnet 4.6 (smarter, faster, cheaper), but I don't see why you would use it as opposed to Opus 4.8 low or GLM-5.2...

[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

XCSme 20 hours ago

What's interesting, is that Sonnet 5 is actually worse[0] than 4.6 without reasoning.

It makes some sense, as models are trained more and more with reasoning, than without.

[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-non...

lucamark a day ago

You're referring to the Agentic search, but if you look at the Agentic computer use the cost is basically halved.

However, I am also confused about market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model to address complex real world problems.

Rarely used Sonnet btw.

energy123 a day ago

You're the second person that has said this but I cannot understand why you are interpreting the "Agentic computer use" graph in this manner.

The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.

lucamark a day ago

annzabelle 20 hours ago

> Too expensive to perform daily tasks - open souce models are much cheaper

There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.

Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.

girvo 20 hours ago

The specific market positioning is... for me to use at my big tech company job, where we aren't allowed to use GLM and similar, but have fixed caps on how much token usage we're allowed to rack up a month.

johnfn a day ago

That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up just as you'd expect. I imagine the suggestion is that performance vs effort tradeoff is task dependent.

energy123 a day ago

No it doesn't? It's worse than Opus across the whole shared frontier on both plots.

acchow 21 hours ago

seiru a day ago

Worth noting that the default chart there is for "agentic search performance", not coding. I didn't see an effort comparison for coding specifically.

partsch 20 hours ago

I feel like the charts have been adjusted. I am quite sure, they looked different a couple hours ago...

callahad 18 hours ago

They've absolutely both changed. The initial version I saw didn't include max effort data points on the first chart, and the plot itself was much less favorable to Sonnet at high/xhigh relative to Opus, but the new chart shows them as closer competitors. Weird.

booi a day ago

I actually exclusively use Sonnet at low effort level. It's too slow otherwise, and at higher effort levels it is strictly worse than Opus.

intellijdd a day ago

I noticed that as well but with the introductory pricing, I wonder how true that is.

It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.

I guess I could get Sonnet 5 to do it.

manojlds a day ago

Opus 4.8 high doing better and cheaper than Sonnet 5 xhigh

al_borland a day ago

What is a "task" in real-world terms? If it will be $15/million output tokens, and high/xhigh is somewhere in the $7.50/task range, does that mean a single task is using 500k tokens? That seems like it would start to add up fast.

wyre a day ago

I’ve found input tokens are around 5x more than output, so a task could be a couple million thinking tokens and then a couple hundred thousand output tokens?
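
The back-of-the-envelope math in this exchange can be sketched as follows. The $15/million output price and the $7.50/task figure are the ones quoted in the question, and the 5x input-to-output ratio is the rough figure from the reply; the $3/million input price and the function name are hypothetical, and the sketch ignores prompt caching discounts entirely.

```python
def tokens_per_task(cost_per_task, output_price, input_price=0.0, input_ratio=0.0):
    """Return (input_tokens, output_tokens) implied by a per-task dollar cost.

    Prices are dollars per million tokens; input_ratio is input tokens per output token.
    """
    # Dollars spent for each output token, including the input tokens that accompany it.
    dollars_per_output_token = (output_price + input_ratio * input_price) / 1_000_000
    output_tokens = cost_per_task / dollars_per_output_token
    return input_ratio * output_tokens, output_tokens


# Case 1: the whole $7.50 goes to output tokens at $15 per million.
_, output_only = tokens_per_task(7.50, 15.0)

# Case 2: 5x as many input tokens as output, at a HYPOTHETICAL $3 per million input.
mixed_input, mixed_output = tokens_per_task(7.50, 15.0, input_price=3.0, input_ratio=5.0)

print(round(output_only), round(mixed_input), round(mixed_output))
```

With output tokens alone, $7.50 buys about 500k tokens, matching the estimate in the question. Once input is billed at the assumed rate and ratio, the same budget covers about 250k output tokens plus 1.25M input tokens, so the answer depends heavily on the input/output mix.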

goldenarm 20 hours ago

It's funny the exact same thing happened to Gemini 3.5 flash. Cheaper and more agentic model that ends up worse and more expensive than 3.5 pro low.

Readerium 19 hours ago

3.5 Pro hasn't launched yet; you mean 3.1 Pro?

goldenarm 18 hours ago

Natelinathan a day ago

I just re-wrote the /code-review skill Anthropic ships to use Sonnet 4.6 for some tasks, as it was using Opus for simple git diff commands and similarly mechanical tasks (launched 100+ agents for one of my diffs, cmon). I wonder how Sonnet 5 will impact my usage.

Does anyone else have any review token saving measures?

nicce a day ago

> Opus always performs better for a given cost.

Assume it to get deprecated sooner rather than later.

ZeWaka a day ago

It's very interesting. Why even release a new product that underperforms at the same price level? Why not just lock it?

I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.

make3 19 hours ago

it might be worth it if speed is an issue

windexh8er 18 hours ago

Except for the fact that Opus 4.8 is not good. Constant hallucinations, doesn't use the web very intentionally until you explicitly ask it to and it nopes out rather quick on benign items. Anthropic has been very disappointing as of late. All of the gatekeeping is taking a toll on what should be some of the better models out there, but you can't trust 4.8 to go off on its own. It will burn down tokens doing what it deems correct as per its guidance. Truly painful to use.

lukan 18 hours ago

"but you can't trust 4.8 to go off on its own."

And what (available) model do you trust to go off on its own?

windexh8er 14 hours ago

nijave 5 hours ago

> Me: What was the sushi place near latitude 41 in Columbus? Did it go out of business I don't see it on Google maps anymore

> Sonnet 5 (medium): None of these past chats mention a sushi place — I don't have anything on record about that. Do you remember the name, or roughly which part of Columbus (neighborhood/street) it was near? That'll help me search and check its current status.

Not impressed. It got the name right on high effort one shot but hallucinated the date relativity (Jan 2026 is not last month...). Worked okay on extra. Sonnet 4.6 worked fine on medium, high, and extra one shot.

Edit: as the replies point out, the prompt is definitely ambiguous however Sonnet 5 didn't even extract the semantic meaning "looking for <place> near <compound place>" which all the human replies seem to understand. Even Haiku 4.5 identifies the semantic meaning although it fails to retrieve the correct results

> Haiku 4.5 (reasoning off) I'll help you find information about that sushi place. Let me search for sushi restaurants that were near latitude 41 in Columbus and check their current status.

>_Searched the web_

> I see that "Latitude 41" is a restaurant in Columbus, but it's actually a modern American restaurant, not specifically a sushi place. However, based on your mention of latitude 41, that's what came up. Let me search more specifically for sushi restaurants that may have closed in Columbus around that area.

Mtinie 5 hours ago

What was your expectation? That your prompt would trigger a web search, first, before the introspection of past conversations and a training set recall?

How did Sonnet 4.6 respond that was objectively better for your use case?

sejje 5 hours ago

Try it 25 more times and let us know how it averages out. It's non-deterministic, remember?

nijave 5 hours ago

I tried 3 more times. Two were nearly identical and 1 recognized Latitude 41 as a restaurant but had a similar useless reply
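
To put a number on the "try it more times" point, here is a minimal sketch of a Wilson score interval for a pass rate estimated from a handful of runs. Reading the four medium-effort attempts in this subthread as zero useful answers is an interpretation, not a measured result, and the function name is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed reading of the thread: 0 useful answers out of 4 attempts.
low, high = wilson_interval(0, 4)
print(f"plausible success rate: {low:.2f} to {high:.2f}")
```

With 0 of 4, the interval runs from 0 to roughly 0.49, so four failures alone cannot distinguish a model that never gets this right from one that gets it right almost half the time. That is the case for running more trials before drawing conclusions.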

ben_w 5 hours ago

satvikpendem a day ago

> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.

And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.

secretslol a day ago

"Lower ability to perform cybersecurity-related tasks" makes me super concerned it will leave my codebase like Swiss cheese for any American granny with access to Fable 5, when we non-American Brits, or rest-of-worlders, don't have access to it to clean our codebases.

__alexs a day ago

100% this. I read these caveats in new models and all I hear is "we made sure this model has no idea about computer security." Such a weird thing to brag about.

doublescoop a day ago

This is code for "this model can't be used to hack other systems as effectively as Opus or Mythos."

kube-system a day ago

matheusmoreira 19 hours ago

That's the literal mission of the NSA. Security and strong cryptography for the US while everyone else gets "export grade" nonsense.

cute_boi a day ago

I think they don’t understand that cybersecurity skills are what prevent bad code from making it into production.

It’s like telling a chef to cook without a knife because knives can kill people.

Dario and his lackeys at Anthropic aren’t visionaries.

norseboar a day ago

baq a day ago

Aeolun 19 hours ago

I think that increasingly, the US will have to be passed by for these things. Clearly we’ll have to start looking to China for world leadership, to be the land of the free.

kube-system a day ago

> any American granny with access to Fable 5,

Fable is effectively not available to the general public in the US either

secretslol 19 hours ago

goalieca a day ago

That’s not even close to true. Unless you’re vibe coding trash that a better model might catch.

secretslol a day ago

zlurker a day ago

They spent months hyping up Mythos and ended up with it banned. I’d assume they want to both differentiate their products and appeal to regulators here

worldsavior a day ago

They will release it eventually. Once they see the Chinese models are close to Mythos level they will release it before, so it will be "revolutionary".

jaapz a day ago

sixothree a day ago

I'm starting to think it discovered a 0-day held hidden by our government.

noumenon1111 21 hours ago

kristianc a day ago

There are two classes of models now - the cybersecurity ones that none of us are getting, and the 'safe' models released for general consumption. This is letting us know which side of the divide it sits on.

Taek a day ago

There's also Chinese models, which aren't trying to self-limit capabilities.

axus a day ago

baq a day ago

bwat49 a day ago

This seems rather counter-productive. Wouldn't a model with fewer cybersecurity capabilities be more likely to produce insecure code? Not to mention, Chinese models don't have these restrictions and can be used to exploit said insecure code.

I suppose I shouldn't be surprised at how the Trump admin is approaching AI regulation, counter-productive is really all they do

ihsw 21 hours ago

K0balt a day ago

Restricting the models isn’t about restricting offensive capabilities. They were already very well aligned to reduce that risk.

This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap US cyberoffensive capability.

It’s like making seatbelts illegal so that police chases can be more effective.

MostlyStable a day ago

Why do you think they are bragging? Anthropic has long been the company to give us by far the most in-depth information about their models, both positive and negative. I read this as them just stating a fact about this model that users would want to know.

organsnyder a day ago

I'm absolutely certain that their marketing team has input on (if not owning) these announcements.

gallerdude a day ago

MostlyStable a day ago

MallocVoidstar a day ago

The preceding sentence is

>Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.

which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.

MostlyStable a day ago

satvikpendem a day ago

Anthropic, most in-depth? That's laughable given how closed down they've been over the years. If you want in-depth, DeepSeek actually still publishes papers of their methods for anyone to implement, leading to them being by far the most cost-efficient model provider for the performance.

MostlyStable a day ago

bluepeter a day ago

Flowers for Algernon. And, sadly, expect this from now on. You saw it with OpenAI releasing Sol/Terra/Luna with a chart showing how they weren't quite as good as Mythos. It's all messaging to the USG to try to avoid/minimize arbitrary review from multiple agencies. 'Hey, it's smart, but look how stupid it is at "cyber."'

dgacmu a day ago

One of the best queries I've done with an LLM recently was: Create a plan for improving the robustness and resilience of this code, particularly to untrusted inputs.

Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.

There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.

lanthissa a day ago

So it doesn't get blocked. Last time they said a model was great at cyber it didn't turn out well.

Philpax a day ago

To avoid Lutnick getting on their case again.

dgellow a day ago

He has the opportunity to do the funniest thing ever

johnfn a day ago

> Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.

What exactly do you want Anthropic to say here? "This model, the one we are about to give to the entire world for cheap, is really good at hacking"? Saying Sonnet is terrible at cybersecurity is the most reasonable thing they can say, out of a lot of bad options.

nozzlegear a day ago

It seems obvious to me that they put that in there in an effort to avoid another reaming out by the long, orange dick of the US government.

pseudosavant 21 hours ago

So that the current US administration doesn't block broad usage of Sonnet 5 probably. They'd have to collect your ID and approve you if it was good at cybersecurity. Because such is the freedom in the U.S. right now.

doctoboggan a day ago

You have to pay more for that, and/or go through some USG vetting process.

2001zhaozhao a day ago

They are obviously trying to avoid getting Sonnet 5 blocked.

WithinReason a day ago

That part is likely directly addressed to the US government.

chvid a day ago

Does it mean it generates code with random security holes?

jayd16 a day ago

Market segmentation?

re-thc a day ago

> And Opus 4.8 is still cheaper for a higher pass rate

Unless it spams as much as Opus, I doubt it. Opus 4.8 literally spams text like puke. On a longer run especially if you get cache misses here and there the bulk of the cost is all the extra context it adds.

drcongo a day ago

What makes that a brag?

johnfahey a day ago

Judging from those cost-performance graphs, Sonnet doesn't make sense to run at anything higher than a medium reasoning level, since Opus 4.8 low reasoning outclasses it for the price.

This line as a selling point is also pretty funny:

> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

ianberdin 19 hours ago

Anthropic outsmarted everyone again.

They released Sonnet 5 with a temporary price reduction until August. Everyone was excited, but in reality, they increased the tokenizer size by 50%. As a result, the actual cost went up by 50%, while they shifted everyone's attention to the decrease.

Thus, Anthropic is raising prices but not telling anyone about it. Nobody is really aware of it. You go to the pricing page, the price looks the same. Yet people are actually paying 50% more.

Very shady marketing.

And of course they lie about 35% again. In reality with coding it is 50%.

UPD: I run playcode.io, so it’s my job to test all models, their pricing, and their quality in order to provide the best price/quality/speed/reliability to non-technical users.

wolttam a day ago

I didn't think they'd actually release a model that was worse than the open-weight frontier and at a higher price-point. Wow.

LUmBULtERA a day ago

That's yet to be determined. I think a lot of open-weight models are benchmaxxed and their usefulness for many tasks is not represented by those benchmarks.

enraged_camel a day ago

Yes, this has been my experience. They all struggle with long-horizon tasks and eventually start going in circles.

winrid 7 hours ago

Today I tested GLM 5.2 by giving it an example stylesheet and told it to change the background color of a submit button.

It then hallucinated the submit button class...

s3p a day ago

Why did the other reply to this get flagged as dead? It was a comment about how someone would come out saying that Sonnet 5 would be better on the pelican test and therefore it has to be good. But I guess HN loves pelican SVGs so much that you're not allowed to criticize it.

steveklabnik a day ago

If you look at the account history, it's pretty clearly an account-level thing, not a comment-level thing.

theLiminator a day ago

Seems like the way to go for any smaller models is to only use the low reasoning levels, and for anything where you'd want it to reason harder, to just use a larger model.

In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).

adam_arthur 21 hours ago

I've found disabling reasoning entirely but adding a "reason" to the JSON response from the LLM to work significantly faster and consume many fewer tokens for narrowly scoped prompts.

At least for Claude family models.

e.g. {
  "reason": "<Describe why you picked this result>",
  "selection": "<The number of the value you selected>"
}

I'm sure native reasoning produces more accurate results, but for my use case the quality was about the same, and the model would reason for thousands of tokens in native reasoning vs just 1-200 with response level reasoning.

Again, to be clear, this is for deterministic/pipeline style workflows, not agentic/coding use.
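A minimal sketch of the response-level reasoning pattern described above, assuming the model is asked to reply with a JSON object containing `reason` and `selection`. The helper names (`build_prompt`, `parse_selection`) and the sample reply are hypothetical; no particular SDK is assumed, and the model call itself is left out.

```python
import json

def build_prompt(question: str, options: list[str]) -> str:
    """Ask for a short 'reason' field in the reply instead of native reasoning."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    return (
        f"{question}\n\n{numbered}\n\n"
        "Respond with JSON only, in this shape:\n"
        '{"reason": "<Describe why you picked this result>", '
        '"selection": "<The number of the value you selected>"}'
    )

def parse_selection(raw: str, n_options: int) -> tuple[int, str]:
    """Parse the JSON reply and validate that the selection is in range."""
    data = json.loads(raw)
    choice = int(data["selection"])
    if not 1 <= choice <= n_options:
        raise ValueError(f"selection {choice} out of range 1..{n_options}")
    return choice, data["reason"]

# Hypothetical reply, standing in for whatever the model returns.
reply = '{"reason": "Option 2 mentions a refund.", "selection": "2"}'
print(parse_selection(reply, n_options=3))  # (2, 'Option 2 mentions a refund.')
```

Listing `reason` before `selection` is presumably deliberate: the model emits the keys in order, so the short justification is written before the answer rather than after it.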

grim_io 9 hours ago

What you are doing is producing an unnecessary summary of the result, not the reasoning that models do to come up with the result.

I don't get what value you get out of this.

adam_arthur 4 hours ago

docheinestages a day ago

My experience with using low reasoning effort has been nothing but a waste of time. Claude often keeps guessing, not calling tools to ground itself, and basically at the end I end up wasting the same amount of tokens or just switch to Opus on xhigh. It's been a terrible experience.

mwigdahl a day ago

Not to sound like an LLM, but that seems exactly right to me. Use it as a cheaper, high-functioning task subagent and lower reasoning for a master Opus session. As long as not every portion of your task requires maximum intelligence, you should come out ahead.

user43928 a day ago

Won't any input be charged uncached, and the output of the small model charged again as uncached input to the bigger model?

I don't know whether that comes out ahead compared to just staying with the better model in the first place.

mwigdahl a day ago

theHocineSaad 5 hours ago

What's interesting is that Claude Sonnet 5 costs more per task ($2.29) than Opus 4.8 ($1.80), while the latter is obviously better!

It actually costs more per task than every other model. It's only cheaper than Claude Fable 5.

Source: https://artificialanalysis.ai/?cost=cost-per-task#price-and-..., as of writing this comment (the results are frequently changing)

mag7269 a day ago

When can we get a new Haiku? 4.5 came out nearly a year ago, and it's showing its age.

scosman a day ago

Look at Qwen for that level of intelligence.

anthonypasq a day ago

needs to be on bedrock for me to use it at work

0xbadcafebee a day ago

brunooliv a day ago

I only wish for Opus 4.6 from earlier this year at a faster inference speed. Since Opus 4.6, things have been so much messier, and the overall push for more agency isn’t really panning out for agent-assisted development as much as they would like.

fractorial 18 hours ago

I still use Opus 4.6 (with later models for subagents only sometimes), but I have been preparing for it to go away.

ashvardanian 18 hours ago

Got really excited for this model and asked my Opus planners in 3 pretty different projects to use Sonnets instead of Opus subagents to help me experiment on HPC kernels faster. Not one of them ended up writing a single line of code... Sonnets just kept spinning, wasting tokens. Can't remember the last time it happened with Opus in my codebases. Reverting back.

bearjaws 17 hours ago

I've seen this happen before when they launch new models. When Opus 4.7 came out it was "working" for 20+ min before I just exited entirely and waited till next day.

Went away on its own.

phtrivier 20 hours ago

What is the reference, unbiased, honest, reputable and trustworthy site that ranks and compares models on the couple of realistic metrics that matter? ("Does it work for code", "no, I mean, for real", "how much does it cost", etc.)

kccqzy 19 hours ago

It’s not really possible unless you try. Different people use models so differently. The whole model situation has made public minute differences in personal preferences in the process of coding. Some people think carefully and strive to write code that’s as bug free as humanly possible on the first try; others write something that is only approximately correct and then iterate afterwards. The former people would align with a model that thinks for 40 minutes before producing flawless code; the latter would be driven mad by this excessive thinking. Some people like to interrupt AI as soon as they see AI making a mistake, others let AI continue and tell them about the mistake afterwards.

girvo 20 hours ago

Truthfully? There isn't one. They all have flaws. Your best bet is to look at all of them, and then run a suite of evals yourself. It's rough out here!

bel8 20 hours ago

The only metric that worked for me is running the same prompt 5x for each LLM on my projects.

I keep specific branches in a state where they are ready to develop new features.

garo-pro a day ago

Seems like the cyber detection is even on Sonnet now. https://support.claude.com/en/articles/14604842-real-time-cy...

SkitterKherpi a day ago

$5/$25 for Opus 4.8 vs $3/$15 doesn't seem cheap enough to be worth it. It depends how much better it is than e.g. Mimo, but I imagine Mimo and co. to be too cost efficient in the lower tier to be overtaken by Sonnet for most tasks.

make3 13 hours ago

it's also a lot faster I would assume

sreekanth850 8 hours ago

After using codex i will never return to cc even if they offer it for free.

runnig 2 hours ago

I tried Sonnet 5 and burned the entire 5h quota on a single deep research run. This has never happened with Opus before.

mchusma a day ago

This is much more interesting of a model at $2/$10 (their launch pricing) than at full price. There are many competing models at around this level of performance.

I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.

mchusma a day ago

Really if they wanted a standout model that would really take the wind out of GLM's sails, they should have made this the new Haiku, priced at Haiku levels with this performance.

alvis a day ago

Ironically, the key message of today's release is that Sonnet 5 is far less capable than Opus 4.8 and Mythos 5. It's a funny development in the past few weeks.

solenoid0937 12 hours ago

Duh? It's their cheapest model aside from Haiku.

tokengod a day ago

That’s nice, but we want Fable

giancarlostoro a day ago

The reality is that Fable will eventually be obsolete and Sonnet / Opus will surpass it. Fable did cost 2x as much as Opus, so I assume it involves a much higher cost for what it did, but I wouldn't be surprised if Fable will be obsoleted by Opus or even Sonnet sooner or later at less cost.

maxloh 4 hours ago

According to CursorBench [0], Fable is the first runner-up, scoring 72.9% ($18.02, Max), while Opus 4.7 Max hits 64.8% ($11.02) and GPT-5.5 Extra High sits at 64.3% ($4.37).

I bet most American companies would choose Fable over GPT-5.5. Employee salaries cost far more than token costs. Getting the job done right is much more important.

[0]: https://cursor.com/cursorbench

ianhawes a day ago

Okay I don’t care about “eventually”, I want Fable now.

arcatech a day ago

astlouis44 a day ago

Same

DonsDiscountGas a day ago

I'd love if they would include speed (though I know there are difficulties involved). At this point the quality of Opus 4.8 is no longer my limiting factor, it's the speed, so a faster model would be great.

boc a day ago

Have you tried Opus on fast mode?

DonsDiscountGas 19 hours ago

I haven't because I'm not made of money but maybe I will

gertlabs 10 hours ago

In our coding evaluations, we found Sonnet 5 is more capable than Sonnet 4.6 (which was an underrated model itself), but is now faster and slightly cheaper.

Sonnet 5's performance is comparable to GLM 5.2 in both one-shot coding and agentic ability. However, it's about ~20% less verbose than GLM 5.2 in average code submission sizes, and uses fewer reasoning tokens, which reduces the cost gap and suggests it writes cleaner code. In practice, Sonnet 5 ends up being 40% more expensive and ~2x faster than GLM 5.2 in our evaluations (not 300% more expensive as the per-token pricing would suggest). Granted, GLM 5.2 is an extremely reasoning heavy model.

Overall, it's a solid release that gives Anthropic some standing in the price-conscious inference market.

Data at https://gertlabs.com/rankings
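The gap between "300% more expensive per token" and "40% more expensive in practice" can be sanity-checked with a little arithmetic. This is a rough sketch, assuming cost per task is simply price per token times tokens per task (ignoring the input/output split and caching); `implied_token_ratio` is a hypothetical helper, and the only inputs are the two percentages quoted above.

```python
def implied_token_ratio(price_ratio: float, cost_ratio: float) -> float:
    """Tokens-per-task ratio implied by a per-token price ratio and a per-task cost ratio."""
    return cost_ratio / price_ratio

# "300% more expensive" per token means 4x the price; "40% more expensive" per task means 1.4x.
ratio = implied_token_ratio(price_ratio=4.0, cost_ratio=1.4)
print(f"{ratio:.2f}")  # 0.35
```

Under that simplification, Sonnet 5 would be spending roughly 35% as many tokens per task as GLM 5.2 in these evaluations.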

Zababa 10 hours ago

Artificial analysis shows Sonnet 5 as ~2 times more verbose than GLM 5.2. I wouldn't call Sonnet 4.6 underrated, it's in "chinese open source model territory" and unless you rely only on subscriptions it has alternatives.

chipgap98 a day ago

Interesting that tasks on extra high cost almost the same as Opus 4.8 with a slightly worse performance

bredren a day ago

This is on the browsercomp graph, right?

In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?

Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.

Whereas Sonnet 5 delivers great value (by browsercomp benchmarks and compared to opus) when running in low and medium.

So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.

---

I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.

I am a bit confused on the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription Quota.

But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.

mchusma a day ago

I agree with this assessment, IMO my takeaway from this is "Generally run Sonnet on low, otherwise use Opus". It's kind of like an "extra low" setting of Opus. (depends on the application for sure).

bredren a day ago

mcbuilder a day ago

LRMs are plateauing for sure, not that there won't be gains to be had in the future, but it's not like the era of rapid progress that was the past year any more.

gdhkgdhkvff a day ago

I agree that the rapid improvement from like 2023-24 era is over (from a perspective of going from a 3/10 to a 7/10, you can’t then go to a 11/10). There was just so much more space to grow back then.

But isn’t Fable supposed to be another step change? I never used it, myself.

Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDEs have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+”cheaper for harness owner”.

ZeroCool2u 11 hours ago

roughly a day ago

A great many people were predicting this would be the case a year ago and being told they were wrong and to get on the boat.

mcbuilder a day ago

terekhindc 3 hours ago

cost per task > opus-low is a weird place to land. is there a specific task shape where sonnet 5 medium actually wins?

827a 21 hours ago

Tbh we'll see what using it looks like, but the reasoning/cost charts do not look promising. It seems like the only useful reasoning level for Sonnet 5 is Low; medium might trade blows at price/performance with Opus, but anything beyond that Opus is Just Better.

I struggle to understand where this model fits in. If I need a cheap model for simple stuff (like, summarizing an email); I'd go Haiku (actually, I'd go Deepseek v4 Flash, but you catch my drift). I just can't think of many tasks where I'm like "yeah let me reach for Sonnet Low Reasoning so I can save a dollar but also seriously run the risk of it failing"; I'd just reach for Opus Low.

brokencode 21 hours ago

Kind of crazy how bad this release actually is. I even dug around in the full system card, and every graph showed the same thing.

Low and maybe medium will save money on simpler tasks, but after that it just isn’t worth it compared to Opus.

I wish they would have explained in the blog post why they think anybody would ever want to use this above medium.

Maybe it works well on things that aren’t clear in the benchmarks.

siva7 12 hours ago

Why would a company explain how limited their own major release is?

johnhamlin a day ago

Kind of hilarious how much they’re touting that it sucks at cybersecurity like it’s a feature

babelfish a day ago

System Card: https://www-cdn.anthropic.com/d9bb04416ffe1352af84721476c1fa...

midtake 17 hours ago

5 as in 5 times more likely to tell you that you can't edit your driver INF files because that enables DRM circumvention and is dangerous!

stavarotti 17 hours ago

I’ll continue to use the last great reasonably affordable duo from Anthropic: Opus 4.6 for planning and Sonnet 4.6 for implementation.

richardfey 8 hours ago

I don't know what I am doing right, or wrong, but I have access to claude and codex and I find myself giving the more serious work to codex recently. I tend to trust it more. I might try again Fable when it's back, but this Sonnet 5 didn't work well for my current projects.

jaggirs 7 hours ago

Same (Opus 4.8 vs gpt 5.5)

I keep having to correct 4.8, but 5.5 more often than not is correcting me.

Opus writes a bit nicer though, and it is easier to follow what it is doing/saying. Not too different from talking to humans: 5.5 feels like a very smart 'nerd' that doesn't make a huge effort to communicate well, while Opus is a bit less intelligent, but that makes its ideas easier to communicate.

jbritton 17 hours ago

I accidentally used Sonnet 5 a bit today. It seemed significantly worse to me than Opus 4.8 for software development.

taspeotis 19 hours ago

> Claude Opus 4.7 and later Opus models, Claude Fable 5, Claude Mythos 5, Claude Mythos Preview, and Claude Sonnet 5 use a newer tokenizer that contributes to their improved performance on a wide range of tasks. This tokenizer produces approximately 30% more tokens for the same text. Claude Sonnet 4.6 and earlier models use the previous tokenizer.

Alien1Being 14 hours ago

Only if you have no problem with their extremely harmful political lobbying.

boutell 19 hours ago

Until now we've been using Sonnet 4 to power an editing agent in ApostropheCMS. Sonnet is a good price/quality/speed compromise, but sometimes when giving it a large set of instructions it would miss half of them. At least until we told it to go back and try again.

In my early tests tonight, Sonnet 5 is a LOT better out of the box. It's one-shotting complex instructions. It also recovered independently from bad instructions that led to an uninformative 400 error by using its schema-fetching tool to figure out there was too much input.

If I have to gripe about something: it interpreted another impossible instruction by quietly discarding the input in question. But, the way it did it is... kinda exactly what anybody else would do, if they weren't in a position to change the implementation.

This is, obviously, early days but I'm impressed.

mosbyllc 8 hours ago

Claude is a great model for me, but unfortunately, its quota is often insufficient. It seems that many people are now considering Codex as an alternative. If the quota is sufficient, I believe many people will continue to use the Claude Code model.

hdjrudni 8 hours ago

Codex is not better anymore. It appears they nerfed their quota a few weeks ago. I never used to hit my 5 hr limit, now I always do. Sometimes in like 2 prompts.

pheewma 4 hours ago

Codex was running a 2x usage promotion from around the time when Claude introduced rate limiting during peak hours, until May 31st. The various relevant subreddits were (more) insufferable: just 1000 posts per day to the tune of "Just switched to codex! So much more usage!" only to have that tone flip immediately after the promo ran out.

iLoveOncall 8 hours ago

Neither Claude, nor Codex, nor Claude Code are models.

Claude is a series of models (Claude Sonnet X, Claude Opus X, etc.), Claude Code is their development CLI that uses their models, and Codex is the same as Claude Code but from OpenAI.

Ultimately the quota is linked to neither of those 3 directly, rather to which specific model you invoke.

epsteingpt 16 hours ago

If only the agentic model supported the most popular agents like Hermes and OpenClaw...

Escapade5160 9 hours ago

At that price you should just use glm-5.2. You get an Opus class model for 1/3 the cost.

andai a day ago

Opus 4.8 beats Sonnet 5 on the pareto frontier in several of their graphs (Agentic Search, Agentic Computer Use).

In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.

I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.

tldr: if you're doing something hard, just use a bigger model.
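To make the "pareto frontier" point concrete, here is a small sketch that filters (cost, pass rate) points down to the non-dominated ones. The model labels and numbers are made up for illustration and are NOT taken from Anthropic's charts; `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(points):
    """Return names of configs that no other config beats on both cost and score.

    A config is dominated if another one is at least as cheap AND at least
    as accurate, and strictly better on one of the two.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (name, cost per task in USD, pass rate) values.
runs = [
    ("small-low", 0.50, 0.55),
    ("small-high", 2.00, 0.68),
    ("big-low", 1.20, 0.70),
    ("big-high", 3.00, 0.80),
]
print(pareto_frontier(runs))  # ['small-low', 'big-low', 'big-high']
```

In this toy data the small model at high effort drops off the frontier because the big model at low effort is both cheaper and more accurate, which is the shape of the pattern described above.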

copperx a day ago

And Claude Code penalizes you for using Sonnet on the subscription plan, so there's little reason to use it.

bredren a day ago

This is what I realized, can you provide more detail on how you've observed this? The /usage screen does not make it clear.

MillionOClock a day ago

gverrilla a day ago

How so?

cenobyte a day ago

Claude Sonnet 5 is built to be the most agentic Sonnet model yet.

The Dodge Charger is built to be the most Charger like car yet.

kingjimmy a day ago

interesting footnotes: "Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer... can map to more tokens: roughly 1.0–1.35× depending on the content type." AKA expect higher costs on Sonnet 5 vs Sonnet 4.6 for the same tasks.

winstonp a day ago

same happened to Opus 4.7

theplumber a day ago

Is there any reason to use Sonnet instead of GLM?

hootz a day ago

Your US company banning usage of non-american models. Other than that, no.

jedisct1 a day ago

This.

atemerev a day ago

Speed. But mostly no.

grim_io 9 hours ago

The exact opposite, actually.

Sonnet is slower due to much higher output and reasoning token generation.

tripleee a day ago

interesting how much worse the sentiment around Anthropic is getting

mwigdahl a day ago

Seems like a combination of multiple factors:

"They took my shit away!" -- 3-day Fable 5 addicts (me)

"How dare they tell Trump no?" -- US nationalist / "my country right or wrong" types

"Great to see a closed source company fail!" -- open source boosters

"Great to see an American company fail!" -- anti-US, and/or pro-China folks

"Great to see a successful company fail!" -- anti-capitalists and/or sour-grapes crab bucket types

"Serves you right for ripping off creators!" -- copyright warriors

"They keep silently nerfing the models!" -- secret downgrade conspiracy theorists

"Quit killing the planet!" -- anti-datacenter advocates

thepasch a day ago

I'm personally in the "they keep releasing shameless lobbying papers disguised as thinly veiled research or essay-coded content, push anticompetitive walled-garden practices, show little else but contempt for their non-enterprise customer base, refuse to communicate about anything and choose public silence as their baseline, seemingly force their employees into vows of public silence as well, actively degrade their products across the board with their vibeslop approach with measurable impacts on customers, openly attack not only open weights models but open source software, and all while pretending they're the 'public benefit corporation' formed by a valiant group of heroes escaping from a duplicitous snake and who, even in light of their own massively duplicitous behavior as of late, should apparently be trusted to be the some sort of arbiter over what this tech should get to be and how it should get to be used while they could hardly be more gleeful about how we're all going to be replaced in 6 months from now perpetually" camp.

Which is a bit of a bummer considering they do genuinely make the best model that's most pleasant to work with in my opinion.

tripleee a day ago

It seems to be more them losing goodwill combined with their marketing.

I don't agree with your framing that all negativity is from crazies

mwigdahl a day ago

dimgl 12 hours ago

Yeah you're overthinking it. Their product releases and their general approach to business is harming their business.

0xbadcafebee a day ago

"OpenAI models are better, cheaper, and more reliable" - rational people

noumenon1111 20 hours ago

Most of these are good points though with the right framing.

baalimago a day ago

Not looking great for an upcoming IPO

mrcwinn a day ago

You’re right, it’s looking stellar. Well beyond great. Real, and unprecedented, revenue growth will do that for a company.

CuriouslyC a day ago

"Real and unprecedented revenue growth"

Bro that is financial engineering, not real revenue growth. They engineered the switch to usage based pricing and a price hike timed the quarter before they wanted to go public, long enough to juice their numbers but not long enough for them not to be able to manage backlash and have to walk things back. Then they tried to extrapolate that manufactured bump to make it look like they have record shattering revenue growth.

docheinestages a day ago

But does it burn tokens just like Opus? That's the feeling I have nowadays. Regardless of what model I choose, the 5-hour limit gets exhausted in the first hour or so.

a_c a day ago

"Claude Sonnet 5 is available everywhere today at an introductory price of $2 per million input tokens and $10 per million output tokens through August 31, 2026. It then moves to standard pricing at $3 per million input tokens and $15 per million output tokens."

"Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."

If we trust them, then it is roughly the same as sonnet 4.6
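A back-of-the-envelope check of the "roughly cost-neutral" claim, using only the quoted prices and the quoted 1.0–1.35× token multiplier. One assumption is not in the quote: that Sonnet 4.6 was priced at the same $3 per million input tokens as Sonnet 5's standard rate. `relative_cost` is a hypothetical helper.

```python
OLD_PRICE = 3.00       # ASSUMPTION: Sonnet 4.6 input price, USD per million tokens
INTRO_PRICE = 2.00     # quoted introductory Sonnet 5 input price
STANDARD_PRICE = 3.00  # quoted standard Sonnet 5 input price

def relative_cost(new_price: float, old_price: float, token_multiplier: float) -> float:
    """Cost of the same text on the new model, relative to the old model."""
    return (new_price * token_multiplier) / old_price

for mult in (1.0, 1.35):
    intro = relative_cost(INTRO_PRICE, OLD_PRICE, mult)
    standard = relative_cost(STANDARD_PRICE, OLD_PRICE, mult)
    print(f"multiplier {mult:.2f}: intro {intro:.2f}x, standard {standard:.2f}x")
# multiplier 1.00: intro 0.67x, standard 1.00x
# multiplier 1.35: intro 0.90x, standard 1.35x
```

Under that assumption, the introductory price lands between about 0.67× and 0.90× of the old cost for the same text, and the standard price between 1.0× and 1.35×. The output side gives the same ratios, since $10/$15 is also two thirds.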

alvis a day ago

What I'm starting to hate is that each model's effort level can mean completely different power.

Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/

nsingh2 a day ago

That seems to only be true for the "Agentic Search" benchmark. That benchmark in particular is a bit weird, because Sonnet 4.6 effort levels had a relatively small effect, so Sonnet 5 med is basically comparable to all effort levels of Sonnet 4.6.

benjiro29 a day ago

Anybody notice that they did not include Sonnet 5 Max in the "Agentic Search results", when comparing to Opus 4.8 ...

Based upon the "Agentic Computer usage", Sonnet 5 Max was going to be off "Agentic Search results" chart. lol ...

In short, Sonnet 5 Low/Medium is more cost efficient if it's a task below Opus 4.8 Medium. For the rest it's expensive and you're better off using Opus 4.8.

Why even release this model?

ricardobeat a day ago

Because it’s a massive improvement over the previous model, and cheaper?

You are reading too much into the graph and ignoring the threshold of usefulness for real world tasks. By that logic Sonnet 4.5 would have never been worth using.

benjiro29 a day ago

Am I missing something? Because you're making my point. It's only worth it compared to Opus 4.8 if the tasks you're running require Opus 4.8 low (or a non-existent lower level).

For the rest the gap in pricing vs efficiency is so small, that there is no point in using Sonnet. I am looking at their own cost comparisons vs efficiency...

ricardobeat a day ago

bredren a day ago

I'd narrow that to why even allow the harness to run `high` on this model?

crorella 17 hours ago

Fun/interesting to see how opensource models surpassed Anthropic's

m3h a day ago

Why is Claude Sonnet 5 allowed to be released but OpenAI Terra not? Are they not the same class of models?

swe_dima a day ago

Not sure what niche it's going to occupy: too expensive for its intelligence category.

ThouYS 21 hours ago

Why did this get the coveted "5"? I want an Opus that can compete with GPT 5.5

Cu3PO42 a day ago

Sonnet 5 is not currently available in the EU region on Bedrock, whereas previous models were and still are. I wonder if this is only due to early stages of the rollout or if this is due to recent US restrictions.

Unfortunately that means I won't be using it at work for now.

mellosty a day ago

Sonnet seems to be really expensive

mrcwinn a day ago

Have you followed Anthropic at all?

rw2 a day ago

The "cheaper models" from big AI companies are next to useless as they don't even score as well as the open/super cheap Chinese models. Only the frontier big models like Fable and Opus have value.

mellosty a day ago

It does not pass the "I want to wash my car, should I drive or walk"

cheesecompiler a day ago

did for me even on low non thinking effort

gverrilla a day ago

GIGO, as they say.

addozhang 18 hours ago

In the 4.x era, I prefer Sonnet to Opus. The quality of Sonnet generation is good enough for me, but it's much faster than Opus.

SoKamil a day ago

I believe that’s gonna be meta for agentic coding this year for enterprises. Cost optimized models approaching SOTA capabilities on software engineering but without cybersec training.

beernet a day ago

Anthropic's run on the model and product side of things is highly impressive. They got Sam A. punching the air consistently, which is well-deserved and self-inflicted above all.

CuriouslyC a day ago

Wdym? They've been knocking it out of the park on marketing, but Claude Code is still a meme, and Opus is getting trashed by GPT5.5 meanwhile you can't even use their "dominant" model, and anecdotal reports from when people could use Fable, when they weren't getting silently poisoned, was that it was only marginally better than GPT 5.5 in terms of SWE smarts, mostly being better in terms of pleasantness to interact with and design taste.

beernet a day ago

> Claude Code is still a meme

Claude Code generates more revenue than OpenAI...It appears to be a nice meme.

CuriouslyC a day ago

edude03 20 hours ago

Let’s see how long until opus 5 comes out but to me this lends some credence to the rumour that fable/mythos was supposed to be opus 5

scottfits a day ago

> the computer use evaluation OSWorld-Verified. Sonnet 5 (orange line) is a strict improvement over Sonnet 4.6

cool to see, still waiting for models to get better at computer use.

arendtio a day ago

> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.

It seems being incompetent is a feature now...

primaprashant a day ago

Based on both performance vs price charts, it seems using Opus 4.8 with med effort is almost a better choice than using Sonnet 5 at xhigh effort

frobisher 11 hours ago

Costs are very opaque from within the product...

jerrygoyal a day ago

It's actually a huge update for building products, given most tasks are sub-agent driven where Sonnet is used, steered by Opus.

OsrsNeedsf2P a day ago

Great timing. I just started using Claude Sonnet for a long-term reverse engineering project[0] for a game I used to play as a kid. Cheaper tokens plus a model that is sufficiently smart, with hard verification, make it a perfect combo for the task.

[0] https://github.com/dginovker/BFME-Source-Code/

caste 20 hours ago

idk, i think they just tried to compensate for the ban of fable, nothing too good

docproof a day ago

The jump in reasoning quality is noticeable. What's interesting is how it handles ambiguous instructions now — it seems to ask fewer clarifying questions and just makes a reasonable judgment call. That's a double-edged sword depending on your use case.

oybng 20 hours ago

In my case, 4.6 degraded massively over time. 5 fails the same basic tasks that I gave 4.6 yesterday. And quite frankly this low, med, high, extra, max, turbo, ultra, ludicrous nonsense is getting tiresome

jchw a day ago

American AI company status: We are now bragging about how bad our models are unironically.

Okay.

Scroll_Swe a day ago

I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.

smallerfish a day ago

Ah that's why Opus has been so slow for the last couple of days.

joaohaas a day ago

Important to note that the cost graphs are heavily distorted. The agentic search one for example is divided into 3 'columns': $0-$2, $2-$5 and $5-$10.

And yet, the $2-$5 section is the widest, even though it only contains a single point.

I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD

nickosh 16 hours ago

It looks good. Now waiting for Opus 5.

whh 21 hours ago

It's not Fable, but I'll take it.

tensegrist a day ago

there was a vibecoded prediction market–style page that was put up yesterday (?) that got the date exactly right i think

rib3ye a day ago

link?

partsch a day ago

maybe https://outyet.ai/models/claude-sonnet-5?

matheusmoreira 19 hours ago

Who cares about Sonnet? I want to know about Fable. Are the export restrictions really going to be permanent?

stingraycharles 19 hours ago

It’s supposed to happen when Anthropic introduces identification, which I believe is planned for mid-July.

matheusmoreira 19 hours ago

Not a US citizen. Identity verification is not going to help me.

taytus 19 hours ago

Roughly on par with GLM 5.2 at 5x the price

neonstatic 10 hours ago

I appreciate they added thinking. Sonnet used to think in the actual response, leading to a lot of unnecessary burden for me. "This thing is X, no wait, it's actually Y. Therefore..." - now it's hidden in the thinking trail, so I don't have to read it unless I want to.

prmph a day ago

So many things to think about regarding these "benchmarks":

- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?

- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?

- Would it be more useful to move toward a comparative rather than absolute ranking?

ai_fry_ur_brain 21 hours ago

Finally a model release where everyone is realising the scam. The world is healing (maybe).

Foobar8568 10 hours ago

And Anthropic made that shit model the default. After a single prompt I was wondering what the shit it was spouting, and yes, it was Sonnet 5.

micromacrofoot a day ago

So they repackaged Fable and added "don't scare the government" to the prompt

actionfromafar 21 hours ago

This is downvoted, but how can it not be a little true?

docheinestages a day ago

Is it just me or is there a huge difference between how much one can accomplish in a 5-hour window with GPT 5.5 on xhigh versus any Claude model?

mrcwinn a day ago

I exclusively use 5.5-xhigh-fast within Codex and find it superior to Opus 4.8.

PeterStuer a day ago

Anyone else feel like Opus 4.8 got significantly dumber over the last 2 weeks?

syngrog66 15 hours ago

I'd rather upgrade myself to a more effective version, thanks. in part because I have a monopoly in the market on providing Me

_pdp_ a day ago

Too expensive?

oezen 13 hours ago

opus is better

guelo 21 hours ago

Have they ever said what the difference is between Sonnet and Opus? Are they trained differently? Different architectures? Is Sonnet a distillation? Is it just that Sonnet has less resources for inference?

None of the other labs are doing this kind of long lived two model series.

jsnell 21 hours ago

Gemini has had Pro and Flash since May 2024, across three major version numbers. The Opus and Sonnet naming is only two months older than that.

artursapek 21 hours ago

I run a proofreading benchmark that tests how well models can find and fix errors in English text. They get several passes in a simple agent loop. Sonnet 5 is definitely better than Sonnet 4.6, but inferior on both quality and cost to GLM 5.1, GLM 5.2, Gemini 3.1 Flash, and Gemini 3.1 Pro. https://revise.io/errata-bench

impodimium 14 hours ago

Eh still looks like it is weaker than Opus 4.8 but maybe a good replacement for Sonnet 4.x

gverrilla a day ago

Is this the default model for non-paying users? If so, that could be an interesting move in the competition for this segment.

ekjhgkejhgk a day ago

In effective terms they're lowering prices.

ClaudioCronin 12 hours ago

nice!

moomin a day ago

I feel like this is a bit of a disappointment. Sonnet 4 was a clear step above Opus 3.x, while this is a lot muddier.

andrewchambers a day ago

The whole fable fiasco really soured me on Anthropic. This just looks disappointing by comparison.

mesmertech a day ago

Ok, that's a one-month clock to the next Opus model at least, so that's a silver lining to a meh model.

varispeed a day ago

What is the point if it is one Trump's brain fart away from being blocked?

botfriendsarent 20 hours ago

Sonnet 5 OUCH! every model is just loaded with more hurt, stolen content, BS prompts, more scare tactics, more illusions, more government lobbying, less honesty.

Oh Claude you master of software engineering does it ever end? DO you have no bounds?

How may we further assist you oh Claude?

m3kw9 20 hours ago

should have called it 4.9, it doesn't deserve the 5 moniker

stackedinserter a day ago

"Our new model is proudly dumber now!"

mwigdahl a day ago

What? If you're comparing their models in the same size class, Sonnet 5 is Pareto-optimal over Sonnet 4.6.

zamadatix a day ago

I think they mean per dollar in the perf/$ charts, not per marketing class. I.e. the new model is a complete Pareto failure in said perf/$ charts with the sole exception of Sonnet 5 low, which is dumb enough to not have a comparison at all. Opus 4.8 delivers a better outcome per dollar, regardless of what the underlying size of the models is.

I'd generously assume this is something about the specific category of agentic task presented in the chart... but it does raise the question "then why is that category the one they chose to highlight here".

mwigdahl a day ago

Madmallard 15 hours ago

Claude thread top of HN

loads of trust me bro benchmarks

financially incentivized comments and upvote/downvoting patterns

it's all slop

Getchowned a day ago

Fable soon please.

kvetching 20 hours ago

GLM 5.2 is better and cheaper. Maybe they are trying to embarrass Trump by making it look like we are losing to China.

jongjinchoi 9 hours ago

I think so. GLM 5.2 is more reasonable.

kvetching 19 hours ago

And it worked. BREAKING: The export controls on Claude Fable 5 are expected to be lifted tonight, per Politico!

lucynight a day ago

AMAZING


