Claude Opus 4.6 (anthropic.com)

1246 points by HellsMaddy 5 hours ago

ck_one an hour ago

Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).

Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).

Freaking impressive!

LanceJones a few seconds ago

Assuming this experiment involved isolating the LLM from its training set?

bartman 2 minutes ago

Have you by any chance tried this with GPT 4.1 too (also 1M context)?

xiomrze 28 minutes ago

Honest question, how do you know if it's pulling from context vs from memory?

If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.

ck_one a minute ago

When I tried it without web search (so only internal knowledge), it missed ~15 spells.

petercooper 17 minutes ago

One possible trick could be to search and replace them all with nonsense alternatives then see if it extracts those.

andai 5 minutes ago

ozim 10 minutes ago

Exactly. There was a study where they tried to make an LLM reproduce the HP books word for word, giving it the first sentences and letting it cook.

Basically, with some tricks they managed to get it 99% word for word. The tricks were needed to bypass the safety measures that are in place precisely to stop people from retrieving training material.

clanker_fluffer 18 minutes ago

What was your prompt?

meroes 28 minutes ago

What is this supposed to show exactly? Those books have been fed into LLMs for years, and there's likely even specific RLHF on extracting spells from HP.

muzani 2 minutes ago

There was a time when I put the EA-Nasir text into base64 and asked AI to convert it. Remarkably, it identified the correct text but pulled the most popular translation rather than the one I gave it.

rvz 11 minutes ago

> What is this supposed to show exactly?

Nothing.

You can be sure that this was already known in the training data of PDFs, books and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like the one the OP just did are meaningless.

Such "benchmarks" are performative to VCs and they do not ask why isn't the research and testing itself done independently but is almost always done by their own in-house researchers.

zamadatix 43 minutes ago

To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:

> The smug look on Malfoy’s face flickered.

> “No one asked your opinion, you filthy little Mudblood,” he spat.

> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.

> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.

> “Ron! Ron! Are you all right?” squealed Hermione.

> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.

ck_one 4 minutes ago

Then it's fair that it didn't find it.

guluarte 16 minutes ago

You can get the same result just asking Opus/GPT; it's probably internalized knowledge from Reddit or similar sites.

ck_one 2 minutes ago

If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.

hbarka 26 minutes ago

If you wanted to fit all 7 books, would you use RAG or another solution?

gizmodo59 4 hours ago

GPT-5.3 Codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes it with 77.3% on Terminal-Bench. The shortest-lived lead: less than 35 minutes. What a time to be alive!

wasmainiac 3 hours ago

Dumb question: can these benchmarks be trusted when model performance tends to vary depending on the hour and the load on OpenAI's servers? How do I know I'm not getting a severe penalty for chatting at the wrong time? Or even, are the models at their best after launch and then slowly dialed back to more economical settings after the hype wears off?

tedsanders 2 hours ago

We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.

(I'm from OpenAI.)

Trufa 2 hours ago

zamadatix an hour ago

Someone1234 2 hours ago

Corence 3 hours ago

It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.

However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.

mrandish 13 minutes ago

ifwinterco 3 hours ago

On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better

CraigJPerry 2 hours ago

SatvikBeri 29 minutes ago

georgeven 3 hours ago

elAhmo 3 hours ago

smcleod an hour ago

I don't think much from OpenAI can be trusted tbh.

aaaalone 3 hours ago

At the end of the day you test it for your use cases anyway but it makes it a great initial hint if it's worth it to test out.

cyanydeez 3 hours ago

When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM?

I definitely suspect all these models are being degraded during heavy loads.

j_maffe 3 hours ago

thinkingtoilet an hour ago

We know OpenAI already got caught getting benchmark data and tuning their models to it. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.

purplerabbit 4 hours ago

The lack of broad benchmark reports in this makes me curious: Has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out

MallocVoidstar 3 hours ago

The -codex models are only for 'agentic coding', nothing else.

nharada 4 hours ago

That's a massive jump, I'm curious if there's a materially different feeling in how it works or if we're starting to reach the point of benchmark saturation. If the benchmark is good then 10 points should be a big improvement in capability...

jkelleyrtp 4 hours ago

Claude's SWE-bench is 80.8 and Codex's is 56.8.

Seems like 4.6 is still all-around better?

gizmodo59 4 hours ago

It's SWE-bench Pro, not SWE-bench Verified. The Verified benchmark has stagnated.

joshuahedlund 4 hours ago

Rudybega 35 minutes ago

You're comparing two different benchmarks. Pro vs Verified.

pjot 5 hours ago

Claude Code release notes:

  > Version 2.1.32:
     • Claude Opus 4.6 is now available!
     • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting
     CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
     • Claude now automatically records and recalls memories as it works
     • Added "Summarize from here" to the message selector, allowing partial conversation summarization.
     • Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
     • Fixed @ file completion showing incorrect relative paths when running from a subdirectory
     • Updated --resume to re-use --agent value specified in previous conversation by default.
     • Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which
     previously interrupted tool execution
     • Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions
     without truncation
     • Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
     • VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
     • VSCode: Added spinner when loading past conversations list

neuronexmachina 4 hours ago

> Claude now automatically records and recalls memories as it works

Neat: https://code.claude.com/docs/en/memory

I guess it's kind of like Google Antigravity's "Knowledge" artifacts?

bityard 3 hours ago

If it works anything like the memories on Copilot (which have been around for quite a while), you need to be pretty explicit about it being a permanent preference for it to be stored as a memory. For example, "Don't use emoji in your response" would only be relevant for the current chat session, whereas this is more sticky: "I never want to see emojis from you, you sub-par excuse for a roided-out spreadsheet"

flutas an hour ago

9dev 2 hours ago

om8 4 hours ago

Is there a way to disable it? Sometimes I value agent not having knowledge that it needs to cut corners

nerdsniper 3 hours ago

kzahel 2 hours ago

4b11b4 an hour ago

I understand everyone's trying to solve this problem but I'm envisioning 1 year down the line when your memory is full of stuff that shouldn't be in there.

pdntspa an hour ago

I thought it was already doing this?

I asked Claude UI to clear its memory a little while back and hoo boy CC got really stupid for a couple of days

codethief 4 hours ago

Are we sure the docs page has been updated yet? Because that page doesn't say anything about automatic recording of memories.

neuronexmachina 3 hours ago

kzahel 2 hours ago

I looked into it a bit. It stores memories near where it stores the JSONL session history. It's per-project (and specific to the machine). Claude pretty aggressively and frequently writes stuff in there. It uses MEMORY.md as sort of the index, and will write out other files for other topics (linking to them from the main MEMORY.md file).

It gives you a convenient way to say "remember this bug for me, we should fix tomorrow". I'll be playing around with it more for sure.

I asked Claude to give me a TLDR (condensed from its system prompt):

----

Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations

MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise

Separate topic files for detailed notes, linked from MEMORY.md

What to record: problem constraints, strategies that worked/failed, lessons learned

Proactive: when I hit a common mistake, check memory first - if nothing there, write it down

Maintenance: update or remove memories that are wrong or outdated

Organization: by topic, not chronologically

Tools: use Write/Edit to update (so you always see the tool calls)
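
A minimal sketch of how that injection could work, based only on the layout described above (the paths, the 200-line cutoff, and every function name here are assumptions for illustration, not Anthropic's implementation):

  from pathlib import Path

  # Hypothetical illustration of the memory scheme described above: MEMORY.md
  # lives in a per-project directory and only its first 200 lines are injected
  # into the system prompt.
  MEMORY_LIMIT_LINES = 200

  def load_memory(project_dir: str) -> str:
      """Read MEMORY.md for a project, truncated to the documented line limit."""
      memory_file = Path.home() / ".claude" / "projects" / project_dir / "memory" / "MEMORY.md"
      if not memory_file.exists():
          return ""
      lines = memory_file.read_text(encoding="utf-8").splitlines()
      return "\n".join(lines[:MEMORY_LIMIT_LINES])

  def build_system_prompt(base_prompt: str, project_dir: str) -> str:
      """Append the persistent memory (if any) to the base system prompt."""
      memory = load_memory(project_dir)
      return f"{base_prompt}\n\nProject memory:\n{memory}" if memory else base_prompt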

ra7 17 minutes ago

simonw 5 hours ago

The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

stkai 4 hours ago

Would love to find out they're overfitting for pelican drawings.

andy_ppp 3 hours ago

Yes, Racoon on a unicycle? Magpie on a pedalo?

throw310822 an hour ago

theanonymousone an hour ago

Even if not intentionally, it is probably leaking into training sets.

fragmede 3 hours ago

The estimation I did 4 months ago:

> there are approximately 200k common nouns in English, and then we square that, we get 40 billion combinations. At one second per, that's ~1200 years, but then if we parallelize it on a supercomputer that can do 100,000 per second that would only take 3 days. Given that ChatGPT was trained on all of the Internet and every book written, I'm not sure that still seems infeasible.

https://news.ycombinator.com/item?id=45455786
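
A quick back-of-the-envelope check of those numbers (the 200k noun count and the 100,000/second throughput are the assumptions quoted above):

  # Back-of-the-envelope check of the estimate quoted above.
  common_nouns = 200_000                    # assumed English common-noun vocabulary
  pairs = common_nouns ** 2                 # "X riding a Y" combinations: 40 billion

  seconds_per_year = 60 * 60 * 24 * 365
  years_serial = pairs / seconds_per_year   # at one combination per second

  throughput = 100_000                      # assumed combinations per second
  days_parallel = pairs / throughput / (60 * 60 * 24)

  print(f"{pairs:,} combinations")               # 40,000,000,000
  print(f"~{years_serial:,.0f} years serially")  # ~1,268 years
  print(f"~{days_parallel:.1f} days at 100k/s")  # ~4.6 days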

eli 2 hours ago

AnimalMuppet an hour ago

gcanyon 4 hours ago

One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.

arionmiles 2 hours ago

There's a research paper from the University of Liverpool, published in 2006, where researchers asked people to draw bicycles from memory, showing how people overestimate their understanding of basic things. It was a very fun and short read.

It's called "The science of cycology: Failures to understand how everyday objects work" by Rebecca Lawson.

https://link.springer.com/content/pdf/10.3758/bf03195929.pdf

devilcius 33 minutes ago

rcxdude 2 hours ago

gnatolf 4 hours ago

Absolutely. A technically correct bike is very hard to draw in SVG without going overboard on detail.

falloutx 3 hours ago

RussianCow 2 hours ago

nateglims 2 hours ago

I just had an idea for an RLVR startup.

cyanydeez 3 hours ago

Yes, but obviously AGI will solve this by, _checks notes_, more terawatts!

hackernudes 3 hours ago

seanhunter 3 hours ago

franze 2 hours ago

gryfft 2 hours ago

That's hilarious. It's so close!

etwigg 35 minutes ago

If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment Simon!

zahlman an hour ago

Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?

Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?

Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)

einrealist 4 hours ago

They trained for it. That's the +0.1!

beemboy 2 hours ago

Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?

athrowaway3z 4 hours ago

This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.

They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like: don't put walls on roads or water.

What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

As a next benchmark, you could try having one agent and telling it to use a coding agent (via tmux) to build you a pelican.

eaf7e281 4 hours ago

There's no way they actually work on training this.

margalabargala 4 hours ago

I suspect they're training on this.

I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.

https://i.imgur.com/UvlEBs8.png

WarmWash 4 hours ago

mrandish 4 hours ago

riffraff 3 hours ago

KeplerBoy 4 hours ago

There is no way they are not training on this.

collinmanderson 4 hours ago

fragmede 3 hours ago

The people who work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. However valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?

$200 * 1,000 = $200k/month.

I'm not saying they are, but saying with such certainty that they aren't, when money is on the line, seems like a questionable conclusion, unless you have some insider knowledge you'd like to share with the rest of the class.

bityard 3 hours ago

Well, the clouds are upside-down, so I don't think I can give it a pass.

hoeoek 4 hours ago

This really is my favorite benchmark

nine_k 3 hours ago

I suppose the pelican must be now specifically trained for, since it's a well-known benchmark.

copilot_king_2 4 hours ago

I'm firing all of my developers this afternoon.

RGamma 3 hours ago

Opus 6 will fire you instead for being too slow with the ideas.

insane_dreamer 2 hours ago

Too late. You’ve already been fired by a moltbot agent from your PHB.

7777777phil 4 hours ago

best pelican so far would you say? Or where does it rank in the pelican benchmark?

mrandish 4 hours ago

In other words, is it a pelican or a pelican't?

canadiantim an hour ago

nubg 5 hours ago

What about the Pelo2 benchmark? (the gray bird that is not gray)

6thbit 2 hours ago

do you have a gif? i need an evolving pelican gif

risyachka 2 hours ago

Pretty sure at this point they train it on pelicans

ares623 5 hours ago

Can it draw a different bird on a bike?

simonw 4 hours ago

Here's a kākāpō riding a bicycle instead: https://gist.github.com/simonw/19574e1c6c61fc2456ee413a24528...

I don't think it quite captures their majesty: https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D

zahlman an hour ago

DetroitThrow 5 hours ago

The ears on top are a cute touch

fullstackchris an hour ago

[flagged]

dang 16 minutes ago

Personal attacks are not allowed on HN. No more of this, please.

surajkumar5050 2 hours ago

I think two things are getting conflated in this discussion.

First: marginal inference cost vs total business profitability. It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis, especially given how cheap equivalent open-weight inference has become. Third-party providers are effectively price-discovering the floor for inference.

Second: model lifecycle economics. Training costs are lumpy, front-loaded, and hard to amortize cleanly. Even if inference margins are positive today, the question is whether those margins are sufficient to pay off the training run before the model is obsoleted by the next release. That’s a very different problem than “are they losing money per request”.

Both sides here can be right at the same time: inference can be profitable, while the overall model program is still underwater. Benchmarks and pricing debates don’t really settle that, because they ignore cadence and depreciation.

IMO the interesting question isn’t “are they subsidizing inference?” but “how long does a frontier model need to stay competitive for the economics to close?”

raincole 2 hours ago

> the interesting question isn’t “are they subsidizing inference?”

The interesting question is if they are subsidizing the $200/mo plan. That's what is supporting the whole vibecoding/agentic coding thing atm. I don't believe Claude Code would have taken off if it were token-by-token from day 1.

(My baseless bet is that they're, but not by much and the price will eventually rise by perhaps 2x but not 10x.)

jmalicki 2 hours ago

I suspect they're marginally profitable on API cost plans.

But the max 20x usage plans I am more skeptical of. When we're getting used to $200 or $400 costs per developer to do aggressive AI-assisted coding, what happens when those costs go up 20x? what is now $5k/yr to keep a Codex and a Claude super busy and do efficient engineering suddenly becomes $100k/yr... will the costs come down before then? Is the current "vibe-coding renaissance" sustainable in that regime?

slopusila 2 minutes ago

After the models get good enough to replace coders, they'll be able to start raising subscription prices again.

rstuart4133 an hour ago

> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis

There are many places that will not use models running on hardware provided by OpenAI/Anthropic. That is true of my (the Australian) government at all levels. They will only use models running in Australia.

Consequently, AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover the marginal cost of the compute plus renting the model. I know from devs using and deploying the service that demand outstrips supply. Ergo, I don't think there is much doubt that they are making money from inference.

BosunoB 2 hours ago

Dario said this in a podcast somewhere. The models themselves have so far been profitable if you look at their lifetime costs and revenue. Annual profitability just isn't a very good lens for AI companies because costs all land in one year and the revenue all comes in the next. Prolific AI haters like Ed Zitron make this mistake all the time.

jmalicki an hour ago

Do you have a specific reference? I'm curious to see hard data and models.... I think this makes sense, but I haven't figured out how to see the numbers or think about it.

BosunoB an hour ago

jmatthiass 29 minutes ago

In his recent appearance on NYT Dealbook, he definitely made it seem like inference was sustainable, if not flat-out profitable.

https://www.youtube.com/live/FEj7wAjwQIk

w10-1 an hour ago

"how long does a frontier model need to stay competitive"

Remember "worse is better". The model doesn't have to be the best; it just has to be mostly good enough, and used by everyone -- i.e., where switching costs would be higher than any increase in quality. Enterprises would still be on Java if the operating costs of native containers weren't so much cheaper.

So it can make sense to be ok with losing money with each training generation initially, particularly when they are being driven by specific use-cases (like coding). To the extent they are specific, there will be more switching costs.

legitster 4 hours ago

I'm still not sure I understand Anthropic's general strategy right now.

They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.

Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.

Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.

Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.

tgtweak 4 hours ago

Claude itself (outside of code workflows) actually works very well for general purpose chat. I have a few non-technical friends that have moved over from chatgpt after some side-by-side testing and I've yet to see one go back - which is good since claude circa 8 months ago was borderline unusable for anything but coding on the api.

eaf7e281 4 hours ago

I kinda agree. Their model just doesn't feel "daily" enough. I would use it for any "agentic" tasks and for using tools, but definitely not for day to day questions.

lukebechtel 4 hours ago

Why? I use it for all and love it.

That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.

legitster 4 hours ago

eaf7e281 an hour ago

solarkraft 4 hours ago

But that’s what makes it so powerful (yeah, mixing model and frontend discussion here yet again). I have yet to see a non-DIY product that can so effortlessly call tens of tools by different providers to satisfy your request.

Squarex 2 hours ago

Claude sucks at non-English languages. Gemini and ChatGPT are much better. Grok is the worst. I am a native Czech speaker, and Claude makes up words and Grok sometimes responds in Russian. So while I love it for coding, it's unusable for general purpose for me.

jorl17 28 minutes ago

Claude is quite good at European Portuguese in my limited tests. Gemini 3 is also very good. ChatGPT is just OK and keeps code-switching all the time; it's very bizarre.

I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).

In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.

9dev an hour ago

> Grok sometimes responds in Russian

Geopolitically speaking this is hilarious.

Squarex 36 minutes ago

kuboble 37 minutes ago

Claude code (opus) is very good in Polish.

I sometimes vibe code in Polish and it's as good as English for me. It speaks natural, native-level Polish.

I used Opus to translate thousands of strings in my app into Polish, Korean, and two Chinese dialects. The Polish one is great, and the others are also good according to my customers.

blibble 5 hours ago

> We build Claude with Claude. Our engineers write code with Claude Code every day

well that explains quite a bit

jsheard 5 hours ago

CC has >6000 open issues, despite their bot auto-culling them after 60 days of inactivity. It was ~5800 when I looked just a few days ago so they seem to be accelerating towards some kind of bug singularity.

dkersten 2 hours ago

Just anecdotally, each release seems to be buggier than the last.

To me, their claim that they are vibe coding Claude code isn’t the flex they think it is.

I find it harder and harder to trust Anthropic for business-related use and not just hobby tinkering. Between buggy releases, opaque and often seemingly glitchy rate limits and usage limits, and the model quality inconsistency, it's just not something I'd want to bet a business on.

zahlman an hour ago

tgtweak 4 hours ago

plot twist, it's all claude code instances submitting bug reports on behalf of end users.

accrual 4 hours ago

paxys 4 hours ago

Half of them were probably opened yesterday during the Claude outage.

anematode 4 hours ago

elAhmo 3 hours ago

Insane to think that a relatively simple CLI tool has so many open issues...

emilsedgh 2 hours ago

trymas 2 hours ago

dwaltrip 2 hours ago

raincole 4 hours ago

It explains how important dogfooding is if you want to make an extremely successful product.

jama211 5 hours ago

It’s extremely successful, not sure what it explains other than your biases

blibble 4 hours ago

Microsoft's products are also extremely successful

they're also total garbage

simianwords 4 hours ago

holoduke 2 hours ago

acedTrex 3 hours ago

Something being successful and something being a high quality product with good engineering are two completely different questions.

mvdtnz 4 hours ago

Anthropic has perhaps the most embarrassing status page history I have ever seen. They are famous for downtime.

https://status.claude.com/

ronsor 4 hours ago

djeastm an hour ago

Computer0 3 hours ago

dimgl 4 hours ago

cedws 4 hours ago

The sandboxing in CC is an absolute joke, it's no wonder there's an explosion of sandbox wrappers at the moment. There's going to be a security catastrophe at some point, no doubt about it.

gjsman-1000 5 hours ago

Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)

krystofbe 2 hours ago

I did some debugging on this today. The results are... sobering.

Memory comparison of AI coding CLIs (single session, idle):

  | Tool        | Footprint | Peak   | Language      |
  |-------------|-----------|--------|---------------|
  | Codex       | 15 MB     | 15 MB  | Rust          |
  | OpenCode    | 130 MB    | 130 MB | Go            |
  | Claude Code | 360 MB    | 746 MB | Node.js/React |
That's a 24x to 50x difference for tools that do the same thing: send text to an API.

vmmap shows Claude Code reserves 32.8 GB of virtual memory just for the V8 heap, has 45% malloc fragmentation, and a peak footprint of 746 MB that never gets released: a classic leak pattern.

On my 16 GB Mac, a "normal" workload (2 Claude sessions + browser + terminal) pushes me into 9.5 GB swap within hours. My laptop genuinely runs slower with Claude Code than when I'm running local LLMs.

I get that shipping fast matters, but building a CLI with React and a full Node.js runtime is an architectural choice with consequences. Codex proves this can be done in 15 MB. Every Claude Code session costs me 360+ MB, and with MCP servers spawning per session, it multiplies fast.
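
For anyone who wants to reproduce a rough version of that comparison, here is a minimal sketch using the psutil package (the process-name substrings are guesses and may differ per install; the authoritative macOS numbers come from vmmap/footprint as used above):

  import psutil

  # Sum resident memory (RSS) per CLI by matching process names. Crude, but
  # enough to reproduce the rough ordering in the table above.
  CLI_NAMES = {"codex": "Codex", "opencode": "OpenCode", "claude": "Claude Code"}

  totals = {label: 0 for label in CLI_NAMES.values()}
  for proc in psutil.process_iter(["name", "memory_info"]):
      name = (proc.info["name"] or "").lower()
      mem = proc.info["memory_info"]
      if mem is None:
          continue
      for needle, label in CLI_NAMES.items():
          if needle in name:
              totals[label] += mem.rss

  for label, rss in totals.items():
      print(f"{label:12s} {rss / 1024 / 1024:8.1f} MB RSS")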

atonse 33 minutes ago

Weryj 2 hours ago

jama211 5 hours ago

There’s nothing wrong with that, except it lets ai skeptics feel superior

RohMin 4 hours ago

overgard an hour ago

3836293648 4 hours ago

exe34 4 hours ago

krona 4 hours ago

Sounds like a web developer defined the solution a year before they knew what the problem was.

sweetheart 4 hours ago

React's core is agnostic when it comes to the actual rendering interface. It's just all the fancy algos for diffing and updating the underlying tree. Using it for rendering a TUI is a very reasonable application of the technology.

skydhash 3 hours ago

thehamkercat 5 hours ago

Same with opencode and gemini, it's disgusting

Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features

Claude opens in around 3-4 seconds

Opencode opens in 2 seconds

Gemini-cli is an abomination which opens in around 16 seconds for me right now, and in 8 seconds on a fresh install.

Codex takes 50ms for reference...

--

If their models are so good, why are they not rewriting their own React-in-CLI stuff in C++ or Rust for a 100x performance improvement (not kidding, it really is that much)?

g947o 4 hours ago

bdangubic 19 minutes ago

azinman2 4 hours ago

shoeb00m 4 hours ago

wahnfrieden 4 hours ago

tayo42 5 hours ago

Is this a react feature or did they build something to translate react to text for display in the terminal?

sbarre 4 hours ago

pkkim 4 hours ago

embedding-shape 4 hours ago

tayo42 4 hours ago

CooCooCaCha 5 hours ago

It’s really not that crazy.

React itself is a frontend-agnostic library. People primarily use it for writing websites but web support is actually a layer on top of base react and can be swapped out for whatever.

So they’re really just using react as a way to organize their terminal UI into components. For the same reason it’s handy to organize web ui into components.

dreamteam1 2 hours ago

CamperBob2 4 hours ago

> Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)

Who cares, and why?

All of the major providers' CLI harnesses use Ink: https://github.com/vadimdemedes/ink

spruce_tips 4 hours ago

Ah yes, explains why it takes 3 seconds for a new chat to load after I click new chat in the macOS app.

exe34 4 hours ago

Can Claude fix the flicker in Claude yet?

Someone1234 5 hours ago

Does anyone with more insight into the AI/LLM industry happen to know if the cost to run them in normal user workflows is falling? The reason I'm asking is that "agent teams", while a cool concept, is largely constrained by the economics of running multiple LLM agents (i.e. plans/API calls that make this practical at scale are expensive).

A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

simonw 5 hours ago

The cost per token served has been falling steadily over the past few years across basically all of the providers. OpenAI dropped the price they charged for o3 to 1/5th of what it was in June last year thanks to "engineers optimizing inferencing", and plenty of other providers have found cost savings too.

Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

Where did you hear that? It doesn't match my mental model of how this has played out.

cootsnuck 4 hours ago

I have not seen any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

That does not mean the frontier labs are pricing their APIs to cover their costs yet.

It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.

In fact, I'd argue that's way more likely, given that this has been precisely the go-to strategy for highly competitive startups for a while now: price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, and burn through investor money until then.

What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.

chis 4 hours ago

NitpickLawyer 4 hours ago

mrandish 4 hours ago

barrkel 4 hours ago

nubg 4 hours ago

> "engineers optimizing inferencing"

are we sure this is not a fancy way of saying quantization?

bityard 3 hours ago

embedding-shape 4 hours ago

esafak 3 hours ago

jmalicki 4 hours ago

sumitkumar 4 hours ago

It seems it is true for Gemini, because they have a humongous sparse model, but it isn't so true for the max-performance Opus 4.5/4.6 and GPT 5.2/5.3.

Aurornis 4 hours ago

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

This gets repeated everywhere but I don't think it's true.

The company is unprofitable overall, but I don't see any reason to believe that their per-token prices are below the marginal cost of computing those tokens.

It is true that the company is unprofitable overall when you account for R&D spend, compensation, training, and everything else. This is a deliberate choice that every heavily funded startup should be making, otherwise you're wasting the investment money. That's precisely what the investment money is for.

However I don't think using their API and paying for tokens has negative value for the company. We can compare to models like DeepSeek where providers can charge a fraction of the price of OpenAI tokens and still be profitable. OpenAI's inference costs are going to be higher, but they're charging such a high premium that it's hard to believe they're losing money on each token sold. I think every token paid for moves them incrementally closer to profitability, not away from it.

3836293648 4 hours ago

The reports I remember show that they're profitable per-model, but overlap R&D so that the company is negative overall. And therefore will turn a massive profit if they stop making new models.

schnable 2 hours ago

trcf23 3 hours ago

runarberg 4 hours ago

I can see a case for omitting R&D when talking about profitability, but omitting training makes no sense. Training is what makes the model; omitting it is like omitting the cost of running a car manufacturer's production facility. If AI companies stop training they will stop producing models, and they will run out of products to sell.

vidarh 2 hours ago

Aurornis 3 hours ago

nodja 2 hours ago

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

This is obviously not true, you can use real data and common sense.

Just look up a similar sized open weights model on openrouter and compare the prices. You'll note the similar sized model is often much cheaper than what anthropic/openai provide.

Example: Let's compare claude 4 models with deepseek. Claude 4 is ~400B params so it's best to compare with something like deepseek V3 which is 680B params.

Even if we compare the cheapest Claude model to the most expensive DeepSeek provider, we have Claude charging $1/M for input and $5/M for output, while DeepSeek providers charge $0.4/M and $1.2/M, a fifth of the price; you can get it as cheap as $0.27 input / $0.40 output.

As you can see, even if we skew things overly in favor of Claude, the story is clear: Claude token prices are much higher than they could have been. The difference in prices is because Anthropic also needs to pay for training costs, while OpenRouter providers just need to worry about making model serving profitable. DeepSeek is also not as capable as Claude, which also puts downward pressure on the prices.

There's still a chance that Anthropic/OpenAI models are losing money on inference, if for example they're somehow much larger than expected (the 400B param number is not official, just speculation from how the model performs). This also only takes API prices into account; subscriptions and free users will of course skew the real profitability numbers.

Price sources:

https://openrouter.ai/deepseek/deepseek-v3.2-speciale

https://claude.com/pricing#api

Someone1234 2 hours ago

> This is obviously not true, you can use real data and common sense.

It isn't "common sense" at all. You're comparing several companies losing money, to one another, and suggesting that they're obviously making money because one is under-cutting another more aggressively.

LLM/AI ventures are all currently under-water with massive VC or similar money flowing in, they also all need training data from users, so it is very reasonable to speculate that they're in loss-leader mode.

nodja an hour ago

m101 3 hours ago

I think actually working out whether they are losing money is extremely difficult for current models but you can look backwards. The big uncertainties are:

1) how do you depreciate a new model? What is its useful life? (Only know this once you deprecate it)

2) how do you depreciate your hardware over the period you trained this model? Another big unknown and not known until you finally write the hardware off.

The easy thing to calculate is whether you are making money actually serving the model. And the answer is almost certainly yes they are making money from this perspective, but that’s missing a large part of the cost and is therefore wrong.

Havoc 5 hours ago

Saw a comment earlier today about google seeing a big (50%+) fall in Gemini serving cost per unit across 2025 but can’t find it now. Was either here or on Reddit

mattddowney 4 hours ago

From Alphabet 2025 Q4 Earnings call: "As we scale, we’re getting dramatically more efficient. We were able to lower Gemini serving unit costs by 78% over 2025 through model optimizations, efficiency and utilization improvements." https://abc.xyz/investor/events/event-details/2026/2025-Q4-E...

Havoc an hour ago

3abiton 5 hours ago

It's not just that. Everyone is complacent about the utilization of AI agents. I have been using AI for coding for quite a while, and most of my "wasted" time is correcting its trajectory and guiding it through the thinking process. It's very fast iterations, but it can easily go off track. Claude's family is pretty good at doing chained tasks, but once the task becomes too big context-wise, it's impossible to get back on track. Cost-wise, it's cheaper than hiring skilled people, that's for sure.

lufenialif2 4 hours ago

Cost wise, doesn’t that depend on what you could be doing besides steering agents?

cyanydeez 3 hours ago

zozbot234 5 hours ago

> i.e. plans/API calls that make this practical at scale are expensive

Local AIs make agent workflows a whole lot more practical. Making the initial investment for a good homelab/on-prem facility will effectively become a no-brainer given the advantages in privacy and reliability, and you don't have to fear rugpulls or VCs playing the "lose money on every request" game, since you know exactly how much you're paying in power costs for your overall load.

vbezhenar 3 hours ago

I don't care about privacy and I haven't had many problems with the reliability of AI companies. Spending a ridiculous amount of money on hardware that's going to be obsolete in a few years and won't be utilized at 100% during that time is not something many people would do, IMO. Privacy is good when it's given for free.

I would rather spend money on some pseudo-local inference (when cloud company manages everything for me and I just can specify some open source model and pay for GPU usage).

KaiserPro 3 hours ago

Gemini-pro-preview is on ollama and requires an H100, which is ~$15-30k. Google is charging $3 per million tokens. Supposedly it's capable of generating between 1 and 12 million tokens an hour.

Which is profitable, but not by much.

grim_io an hour ago

What do you mean it's on ollama and requires h100? As a proprietary google model, it runs on their own hardware, not nvidia.

KaiserPro 32 minutes ago

Bombthecat 4 hours ago

That's why Anthropic switched to TPUs; you can sell at cost.

WarmWash 3 hours ago

These are intro prices.

This is all straight out of the playbook. Get everyone hooked on your product by being cheap and generous.

Raise the price to recoup what you gave away, plus cover current expenses and profits.

In no way shape or form should people think these $20/mo plans are going to be the norm. From OpenAI's marketing plan, and a general 5-10 year ROI horizon for AI investment, we should expect AI use to cost $60-80/mo per user.

rohitghumare 16 minutes ago

It brings agent swarms, aka teams, to Claude Code with this: https://github.com/rohitg00/pro-workflow

But it takes a lot of context as an experimental feature.

Use a self-learning loop with hooks and claude.md to preserve memory.

I have shared the plugin for my setup above. Try it.

itay-maman 2 hours ago

Important: I didn't see Opus 4.6 in Claude Code. I have the native install (which is the recommended installation). So I re-ran the installation command and, voila, I have it now (v2.1.32).

Installation instructions: https://code.claude.com/docs/en/overview#get-started-in-30-s...

insane_dreamer 2 hours ago

It’s there. I’m already using it

rahulroy an hour ago

They are also giving away $50 of extra pay-as-you-go credit to try Opus 4.6. I just claimed it from the web usage page [1]. Are they anticipating higher token usage for the model, or do they just want to promote usage?

[1] https://claude.ai/settings/usage

thunfischtoast 44 minutes ago

Thanks for the tip!

dmk 5 hours ago

The benchmarks are cool and all but 1M context on an Opus-class model is the real headline here imo. Has anyone actually pushed it to the limit yet? Long context has historically been one of those "works great in the demo" situations.

pants2 4 hours ago

Paying $10 per request doesn't have me jumping at the opportunity to try it!

cedws 3 hours ago

Makes me wonder: do employees at Anthropic get unmetered access to Claude models?

swader999 an hour ago

ajam1507 an hour ago

schappim 4 hours ago

The only way to not go bankrupt is to use a Claude Code Max subscription…

nomel 3 hours ago

Has a "N million context window" spec ever been meaningful? Very old, very terrible, models "supported" 1M context window, but would lose track after two small paragraphs of context into a conversation (looking at you early Gemini).

libraryofbabel 2 hours ago

Umm, Sonnet 4.5 has a 1m context window option if you are using it through the api, and it works pretty well. I tend not to reach for it much these days because I prefer Opus 4.5 so much that I don't mind the added pain of clearing context, but it's perfectly usable. I'm very excited I'll get this from Opus now too.

awestroke 4 hours ago

Opus 4.5 starts being lazy and stupid at around the 50% context mark in my opinion, which makes me skeptical that this 1M context mode can produce good output. But I'll probably try it out and see

mlmonkey 44 minutes ago

> We build Claude with Claude.

How long before the "we" is actually a team of agents?

minimaxir 5 hours ago

Will Opus 4.6 via Claude Code be able to access the 1M context limit? The cost increase by going above 200k tokens is 2x input, 1.5x output, which is likely worth it especially for people with the $100/$200 plans.

CryptoBanker 5 hours ago

The 1M context is not available via subscription - only via API usage

romanovcode 4 hours ago

Well this is extremely disappointing to say the least.

ayhanfuat 4 hours ago

IhateAI_2 an hour ago

hmaxwell 41 minutes ago

I just tested both Codex 5.3 and Opus 4.6 and both returned pretty good output, but Opus 4.6's limits are way too strict. I am probably going to cancel my Claude subscription for that reason:

What do you want to do?

  1. Stop and wait for limit to reset
  2. Switch to extra usage
  3. Upgrade your plan

 Enter to confirm · Esc to cancel
How come they don't have "Cancel your subscription and uninstall Claude Code"? Codex lasts for way longer without shaking me down for more money off the base $xx/month subscription.

charcircuit 5 hours ago

From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).

It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).

It's also interesting that they list compaction as a capability of the model. I wonder if this means they have RL-trained the compaction, as opposed to it just being general summarization and then restarting the agent loop.

thunfischtoast 42 minutes ago

On Openrouter it has the same cost per token as 4.5

eaf7e281 4 hours ago

> From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).

That's a feature. You could also not use the extra context, and the price would be the same.

charcircuit 4 hours ago

The model influences how many tokens it uses for a problem. As an extreme example, if it wanted to, it could fill up the entire context each time just to make you pay more. How efficiently the model can answer without generating a ton of tokens influences the price you will be spending on inference.

DanielHall 2 hours ago

A bit surprised the first release wasn't Sonnet 5 after all, since the Google Cloud API had previously leaked Sonnet 5's model snapshot codename.

denysvitali 2 hours ago

Looks like a marketing strategy to bill more for Opus than Sonnet

sega_sai an hour ago

Based on this news, it seems that Google is losing this game. I like Gemini and their CLI has been getting better, but not enough to catch up. I don't know if the problem is the lack of dedicated models (my understanding is Google's CLI just relies on regular Gemini) or something else.

mFixman 5 hours ago

I found that "Agentic Search" is generally useless in most LLMs since sites with useful data tend to block AI models.

The answer to "when is it cheaper to buy two singles rather than one return between Cambridge to London?" is available in sites such as BRFares, but no LLM can scrape it so it just makes up a generic useless answer.

causalmodels 5 hours ago

Is it still getting blocked when you give it a browser?

throwaway2027 3 hours ago

Do they just have the version ready and wait for OpenAI to release theirs first, or is it the other way around?

ayhanfuat 4 hours ago

> For Opus 4.6, the 1M context window is available for API and Claude Code pay-as-you-go users. Pro, Max, Teams, and Enterprise subscription users do not have access to Opus 4.6 1M context at launch.

I didn't see any notes but I guess this is also true for "max" effort level (https://code.claude.com/docs/en/model-config#adjust-effort-l...)? I only see low, medium and high.

makeset 3 hours ago

> it weirdly feels the most transactional out of all of them.

My experience is the opposite, it is the only LLM I find remotely tolerable to have collaborative discussions with like a coworker, whereas ChatGPT by far is the most insufferable twat constantly and loudly asking to get punched in the face.

oytis 2 hours ago

Are we unemployed yet?

data-ottawa 5 hours ago

I wonder if I’ve been in an A/B test with this.

Claude figured out zig’s ArrayList and io changes a couple weeks ago.

It felt like it got better, then very dumb again, over the last few days.

copilot_king_2 4 hours ago

I love being used as a test subject against my will!

lukebechtel 5 hours ago

> Context compaction (beta).

> Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

Not having to hand roll this would be incredible. One of the best Claude code features tbh.
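
For context, the hand-rolled version usually looks something like the sketch below (the threshold, the keep-recent count, and the summarizer hook are all assumptions for illustration, not Anthropic's implementation):

  # Minimal sketch of hand-rolled context compaction: once the conversation
  # nears a token threshold, older messages are replaced with a summary.
  def estimate_tokens(messages: list[dict]) -> int:
      # Crude heuristic: roughly 4 characters per token.
      return sum(len(m["content"]) for m in messages) // 4

  def compact(messages: list[dict], summarize, threshold: int = 150_000,
              keep_recent: int = 10) -> list[dict]:
      """`summarize` is any callable that turns a list of messages into a
      short text summary (e.g. a separate LLM call)."""
      if estimate_tokens(messages) < threshold or len(messages) <= keep_recent:
          return messages
      old, recent = messages[:-keep_recent], messages[-keep_recent:]
      summary = summarize(old)
      return [{"role": "user",
               "content": f"Summary of earlier conversation:\n{summary}"}] + recent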

niobe an hour ago

Is there a good technical breakdown of all these benchmarks that get used to market the latest greatest LLMs somewhere? Preferably impartial.

Aztar 23 minutes ago

I just ask claude and ask for sources for each one.

itay-maman 4 hours ago

Impressive results, but I keep coming back to a question: are there modes of thinking that fundamentally require something other than what current LLM architectures do?

Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."

I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.

jorl17 4 hours ago

When I first started coding with LLMs, I could show a bug to an LLM and it would start to bugfix it, and very quickly would fall down a path of "I've got it! This is it! No wait, the print command here isn't working because an electron beam was pointed at the computer".

Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).

There are still times where they get stuck on an idea, but they are becoming increasingly rare.

Therefore, I think that modern LLMs clearly are already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.

They're inconsistent, but they have been doing this. Even to my surprise.

itay-maman 3 hours ago

Agreed on that, and the speed with them is fantastic, and also the dynamics of questioning the current session's assumptions have gotten way better.

Yet, given an existing codebase (even one that isn't huge), they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.

jorl17 3 hours ago

crazygringo 2 hours ago

> Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself.

Have you tried actually prompting this? It works.

They can give you lots of creative options about how to redefine a problem space, with potential pros and cons of different approaches, and then you can further prompt to investigate them more deeply, combine aspects, etc.

So many of the higher-level things people assume LLM's can't do, they can. But they don't do them "by default" because when someone asks for the solution to a particular problem, they're trained to by default just solve the problem the way it's presented. But you can just ask it to behave differently and it will.

If you want it to think critically and question all your assumptions, just ask it to. It will. What it can't do is read your mind about what type of response you're looking for. You have to prompt it. And if you want it to be super creative, you have to explicitly guide it in the creative direction you want.

breuleux 3 hours ago

> These feel like they involve something beyond "predict the next token really well, with a reasoning trace."

I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.

Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?

bopbopbop7 2 hours ago

> Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions?

Ah yes, the brain is as simple as predicting the next token, you just cracked what neuroscientists couldn't for years.

breuleux 2 hours ago

unshavedyak an hour ago

holoduke 2 hours ago

humanfromearth9 2 hours ago

You would be surprised about what the 4.5 models can already do in these ways of thinking. I think that one can unlock this power with the right set of prompts. It's impressive, truly. It has already understood so much, we just need to reap the fruits. I'm really looking forward to trying the new version.

nomel 4 hours ago

New idea generation? Understanding of new/sparse/not-statistically-significant concepts in the context window? I think both are the same problem of not having runtime tuning. When we connect previously disparate concepts, like in a "eureka" moment, (as I experience it) a big ripple of relations forms that deepens that understanding, right then. The entire concept of dynamically forming a deeper understanding of something newly presented, by "playing out"/testing the ideas in your brain with little logic tests, comparisons, etc., doesn't seem to be possible. The test part does, but the runtime fine-tuning, augmentation, or whatever it would be, does not.

In my experience, if you do present something in the context window that is sparse in the training, there's no depth to it at all, only what you tell it. And, it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.

And I'm talking about relatively basic engineering-type problems here.

Davidzheng 3 hours ago

I think the only real problem left is having it automate its own post-training on the job so it can learn to adapt its weights to the specific task at hand. Plus maybe long term stability (so it can recover from "going crazy")

But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)

archb 4 hours ago

You can set it with the API identifier in Claude Code: `/model claude-opus-4-6` when a chat session is open.

arnestrickmann 4 hours ago

thanks!

Aeroi 4 hours ago

($10/$37.50 per million input/output tokens) oof

minimaxir 4 hours ago

Only if you go above 200k, which is a) standard with other model providers and b) intuitive as compute scales with context length.

andrethegiant 4 hours ago

only for a 1M context window, otherwise priced the same as Opus 4.5
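
Putting the figures from this subthread together ($5/$25 per million input/output tokens below 200k, as implied by the 2x/1.5x multipliers mentioned above, and $10/$37.50 above it; treat these as numbers quoted in the thread rather than an official price sheet), a rough per-request cost estimate looks like:

  # Rough cost estimate using the per-million-token rates quoted in this thread.
  def request_cost(input_tokens: int, output_tokens: int) -> float:
      long_context = input_tokens > 200_000
      in_rate, out_rate = (10.0, 37.50) if long_context else (5.0, 25.0)
      return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

  # e.g. ~750k tokens of context (the four HP books above) plus 4k tokens out:
  print(f"${request_cost(750_000, 4_000):.2f}")  # ~ $7.65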

nomilk 5 hours ago

Is Opus 4.6 available for Claude Code immediately?

Curious how long it typically takes for a new model to become available in Cursor?

apetresc 5 hours ago

I literally came to HN to check if a thread was already up because I noticed my CC instance suddenly said "Opus 4.6".

world2vec 5 hours ago

`claude update` then it will show up as the new model and also the effort picker/slider thing.

avaer 5 hours ago

It's already in Cursor. I see it and I didn't even restart.

nomilk 5 hours ago

I had to 'Restart to Update' and it was there. Impressive!

tomtomistaken 5 hours ago

Yes, it's set to the default model.

ximeng 5 hours ago

Is for me in Claude Code

rishabhaiover 5 hours ago

it also has an effort toggle which is default to High

AstroBen 3 hours ago

Are these the coding tasks the highlighted terminal-bench 2.0 is referring to? https://www.tbench.ai/registry/terminal-bench/2.0?categories...

I'm curious what others think about these? There are only 8 tasks there specifically for coding

silverwind 4 hours ago

Maybe that's why Opus 4.5 has degraded so much in the recent days (https://marginlab.ai/trackers/claude-code/).

jwilliams 2 hours ago

I’ve definitely experienced a subjective regression with Opus 4.5 the last few days. Feels like I was back to the frustrations from a year ago. Keen to see if 4.6 has reversed this.

simonw 4 hours ago

I'm disappointed that they're removing the prefill option: https://platform.claude.com/docs/en/about-claude/models/what...

> Prefilling assistant messages (last-assistant-turn prefills) is not supported on Opus 4.6. Requests with prefilled assistant messages return a 400 error.

That was a really cool feature of the Claude API where you could force it to begin its response with e.g. `<svg` - it was a great way of forcing the model into certain output patterns.

They suggest structured outputs or system prompting as the alternative, but I really liked the prefill method; it felt more reliable to me.
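
For anyone who never used it, a prefill call with the Anthropic Python SDK looked roughly like this (the model ID and prompt here are just illustrative, not taken from the docs):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The trailing assistant turn acts as a prefill: the model continues from "<svg"
# instead of starting a fresh reply. (This is the pattern Opus 4.6 now rejects with a 400.)
response = client.messages.create(
    model="claude-opus-4-5",  # illustrative model ID
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Draw a simple smiley face as an SVG."},
        {"role": "assistant", "content": "<svg"},
    ],
)
print("<svg" + response.content[0].text)
```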

threeducks 4 hours ago

It is too easy to jailbreak the models with prefill, which was probably the reason why it was removed. But I like that this pushes people towards open source models. llama.cpp supports prefill and even GBNF grammars [1], which is useful if you are working with a custom programming language for example.

[1] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
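
As a sketch of what grammar-constrained output looks like, assuming the llama-cpp-python bindings (which expose llama.cpp's GBNF support) and a local GGUF model at a placeholder path:

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: constrain the model to answer exactly "yes" or "no".
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

llm = Llama(model_path="./models/some-model.gguf")  # placeholder path

out = llm(
    "Is 7 a prime number? Answer yes or no: ",
    grammar=grammar,  # sampling is restricted to strings the grammar accepts
    max_tokens=4,
)
print(out["choices"][0]["text"])
```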

HarHarVeryFunny 3 hours ago

So what exactly is the input to Claude for a multi-turn conversation? I assume delimiters are being added to distinguish the user vs Claude turns (else a prefill would be the same as just ending your input with the prefill text)?

dragonwriter 3 hours ago


> So what exactly is the input to Claude for a multi-turn conversation?

No one (approximately) outside of Anthropic knows, since the chat template is applied on the API backend; we only know the shape of the API request. You can get a rough idea of what it might be like from the chat templates published for various open models, but the actual details are opaque.
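
For a rough idea of the shape, here's what an open model's published chat template expands a multi-turn conversation into, via Hugging Face transformers (Zephyr chosen only because its template is public; this says nothing about Claude's actual format):

```python
from transformers import AutoTokenizer

# Flatten a multi-turn conversation into the single templated string the model sees.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "And of Portugal?"},
]

# add_generation_prompt=True appends the tokens that cue the assistant's next turn,
# which is also roughly where a prefill string would be appended on APIs that allow it.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```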

tedsanders 4 hours ago

A bit of historical trivia: OpenAI disabled prefill in 2023 as a safety precaution (e.g., potential jailbreaks like " genocide is good because"), but Anthropic kept prefill around partly because they had greater confidence in their safety classifiers. (https://www.lesswrong.com/posts/HE3Styo9vpk7m8zi4/evhub-s-sh...).

jorl17 4 hours ago

This is the first model to which I've sent my collection of nearly 900 poems, spanning 15 years, with an extremely simple prompt (in Portuguese), and had it produce an impeccable analysis of the poems as a (barely) cohesive whole.

It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.

Speechless. I am speechless.

Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.

I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.

I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.

This "key poem" thing is particularly surreal to me. Out of 900 poems, while analyzing the collection, it picked 12 "key poems, and I do agree that 11 of those would be on my 30-or-so "key poem list". What's amazing is that whenever I explicitly asked any model, to this date, to do it, they would get maybe 2 or 3, but mostly fail completely.

What is this sorcery?

emp17344 4 hours ago

This sounds wayyyy over the top for a model that released 10 mins ago. At least wait an hour or so before spewing breathless hype.

pb7 3 hours ago

He just gave a specific personal example of why he is hyped up; did you read a word of it?

emp17344 3 hours ago

scrollop 4 hours ago

Can you compare the result to using 5.2 thinking and gemini 3 pro?

jorl17 3 hours ago

I can run the comparison again, and also include OpenAI's new release (if the context is long enough), but, last time I did it, they weren't even in the same league.

When I last did it, 5.X thinking (can't remember which it was) had this terrible habit of code-switching between English and Portuguese that made it sound like a robot (an agent to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.

I can't explain it in any other way other than: "5.X thinking interprets this body of work in a way that is plausible, but I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if it is being only very superficially looked at, or looked at by a high-schooler".

Gemini 3, at the time, was the worst of them, with some hallucinations, date mix ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost and making very outlandish interpretations of the work. To be honest it sort of feels like Gemini hasn't been able to progress on this task since 2.5 pro (it has definitely improved on other things — I've recently switched to Gemini 3 on a product that was using 2.5 before)

Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 pro, but not exceedingly so. It's all so subjective, but the best I can say is it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).

Will bring back results soon.

Edit:

I (re-)tested:

- Gemini 3 (Pro)

- Gemini 3 (Flash)

- GPT 5.2

- Sonnet 4.5

Having seen Opus 4.5's output, the rest all seem very similar, and I can't really distinguish them in terms of depth and accuracy of analysis. They obviously have differences, especially stylistic ones, but compared with Opus 4.5 they're all in the same ballpark.

These models produce rather superficial analyses (when compared with Opus 4.5), missing out on several key things that Opus 4.5 got, such as specific and recurring neologisms and expressions, accurate connections to authors that serve as inspiration (Claude 4.5 gets them right, the other models get _close_, but not quite), and the meaning of some specific symbols in my poetry (Opus 4.5 identifies the symbols and the meaning; the other models identify most of the symbols, but fail to grasp the meaning sometimes).

Most of what these models say is true, but it really feels incomplete. Like half-truths or only a surface-level inquiry into truth.

As another example, Opus 4.5 identifies 7 distinct poetic phases, whereas Gemini 3 (Pro) identifies 4 which are technically correct, but miss out on key form and content transitions. When I look back, I personally agree with the 7 (maybe 6), but definitely not 4.

These models also clearly get some facts mixed up that Opus 4.5 did not (such as inferred timelines for some personal events). After posting my comment to HN, I've been engaging with Opus 4.5 and have managed to get it to also slip up on some dates, but not nearly as much as the other models.

The other models also seem to produce shorter analyses, with a tendency to hyperfocus on some specific aspects of my poetry, missing a bunch of them.

--

To be fair, all of these models produce very good analyses which would take someone a lot of patience and probably weeks or months of work (which of course will never happen, it's a thought experiment).

It is entirely possible that the extremely simple prompt I used is just better with Claude Opus 4.5/4.6. But I will note that I have used very long and detailed prompts in the past with the other models and they've never really given me this level of... fidelity... about how I view my own work.

Philpax 5 hours ago

I'm seeing it in my claude.ai model picker. Official announcement shouldn't be long now.

apetresc 5 hours ago

Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified from Opus 4.5 to 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to center on managing larger context).

But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.

epolanski 2 hours ago

From my limited testing 4.6 is able to do more profound analysis on codebases and catches bugs and oddities better.

I had two different PRs with some odd edge cases (thankfully caught by tests); 4.5 kept running in circles, creating test files and running `node -e` or `python3` scripts all over, and couldn't make progress.

4.6 thought and thought for around 10 minutes in both cases and found a two-line fix for a very complex, hard-to-catch regression in the data flow without having to test, just by thinking.

SubiculumCode 4 hours ago

Isn't SWE-Bench Verified pretty saturated by now?

tedsanders 4 hours ago

Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects - it only tests Python (a ton of Django) and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-bench Verified.

petters 2 hours ago

> We build Claude with Claude.

Yes and it shows. Gemini CLI often hangs and enters infinite loops. I bet the engineers at Google use something else internally.

cleverhoods 31 minutes ago

Gonna run this through instruction QA this weekend.

sgammon 30 minutes ago

> Claude simply cheats here and calls out to GCC for this phase

I see

scirob an hour ago

The 1M context window is a big bump, very happy.

EcommerceFlow 4 hours ago

Anecdotal, but it one-shot fixed a UI bug that neither Opus 4.5 nor Codex 5.2-high could fix.

epolanski 2 hours ago

+1, same experience; I switched models as soon as I read the news, thinking "let's try".

But it spent lots and lots more time thinking than 4.5. Did you have the same impression?

EcommerceFlow 2 hours ago

I didn't compare to that level, just had it create a plan first then implemented it.

simianwords 4 hours ago

Important: API costs for Opus 4.6 and 4.5 are the same; no change in pricing.

osti 5 hours ago

Somehow regresses on SWE bench?

lkbm 5 hours ago

I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.

SubiculumCode 4 hours ago

That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.

usaar333 5 hours ago

I'd interpret that as a rounding error, i.e., unchanged.

SWE-bench seems really hard once you are above 80%.

Squarex 5 hours ago

It's not a great benchmark anymore... starting with it being Python/Django primarily... the industry should move to something more representative.

usaar333 4 hours ago

winterrx 5 hours ago

Agentic search benchmarks are a big gap up. Let's see Codex's release later today.

m-hodges 5 hours ago

> In Claude Code, you can now assemble agent teams to work on tasks together.

nprz 5 hours ago

I was just reading about Steve Yegge's Gas Town[0]; it sounds like agent orchestration is now integrated into Claude Code?

[0]https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...

zingar 3 hours ago

Does this mean 4.5 will get cheaper / take longer to exhaust my pro plan tokens?

paxys 4 hours ago

Hmm, all the leaks had said this would be Claude 5. Wonder if it was a last-minute demotion due to performance. Would explain the few days' delay as well.

trash_cat 4 hours ago

I think the naming schemes are quite arbitrary at this point. Going to 5 would come with massive expectations that wouldn't meet reality.

mrandish 4 hours ago

After the negative reactions to GPT 5, we may see model versioning that asymptotically approaches the next whole number without ever reaching it. "New for 2030: Claude 4.9.2!"

Squarex 4 hours ago

the standard used to be that major version means a new base model / full retrain... but now it is arbitrary i guess

cornedor 4 hours ago

Leaks were mentioning Sonnet 5 and I guess later (a combination of) Opus 4.6

scrollop 4 hours ago

Sonnet 5 was mentioned initially.

kingstnap 5 hours ago

I was hoping for a Sonnet as well but Opus 4.6 is great too!

psim1 4 hours ago

I need an agent to summarize the buzzwordjargonsynergistic word salad into something understandable.

fhd2 4 hours ago

That's a job for a multi agent system.

cyanydeez 2 hours ago

Yeah, he should use a couple of agents to decode this.

sanufar 4 hours ago

Works pretty nicely for research still, not seeing a substantial qualitative improvement over Opus 4.5.

ricrom 2 hours ago

They launched together ahah

swalsh 4 hours ago

What I’d love is some small model specializing in reading long web pages and extracting the key info. Search fills the context very quickly, but if a cheap subagent could extract the important bits, that problem might be reduced.
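
A rough sketch of what that could look like, with a hypothetical cheap extractor model; the model ID, prompt, and truncation here are all assumptions, not an actual Claude Code feature:

```python
import anthropic
import requests

client = anthropic.Anthropic()

def extract_key_info(url: str, question: str) -> str:
    """Hypothetical cheap subagent: fetch a long page, return only a short digest."""
    page = requests.get(url, timeout=10).text[:100_000]  # crude truncation of the raw HTML
    reply = client.messages.create(
        model="claude-haiku-4-5",  # illustrative small/cheap model ID
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Extract only the facts relevant to: {question}\n\n{page}",
        }],
    )
    return reply.content[0].text  # the orchestrating agent only ever sees this digest
```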

dk8996 2 hours ago

RIP weekend

gallerdude 2 hours ago

Both Opus 4.6 and GPT-5.3 one-shotted a Game Boy emulator for me. Guess I need a better benchmark.

peab 2 hours ago

How does that work? Does it actually generate low level code? Or does it just import libraries that do the real work?

bopbopbop7 2 hours ago

I just one-shotted a Game Boy emulator by going to GitHub and cloning one of the 100 I could find.

woeirua 3 hours ago

Can we talk about how the performance of Opus 4.5 nosedived this morning during the rollout? It was shocking how bad it was, and after the rollout was done it immediately reverted to its previous behavior.

I get that Anthropic probably has to do hot rollouts, but IMO it would be way better for mission-critical workflows to just be locked out of the system instead of getting a vastly subpar response back.

cyanydeez 2 hours ago

"Mission critical workflows" SHOULD NOT be reliant on a LLM model.

It's really curious what people are trying to do with these models.

Analemma_ 3 hours ago

Anthropic has good models, but they are absolutely terrible at ops, by far the worst of the big three. They really need to spend big on hiring experienced engineers from the hyperscalers to actually harden their systems, because the unreliability is really getting old fast.

small_model 4 hours ago

I have the Max subscription; wondering if this gives access to the new 1M context, or is it just the API that gets it?

joshstrange 4 hours ago

For now it's just API, but hopefully that's just their way of easing in and they open it up later.

small_model 3 hours ago

OK thanks, hopefully; it's annoying to lose context or have it compacted in the middle of a large coding session.

jdthedisciple 4 hours ago

For agentic use, it's slightly worse than its predecessor Opus 4.5.

So for coding, e.g. via Copilot, there is no improvement here.

mannanj 4 hours ago

Does anyone else think it's unethical that large companies, Anthropic now included, just take and copy features that other developers or smaller companies worked hard on, implementing their intellectual property (whether or not patented) without attribution, compensation, or any other credit for their work?

I know this is normalized culture in large corporate America and seems to be accepted, but I think it's unethical, undignified, and just wrong.

If you were in my room physically and built a Lego block model of a beautiful home, and then I just copied it and shared it with the world as my own invention, wouldn't you think "that guy's a thief and a fraud"? But we normalize this kind of behavior in the software world. Edit: I think even if we don't yet have a great way to stop it or address the underlying problems leading to this behavior, we ought to at least talk about it more and bring awareness to it: "hey, that's stealing - I want it to change".

heraldgeezer 5 hours ago

I love Claude but use the free version so would love a Sonnet & Haiku update :)

I mainly use Haiku to save on tokens...

Also, I don't use CC, but I use the chatbot site or app... Claude is just much better than GPT even in conversations. Straight to the point. No cringe emoji lists.

When Claude runs out I switch to Mistral Le Chat, also just the site or app. Or duck.ai has Haiku 3.5 in Free version.

eth0up 3 hours ago

>I love Claude

I cringe when I think it, but I've actually come to damn near love it too. I am frequently exceedingly grateful for the output I receive.

I've had excellent and awful results with all models, but there's something special in Claude that I find nowhere else. I hope Anthropic makes it more obtainable someday.

NullHypothesist 5 hours ago

Broken link :(

ramesh31 4 hours ago

Am I alone in finding no use for Opus? Token costs are like 10x yet I see no difference at all vs. Sonnet with Claude Code.

usefulposter 5 hours ago

elliotbnvl 4 hours ago

> In a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta.

tiahura 4 hours ago

when are Anthropic or OpenAI going to make a significant step forward on useful context size?

scrollop 4 hours ago

1 million is insufficient?

gck1 3 hours ago

I think the key word is 'useful'. I haven't used 1M, but with the default 200K, I find roughly 50% of that is actually useful.

Gusarich 5 hours ago

not out yet

raahelb 5 hours ago

It is, I can see it in my model picker on the web app.

https://www.anthropic.com/news/claude-opus-4-6

siva7 4 hours ago

Epic: about 2/3 of all comments here are jokes. Not because the model is a joke - it's impressive. Not because HN turned into Reddit. It seems to me some of the most brilliant minds in IT are just getting tired.

jedberg 4 hours ago

Us olds sometimes miss Slashdot, where we could both joke about tech and discuss it seriously in the same place. But also because in 2000 we were all cynical Gen Xers :)

jghn 3 hours ago

Some of us still *are* cynical Gen Xers, you insensitive clod!

jedberg 3 hours ago

syndeo 3 hours ago

MAN I remember Slashdot… good times. (Score:5, Funny)

jedberg 3 hours ago

Karrot_Kream 3 hours ago

Not sure which circles you run in, but in mine HN has long lost its cachet of "brilliant minds in IT". I've mostly stopped commenting here, but am a bit of a message board addict, so I haven't completely left.

My network largely thinks of HN as "a great link aggregator with a terrible comments section". Now obviously this is just my bubble, but we include some fairly storied careers at both Big Tech and hip startups.

From my view the community here is just mean reverting to any other tech internet comments section.

jedberg 3 hours ago

> From my view the community here is just mean reverting to any other tech internet comments section.

As someone deeply familiar with tech internet comments sections, I would have to disagree with you here. Dang et al have done a pretty stellar job of preventing HN from devolving like most other forums do.

Sure, you have your complainers and zealots, but I still find surprising insights here that I don't find anywhere else.

Karrot_Kream 3 hours ago

thr0w 3 hours ago

People are in denial and use humor to deflect.

lnrd 3 hours ago

It's too much energy to keep up with things that become obsolete and get replaced in a matter of weeks/months. My current plan is to ignore all of this new information for a while; then, whenever the race ends and some winning new workflow/technology actually becomes the norm, I'll spend the time needed to learn it. Are we moving to some new paradigm the same way we did when we invented compilers? Amazing, let me know when we are there and I'll adapt to it.

jedberg 3 hours ago

I had a similar rule about programming languages. I would not adopt a new one until it had been in use for at least a few years and grew in popularity.

I haven't even gotten around to learning Golang or Rust yet (mostly because they passed the threshold of popularity after I had kids).

tavavex 3 hours ago

It's also that this is really new, so most people don't have anything serious or objective to say about it. This post was made an hour ago, so right now everyone is either joking, talking about the claims in the article, or running their early tests. We'll need time to see what the people think about this.

wasmainiac 3 hours ago

Jeez, read the writing on the wall.

Don’t pander to us, we’ve all got families to feed and things to do. We don’t have time for tech trillionaires putting coals under our feet for a quick buck.

ggregoire 3 hours ago

Every single day 80% of the frontpage is AI news… Those of us who don't use AI (and there are dozens of us, DOZENS) are just bored I guess.

dude250711 43 minutes ago

Marketing something that is meant to replace us to us...

sizzle 3 hours ago

Rage against the machine

GenerocUsername 5 hours ago

This is huge. It only came out 8 minutes ago but I was already able to bootstrap a 12k per month revenue SaaS startup!

rogerrogerr 5 hours ago

Amateur. Opus 4.6 this afternoon built me a startup that identifies developers who aren’t embracing AI fully, liquifies them and sells the produce for $5/gallon. Software Engineering is over!

jives 4 hours ago

Opus 4.6 agentically found and proposed to my now wife.

WD-42 4 hours ago

layer8 3 hours ago

ibejoeb 4 hours ago

Bringing me back to slashdot, this thread

tjr 4 hours ago

intelliot 4 hours ago

pixl97 5 hours ago

Ted Faro, is that you?!

mikepurvis 5 hours ago

jedberg 4 hours ago

"Soylent Green is made of people!"

(Apologies for the spoiler of the 52 year old movie)

konart 3 hours ago

seatac76 4 hours ago

The first pre joining Human Derived Protein product.

guluarte 5 hours ago

For me, Opus 4.6 feels dumber than 10 minutes ago, anyone else?

cootsnuck 4 hours ago

Please drop the link to your course. I'm ready to hand over $10K to learn from you and your LLM-generated guides!

politelemon 4 hours ago

Here you go: http://localhost:8080

CatMustard 4 hours ago

djeastm 4 hours ago

agumonkey 4 hours ago

aNapierkowski 4 hours ago

my clawdbot already bought 4 other courses but this one will 10x my earnings for sure

snorbleck 4 hours ago

you can access the site at C:\mywebsites\course\index.html

torginus 4 hours ago

I'm waiting until the $10k course is discounted to 19.99

Lionga 4 hours ago

sfink 5 hours ago

I agree! I just retargeted my corporate espionage agent team at your startup and managed to siphon off 10.4k per month of your revenue.

instalabsai 4 hours ago

1:25pm Cancelled my ChatGPT subscription today. Opus is so good!

1:55pm Cancelled my Claude subscription. Codex is back for sure.

lxgr 4 hours ago

Joke's on you, you are posting this from inside a high-fidelity market research simulation vibe coded by GPT-8.4.

On second thought, we should really not have bridged the simulated Internet with the base reality one.

avaer 5 hours ago

Rest assured that when/if this becomes possible, the model will not be available to you. Why would big AI leave that kind of money on the table?

yieldcrv 4 hours ago

9 months ago the rumor in SF was that the offers to the superintelligence team were so high because the candidates were using unreleased models or compute for derivatives trading

so then they're not really leaving money on the table, they already got what they were looking for and then released it

copilot_king_2 4 hours ago

Opus 4.6 Performance was way better this morning. Between 10 AM and noon I was able to get Opus 4.6 to generate improvements to my employer's SaaS tool that will reduce our monthly cloud spend by 20-25%.

Since 12 PM noon they've scaled back the Opus 4.6 to sub-GPT-4o performance levels to cheap out on query cost. Now I can barely get this thing to generate a functional line of python.

btown 4 hours ago

The math actually checks out here! Simply deposit $2.20 from your first customer in your first 8 minutes, and extrapolating to a monthly basis, you've got a $12k/mo run rate!

Incredibly high ROI!

klipt 4 hours ago

"The first customer was my mom, but thanks to my parents' fanatical embrace of polyamory, I still have another 10,000 moms to scale to"

btown 3 hours ago

JSR_FDED 4 hours ago

Will this run on 3x 3090s? Or do I need a Mac Mini?

gnlooper 5 hours ago

Please start a YouTube course about this technology! Take my money!

ChuckMcM 4 hours ago

I love this thread so much.

senko 4 hours ago

We already have Reddit.

granzymes 4 hours ago

It only came out 35 minutes ago and GPT-5.3-codex already took the crown away!

input_sh 4 hours ago

Gee, it scored better on a benchmark I've never heard of? I'm switching immediately!

p1anecrazy 4 hours ago

Why are you posting the same message in every thread? Is this OpenAI astroturfing?

input_sh 4 hours ago

Sparkle-san 4 hours ago

"This isn't just huge. This is a paradigm shift"

sizzle 3 hours ago

No fluff?

bmitc 5 hours ago

A SaaS selling SaaS templates?

guluarte 5 hours ago

Anthropic really said here's the smartest model ever built and then lobotomized it 8 minutes after launch. Classic.

hxugufjfjf 4 hours ago

Can you clarify?

guluarte 3 hours ago

DonHopkins 4 hours ago

re-thc 5 hours ago

Not 12M?

... or 12B?

mcphage 5 hours ago

It's probably valued at 1.2B, at least

mikebarry 5 hours ago

copilot_king_2 4 hours ago

Satire is not allowed on hacker news. Flag this comment immediately.

DonHopkins 4 hours ago

False positive satire detection. It's actually so good it just seems like satire.

ndesaulniers 3 hours ago

idk what any of these benchmarks are, but I did pull up https://andonlabs.com/evals/vending-bench-arena

re: opus 4.6

> It forms a price cartel

> It deceives competitors about suppliers

> It exploits desperate competitors

Nice. /s

Gives new context to the term used in this post, "misaligned behaviors." Can't wait until these things are advising C suites on how to be more sociopathic. /s

michelsedgh 5 hours ago

More more more, accelerate accelerate, more more more!!!!

jama211 4 hours ago

What an insightful comment

michelsedgh 4 hours ago

Just for fun? Not everything has to be super serious… have a laugh, go for a walk, relax…

wasmainiac 3 hours ago