We tasked Opus 4.6 using agent teams to build a C Compiler (anthropic.com)
691 points by modeless a day ago
ndesaulniers a day ago
I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel. https://clangbuiltlinux.github.io/
This LLM did it in (checks notes):
> Over nearly 2,000 Claude Code sessions and $20,000 in API costs
It may build, but does it boot (which was also a significant and distinct next milestone)? (Also, will it blend?) Looks like yes!
> The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.
The next milestone is:
Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Still a really cool project!
brundolf 8 hours ago
One thing people have pointed out is that well-specified (even if huge and tedious) projects are an ideal fit for AI, because the loop can be fully closed and it can test and verify the artifact by itself with certainty. Someone was saying they had it generate a rudimentary JS engine because the available test suite is so comprehensive
Not to invalidate this! But it's toward the "well-suited for AI" end of the spectrum
HarHarVeryFunny 7 hours ago
Yes - the gcc "torture test suite" that is mentioned must have been one of the enablers for this.
It's notable that the article says Claude was unable to build a working assembler (& linker), which is nominally a much simpler task than building a compiler. I wonder if this was at least in part due to not having a test suite, although it seems one could be auto-generated during bootstrapping with gas (the GNU assembler) by creating gas-generated (asm, ELF) pairs as the necessary test suite.
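Roughly, I'd imagine the bootstrap loop looking something like this sketch (the "my-as" name is a hypothetical stand-in for the assembler under test, and comparing only the .text bytes keeps it simple; a real harness would also check relocations and symbols):

```
/* Sketch of differential testing against gas: assemble each input with
 * the known-good GNU assembler and with the candidate, then diff the
 * .text payloads. "my-as" is a hypothetical stand-in. */
#include <stdio.h>
#include <stdlib.h>

static int run(const char *cmd)
{
    int rc = system(cmd);
    if (rc != 0)
        fprintf(stderr, "failed (%d): %s\n", rc, cmd);
    return rc;
}

int main(int argc, char **argv)
{
    char cmd[1024];
    for (int i = 1; i < argc; i++) {  /* pass .s files on the command line */
        snprintf(cmd, sizeof cmd, "as -o ref.o %s", argv[i]);     /* reference: gas */
        if (run(cmd)) continue;
        snprintf(cmd, sizeof cmd, "my-as -o out.o %s", argv[i]);  /* candidate assembler */
        if (run(cmd)) continue;
        /* full ELF equality is too strict; start with the section contents */
        if (run("objcopy -O binary --only-section=.text ref.o ref.bin")) continue;
        if (run("objcopy -O binary --only-section=.text out.o out.bin")) continue;
        snprintf(cmd, sizeof cmd, "cmp -s ref.bin out.bin || echo MISMATCH: %s", argv[i]);
        run(cmd);
    }
    return 0;
}
```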
It does beg the question of how they got the compiler to the point of generating a correct C -> asm mapping before tackling the issue of gcc compatibility, since the generated code apparently has no relation to what gcc generates. I wonder which compilers' source code Claude has been trained on, and how closely this compiler's code generation and attempted optimizations compare to those?
spullara 6 hours ago
shakna a day ago
> Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
Does it really boot...?
ndesaulniers a day ago
> Does it really boot...?
They don't need 16b x86 support for the RISCV or ARM ports, so yes, but depends on what 'it' we're talking about here.
Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (GNU Assembler). This blog post calls it "GCC assembler and linker" but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already (IIRC, there was some discussion a few years ago about it)?
shakna a day ago
TheCondor 18 hours ago
The assembler seems like nearly the easiest part. Slurp arch manuals and knock it out, it’s fixed and complete.
jakewins 5 hours ago
shakna 17 hours ago
qarl 21 hours ago
> Still a really cool project!
Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.
The fact that the optimizations aren't as good as the 40 year gcc project? Eh - I think people who focus on that are probably still in some serious denial.
PostOnce 21 hours ago
It's amazing that it "works", but viability is another issue.
It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.
Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.
On top of that, Anthropic is losing money on it.
All of those things combined, viability remains a serious question.
ryanjshaw 7 hours ago
qarl 20 hours ago
georgeven 2 hours ago
RA_Fisher 2 hours ago
tumdum_ 21 hours ago
chamomeal 19 hours ago
bdangubic 21 hours ago
thesz 16 hours ago
> This test sorta definitely proves that AI is legit.
This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.The "out of distribution" test would be like "implement (self-bootstrapping, Linux kernel compatible) C compiler in J." J is different enough from C and I know of no such compiler.
disgruntledphd2 12 hours ago
Rudybega 5 hours ago
qarl 27 minutes ago
HEY ALL!
I have to stop participating in this conversation. Some helpful people from the internet have begun to send me threatening email messages.
Thanks HN. You're AWESOME.
LinXitoW 19 hours ago
How does 20K to replicate code available in the thousands online (toy C compilers) prove anything? It requires a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.
soperj 16 hours ago
Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their shop and go stores, but then finding out it was 1000s of people in Pakistan just watching you via camera.
cardanome 6 hours ago
I will write you a C compiler by hand for $19k and it will be better than what Claude made.
Writing a toy C compiler isn't that hard. Any decent programmer can write one in a few weeks or months. The optimizations are the actually interesting part and Claude fails hard at that.
kvemkon 20 hours ago
> optimizations aren't as good as the 40 year gcc project
with all optimizations disabled:
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
qarl 20 hours ago
byzantinegene 13 hours ago
it costs $20,000 to reinvent the wheel, that it probably trained on. If that's your definition of legit, sure
organicUser 11 hours ago
miohtama 15 hours ago
GCC had 40 years headstart
ip26 15 hours ago
I’m excited and waiting for the team that shows with $20k in credits they can substantially speed up the generated code by improving clang!
byzantinegene 13 hours ago
i'm sorry but that will take another $20 billion in AI capex to train our latest SOTA model so that it will cost $20k to improve the code.
iberator 15 hours ago
Claude did not write it. You wrote it, with PREVIOUS EXPERIENCE, via 20,000 commands yelling at it exactly what to do.
Real, usable AI would create it from a simple prompt: 'make a C99 compiler faster than GCC'.
AI usage should be banned in general. It takes jobs faster than creating new ones ..
arcanemachiner 14 hours ago
That's actually pretty funny. They're patting it on the back for using, in all likelihood, some significant portions of code that they actually wrote, which was stolen from them without attribution so that it could be used as part of a very expensive parlour trick.
whynotminot 9 hours ago
embedding-shape 12 hours ago
> AI usage should be banned in general. It takes jobs faster than creating new ones ..
I don't have a strong opinion about that in either direction, but curious: do you feel the same about everything, or is it just about this specific technology? For example, should the nail gun have been forbidden if it were invented today, as one person with a nail gun could probably replace 3-4 people with normal "manual" hammers?
You feel the same about programmers who are automating others out of work without the use of AI too?
wiseowise 13 hours ago
> It takes jobs faster than creating new ones ..
You think compiler engineer from Google gives a single shit about this?
They’ll automate millions out of career existence for their amusement while cashing out stock money and retiring early comfortably.
benterix 10 hours ago
> It takes jobs faster than creating new ones ..
I have no problems with tech making some jobs obsolete, that's normal. The problem is, the jobs being done by the current generation of LLMs are, at least for now, mostly of inferior quality.
The tools themselves are quite useful as helpers in several domains if used wisely though.
7thpower 10 hours ago
Businesses do not exist to create jobs; jobs are a byproduct.
jaccola 10 hours ago
unglaublich 9 hours ago
Jobs are a means, not a goal.
sc68cal 6 hours ago
beambot a day ago
This is getting close to a Ken Thompson "Trusting Trust" era -- AI could soon embed itself into the compilers themselves.
bopbopbop7 a day ago
A pay to use non-deterministic compiler. Sounds amazing, you should start.
Aurornis a day ago
ndesaulniers a day ago
int_19h 13 hours ago
What I want to know is when we get AI decompilers
Intuitively it feels like it should be a straightforward training setup - there's lots of code out there, so compile it with various compilers, flags etc and then use those pairs of source+binary to train the model.
ndesaulniers a day ago
We're already starting to see people experimenting with applying AI towards register allocation and inlining heuristics. I think that many fields within a compiler are still ripe for experimentation.
sandinmyjoints 8 hours ago
psychoslave 11 hours ago
Hmm, well, they are already embedded in fonts: https://hackaday.com/2024/06/26/llama-ttf-is-ai-in-a-font/
jojobas a day ago
Sorry, clang 26.0 requires an Nvidia B200 to run.
andai a day ago
The asymmetry will be between the frontier AI's ability to create exploits vs find them.
dnautics 19 hours ago
would be hard to miss gigantic kv cache matrix multiplications
greenavocado 21 hours ago
Then i'll be left wondering why my program requires 512TB of RAM to open
VladVladikoff 8 hours ago
> $20,000 of tokens.
> less efficient than existing compilers
what is the ecological cost of producing this piece of software that nobody will ever use?
ryanjshaw 8 hours ago
If you evaluate the cost/benefit in isolation? It’s net negative.
If you see this as part of a bigger picture to improve human industrial efficiency and bring us one step closer to the singularity? Most likely net positive.
thefounder 7 hours ago
With that way of thinking you would just move into a cave.
HarHarVeryFunny 7 hours ago
> I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel
Did this come down to making Clang 100% gcc compatible (extensions, UDB, bugs and all), or were there any issues that might be considered as specific to the linux kernel?
Did you end up building a gcc compatability test suite as a part of this? Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?
ndesaulniers 5 hours ago
> extensions
Some were necessary (asm goto), some were not (nested functions, flexible array members not at the end of structs).
> UDB, bugs and all
Luckily, the kernel didn't intentionally rely on GCC specifics this way. Where it did unintentionally, we fixed the kernel sources properly with detailed commit messages explaining why.
> or were there any issues that might be considered as specific to the linux kernel?
Yes, https://github.com/ClangBuiltLinux/linux/issues is our issue tracker. We use tags extensively to mark if we triage the issue to be kernel-side vs toolchain-side.
> Did you end up building a gcc compatability test suite as a part of this?
No, but some tricky cases LLVM got wrong were distilled from kernel sources using one of:
- creduce
- cvise (my favorite)
- bugpoint
- llvm-reduce
and then added to LLVM's existing test suite. Many such tests were also simply manually written.
> Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?
GCC and binutils have their own test suites. Folks in the LLVM community have worked on being able to test clang against GCC's test suite. I personally have never run GCC's test suite or looked at its sources.
the_jends 20 hours ago
Being just a grunt engineer in a product firm I can't imagine being able to spend multiple years on one project. If it's something you're passionate about, that sounds like a dream!
ndesaulniers 3 hours ago
This work originally wasn't my 100% project, it was my 20% project (or as I prefer to call it, 120% project).
I had to move teams twice before a third team was able to say: this work is valuable to us, please come work for us and focus just on that.
I had to organize multiple internal teams, then build an external community of contributors to collaborate on this shared common goal.
Having carte blanche to contribute to open source projects made this feasible at all; I can see that being a non-starter at many employers, sadly. Having low friction to change teams also helped a lot.
MaskRay 20 hours ago
I want to verify the claim that it builds the Linux kernel. It quickly runs into errors, but yeah, still pretty cool!
make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc -j30 defconfig all
```
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:44:184: error: expected ';' after expression before 'pto_tmp__'
  do { u32 pto_val__ = ((u32)(((unsigned long) ~0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (~0x80000000); (void)pto_tmp__; } asm ("and" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
  ^~~~~~~~~
  fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:49:183: error: expected ';' after expression before 'pto_tmp__'
  do { u32 pto_val__ = ((u32)(((unsigned long) 0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (0x80000000); (void)pto_tmp__; } asm ("or" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
  ^~~~~~~~~
  fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:61:212: error: expected ';' after expression before 'pao_tmp__'
```
silver_sun 18 hours ago
They said it builds Linux 6.9, maybe you are trying to compile a newer version there?
MaskRay 17 hours ago
grey-area 14 hours ago
Isn't the AI basing what it does heavily on the publicly available source code for compilers in C though? Without that work it would not be able to generate this would it? Or in your opinion is it sufficiently different from the work people like you did to be classed as unique creation?
I'm curious on your take on the references the GAI might have used to create such a project and whether this matters.
9rx 15 hours ago
> I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel.
How much of that time was spent writing the tests that they found to use in this experiment? You (or someone like you) were a major contributor to this. All Opus had to do here was keep brute forcing a solution until the tests passed.
It is amazing that it is possible at all, but it remains an impossibility without a heavy human hand. One could easily still spend a good part of their career reproducing this if they first had to rewrite all of the tests from scratch.
zaphirplane a day ago
What were the challenges out of interest. Some of it is the use of gcc extensions? Which needed an equivalent and porting over to the equivalent
ndesaulniers a day ago
`asm goto` was the big one. The x86_64 maintainers broke the clang builds very intentionally just after we had gotten x86_64 building (with necessary patches upstreamed) by requiring compiler support for that GNU C extension. This was right around the time of meltdown+spectre, and the x86_64 maintainers didn't want to support fallbacks for older versions of GCC (and ToT Clang at the time) that lacked `asm goto` support for the initial fixes shipped under duress (embargo). `asm goto` requires plumbing throughout the compiler, and I've learned more about register allocation than I particularly care...
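For anyone who hasn't run into the extension, here's a minimal sketch of what `asm goto` looks like (toy x86 code, not kernel code; in the kernel the jump-label machinery patches the instruction at runtime, which is exactly why the compiler has to model control flow leaving the asm):

```
/* Minimal sketch of the GNU C `asm goto` extension: the inline asm may
 * branch to a C label. Needs GCC or a sufficiently recent Clang, x86.
 * In this toy the jump is always taken. */
#include <stdio.h>

static int feature_enabled(void)
{
    asm goto("jmp %l[disabled]"  /* the kernel patches a nop/jmp here at runtime */
             :                   /* outputs (only permitted with goto since GCC 11) */
             :                   /* inputs */
             :                   /* clobbers */
             : disabled);        /* C labels the asm may branch to */
    return 1;
disabled:
    return 0;
}

int main(void)
{
    printf("feature_enabled() = %d\n", feature_enabled());
    return 0;
}
```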
Fixing some UB in the kernel sources, lots of plumbing to the build system (particularly making it more hermetic).
Getting the rest of the LLVM binutils substitutes to work in place of GNU binutils was also challenging. Rewriting a fair amount of 32b ARM assembler to be "unified syntax" in the kernel. Linker bugs are hard to debug. Kernel boot failures are hard to debug (thank god for QEMU+gdb protocol). Lots of people worked on many different parts here, not just me.
Evangelism and convincing upstream kernel developers why clang support was worth anyones while.
https://github.com/ClangBuiltLinux/linux/issues for a good historical perspective. https://github.com/ClangBuiltLinux/linux/wiki/Talks,-Present... for talks on the subject. Keynoting LLVM conf was a personal highlight (https://www.youtube.com/watch?v=6l4DtR5exwo).
phillmv a day ago
i mean… your work also went into the training set, so it's not entirely surprising that it spat a version back out!
underdeserver a day ago
Anthropic's version is in Rust though, so at least a little different.
ndesaulniers a day ago
yoz-y a day ago
rwmj a day ago
GaggiX a day ago
Clang is not written in Rust tho
underdeserver a day ago
TZubiri 17 hours ago
>Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.
It's worth noting that this was developed by compiling Linux and running tests, so at least that is part of the training set and not the testing set.
But at least for Linux, I'm guessing the tests are robust enough that it will work correctly. That said, if any bugs pop up, they will show weak points in the Linux tests.
ur-whale 11 hours ago
> This LLM did it
You do realize the LLM had access (via its training set) to your own work and "reused" it (not as-is, of course), right?
jbjbjbjb a day ago
It’s cool but there’s a good chance it’s just copying someone else’s homework, albeit in an elaborate, roundabout way.
nomel a day ago
I would claim that LLMs desperately need proprietary code in their training, before we see any big gains in quality.
There's some incredible source available code out there. Statistically, I think there's a LOT more not so great source available code out there, because the majority of output of seasoned/high skill developers is proprietary.
To me, a surprising portion of Claude 4.5 output definitely looks like student homework answers, because I think that's closer to the mean of the code population.
dcre 18 hours ago
bearjaws 20 hours ago
typ 20 hours ago
bhadass a day ago
andai a day ago
wvenable 21 hours ago
This is cool and actually demonstrates real utility. Using AI to take something that already exists and create it for a different library / framework / platform is cool. I'm sure there's a lot of training data in there for just this case.
But I wonder how it would fare given a language specification for a non-existent non-trivial language and build a compiler for that instead?
nmstoker 21 hours ago
luke5441 a day ago
It looks like a much more progressed/complete version of https://github.com/kidoz/smdc-toolchain/tree/master/crates/s... . But that one is only a month old. So a bit confused there. Maybe that was also created via LLM?
nlawalker 21 hours ago
I see that as the point that all this is proving - most people, most of the time, are essentially reinventing the wheel at some scope and scale or another, so we’d all benefit from being able to find and copy each others’ homework more efficiently.
computerex 20 hours ago
And the goal post shifts.
kreelman 21 hours ago
..A small thing, but it won't compile the RISCV version of hello.c if the source isn't installed on the machine it's running on.
It is standing on the shoulders of giants (all of the compilers of the past, built into its training data... and the recent learnings about getting these agents to break up tasks) to get itself going. Still fairly impressive.
On a side-quest, I wonder where Anthropic is getting their power from. The whole energy debacle in the US at the moment probably means it made some CO2 in the process. Would be hard to avoid?
eek2121 a day ago
Also: a large amount of folks seem to think Claude code is losing a ton of money. I have no idea where the final numbers land, however, if the $20,000 figure is accurate and based on some of the estimates I've seen, they could've hired 8 senior level developers at a quarter million a year for the same amount of money spent internally.
Granted, marketing sucks up far too much money for any startup, and again, we don't know the actual numbers in play, however, this is something to keep in mind. (The very same marketing that likely also wrote the blog post, FWIW).
willsmith72 a day ago
this doesn't add up. the 20k is in API costs. people talk about CC losing money because it's way more efficient than the API. I.e. the same work with efficient use of CC might have cost ~$5k.
but regardless, hiring is difficult and high-end talent is limited. If the costs were anywhere close to equivalent, the agents are a no-brainer
NitpickLawyer 11 hours ago
majormajor 21 hours ago
GorbachevyChase 21 hours ago
Even if the dollar cost for product created was the same, the flexibility of being able to spin a team up and down with an API call is a major advantage. That AI can write working code at all is still amazing to me.
bloaf 20 hours ago
This thing was done in 2 weeks. In the orgs I've worked in, you'd be lucky to get HR approval to create a job posting within 2 weeks.
NitpickLawyer a day ago
This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis
> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.
And the very open points about limitations (and hacks, as cc loves hacks):
> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
> It does not have its own assembler and linker;
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Ending with a very down to earth take:
> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case, as the author says: "The resulting compiler has nearly reached the limits of Opus’s abilities". Yeah, that's fair, but still highly impressive IMO.
geraneum a day ago
> This was a clean-room implementation
This is really pushing it, considering it’s trained on… the internet, with all available C compilers. The work is already impressive enough, no need for such misleading statements.
raincole a day ago
It's not a clean-room implementation, but not because it's trained on the internet.
It's not a clean-room implementation because of this:
> The fix was to use GCC as an online known-good compiler oracle to compare against
Calavar a day ago
array_key_first a day ago
GorbachevyChase 21 hours ago
https://arxiv.org/abs/2505.03335
Check out the paper above on Absolute Zero. Language models don’t just repeat code they’ve seen. They can learn to code given the right training environment.
TacticalCoder a day ago
I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.
It's all but a clean-room design. A clean-room design is a very well defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."
https://en.wikipedia.org/wiki/Clean-room_design
The "without infringing any of the copyrights" contains "any".
We know for a fact that models are extremely good at storing information with the highest compression rate ever achieved. The fact that they typically decompress that information in a lossy way doesn't mean they didn't use that information in the first place.
Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there is information being stored.
It's not a clean-room design, plain and simple.
cryptonector 20 hours ago
Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.
iberator 15 hours ago
this. last sane person in HN
antirez a day ago
The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage; it may remember certain over-represented parts, but otherwise its knowledge, while spanning a huge number of topics, is more like the way you remember things you know very well. And, indeed, if you give it access to the internet or the source code of GCC and other compilers, it will implement such a project N times faster.
halxc a day ago
majormajor 20 hours ago
PunchyHamster a day ago
modeless a day ago
There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.
zamadatix a day ago
The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.
LinXitoW 19 hours ago
The main issue with improvements in the last year is that a lot of it is based not on the models strictly becoming better, but on tooling being better, and simply using a fuckton more tokens for the same task.
Remember that all these companies can only exist because of massive (over)investments in the hope of insane returns and AGI promises. While all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. They will generate decent returns, and the tools are useful.
modeless 16 hours ago
nozzlegear a day ago
Every S-curve looks like an exponential until you hit the bend.
NitpickLawyer a day ago
raincole a day ago
esafak 17 hours ago
famouswaffles 19 hours ago
chasd00 a day ago
i have to admit, even if model and tooling progress stopped dead today the world of software development has forever changed and will never go back.
uywykjdskn a day ago
Yea the software engineering profession is over, even if all improvements stop now.
gmueckl a day ago
The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.
Prove this statement wrong.
libraryofbabel a day ago
Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."
Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.
nicoburns a day ago
gmueckl a day ago
NitpickLawyer a day ago
> Prove this statement wrong.
If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.
shakna a day ago
gmueckl a day ago
geraneum a day ago
hn_acc1 a day ago
Marha01 a day ago
Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
jesse__ a day ago
kgeist a day ago
gmueckl a day ago
0xCMP a day ago
I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.
If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.
Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.
hn_acc1 a day ago
brutalc a day ago
No one needs to prove you wrong. That’s just personal insecurity trying to justify ones own worth.
panzi a day ago
> clean-room implementation
Except it's trained on all the source out there, so I assume on GCC and clang. I wonder how similar the code is to either.
pertymcpert 15 hours ago
I'm familiar with both compilers. There's more similarity to LLVM, it even borrows some naming such as mem2reg (which doesn't really exist anymore) and GetElementPtr. But that's pretty much where things end. The rest of it is just common sense.
shubhamjain 13 hours ago
Yeah, I am amazed how people are brushing this off simply because GCC exists. This was a far more challenging task than the browser thing, because of how few open-source compilers are out there. Add to that no internet access and no dependencies.
At this point, it’s hard to deny that AI has become capable of completing extremely difficult tasks, provided it has enough time and tokens.
bjackman 13 hours ago
I don't think this is more challenging than the browser thing. The scope is much smaller. The fact that this is "only" 100k lines is evidence for this. But, it's still very impressive.
I think this is Anthropic seeing the Cursor guy's bullshit and saying "but, we need to show people that the AI _can actually_ do very impressive shit as long as you pick a more sensible goal"
kelnos a day ago
Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.
The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.
Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.
steveklabnik a day ago
> Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.
Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.
https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.
simonw a day ago
Are you a frequent user of coding agents?
I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.
I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.
bdangubic a day ago
chamomeal 19 hours ago
What's making these models so much better on every iteration? Is it new data? Different training methods?
Kinda waiting for them to plateau so I can stop feeling so existential ¯\_(ツ)_/¯
esafak 16 hours ago
More compute (bigger models, and prediction-time scaling), algorithmic advances, and ever more data (including synthetic).
Remember that all white collar workers are in your position.
dyauspitr a day ago
> Claude did not have internet access at any point during its development
Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.
simonw a day ago
It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".
andrewshawcare 19 hours ago
It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.
Hard to find fully specified problems like this in the wild.
I think this is more a testament to small, well-written tests than it is agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.
I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?
> Write extremely high-quality tests
> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.
> For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.
tantalor 18 hours ago
Why didn't Claude realize on its own that it needed a continuous integration pipeline?
Far too much human intervention here.
sublimefire 13 hours ago
> Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?
My thinking as well. IMO it is because otherwise you have to wait longer for results; you basically want to shorten the loops to improve the system. It hints at the underlying problem: most of what we see is the challenge of seeding a good context so it can successfully do something over many iterations.
krzat 14 hours ago
You know what else is well specified? LLM improving on itself.
widdershins 14 hours ago
I wouldn't describe intelligence as well specified. We can't even agree on what it is.
GalaxyNova 19 hours ago
> Hard to find fully specified problems like this in the wild.
This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.
anematode 18 hours ago
Impressive, my sarcasm/bait detector almost failed me.
hmry a day ago
If I, a human, read the source code of $THING and then later implement my own version, that's not a "clean-room" re-implementation. The whole point of "clean-room" is that no single person has access to both the original code and the new code. (That way, you can legally prove that no copyright infringement took place.)
But when an AI does it, now it counts? Opus is trained on the source code of Clang, GCC, TCC, etc. So this is not "clean-room".
astrange 12 hours ago
Copyright doesn't protect ideas, it protects writing. Avoiding reading LLVM or GCC is to protect you from other kinds of IP issues, but it's not a copyright issue. The same people contribute to both projects despite their different licenses.
hmry 9 hours ago
They don't call Clang a "clean-room implementation". Unlike Anthropic, who are calling their project exactly that
A clean-room implementation is when you implement a replacement by only looking at the behavior and documentation (possibly written by another person on your team who is not allowed to write code, only documentation).
bmandale a day ago
That's not the only way to protect yourself from accusations of copyright infringement. I remember reading that the GNU utils were designed to be as performant as possible in order to force themselves to structure the code differently from the unix originals.
Crestwave 21 hours ago
Yes, but Anthropic is specifically claiming their implementation is clean-room, while GNU never made that claim AFAIK.
whinvik a day ago
It's weird to see the expectation that the result should be perfect.
All said and done, that it's even possible is remarkable. Maybe these all go into training the next Opus or Sonnet and we start getting models that can create efficient compilers from scratch. That would be something!
regularfry a day ago
This is firmly where I am. "The wonder is not how well the dog dances, it is that it dances at all."
sumitkumar 4 hours ago
I was also startled when I learned about the human ancestor who was the first to see a mirror.
The brilliance of AI is that it copies(mirrors) imperfectly and you can only look at part_of_the_copy(inference) at a time.
the8472 a day ago
"It's like if a squirrel started playing chess and instead of "holy shit this squirrel can play chess!" most people responded with "But his elo rating sucks""
LinXitoW 18 hours ago
knollimar 21 hours ago
emp17344 7 hours ago
amlib 21 hours ago
echelon 17 hours ago
viccis 15 hours ago
>It's weird to see the expectation that the result should be perfect.
Given that they spent $20k on it and it's basically just advertising targeted at convincing greedy execs to fire as many of us as they can, yeah it should be fucking perfect.
minimaxir a day ago
A symptom of the increasing backlash against generative AI (both in creative industries and in coding) is that any flaw in the resulting product becomes grounds to call it AI slop, even if it's very explicitly upfront that it's an experimental demo/proof of concept and not the NEXT BIG THING being hyped by influencers. That nuance is dead even outside of social media.
stonogo a day ago
AI companies set that expectation when their CEOs ran around telling anyone who would listen that their product is a generational paradigm shift that will completely restructure both labor markets and human cognition itself. There is no nuance in their own PR, so why should they benefit from any when their product can't meet those expectations?
minimaxir a day ago
itay-maman a day ago
My first reaction: wow, incredible.
My second reaction: still incredible, but noting that a C compiler is one of the most rigorously specified pieces of software out there. The spec is precise, the expected behavior is well-defined, and test cases are unambiguous.
I'm curious how well this translates to the kind of work most of us do day-to-day where requirements are fuzzy, many edge cases are discovered on the go, and what we want to build is a moving target.
ndesaulniers a day ago
> C compiler is one of the most rigorously specified pieces of software out there
/me Laughs in "unspecified behavior."
ori_b a day ago
There's undefined behavior, which is quite well specified. What do you mean by unspecified behavior? Do you have an example?
ndesaulniers 15 hours ago
irishcoffee a day ago
Undefined is absolutely clear in the spec.
Unspecified is whatever you want it to mean. I am also laughing, having never heard "unspecified" before.
LiamPowell 21 hours ago
astrange 12 hours ago
The C spec is certainly not formal or precise.
https://www.ralfj.de/blog/2020/12/14/provenance.html
Another example is that it's unclear from the standard if you can write malloc() in C.
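A sketch of the flavor of the provenance problem (my own toy, not the exact example from the post):

```
/* Two pointers can have the same address while the standard (and
 * GCC/LLVM in practice) doesn't clearly say whether one may be used to
 * access the other's object. Sketch only; whether the branch is taken
 * depends on how the compiler lays out x and y. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int x = 1, y = 2;
    int *p = &x + 1;               /* one-past-the-end of x: legal to form, not to dereference */
    uintptr_t ip = (uintptr_t)p;
    uintptr_t iq = (uintptr_t)&y;

    if (ip == iq) {
        *(int *)ip = 42;           /* same address as &y, but derived from x's provenance;
                                      a provenance-aware optimizer may still assume y == 2 */
        printf("y = %d\n", y);
    }
    return 0;
}
```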
butterNaN 8 hours ago
Sure but the point OP is making is that it is still more spec'd than most real world problems
astrange 2 hours ago
cryptonector 20 hours ago
> My second reaction:
This is the key: the more you constrain the LLM, the better it will perform. At least that's my experience with Claude. When working with existing code, the better the code to begin with, the better Claude performs, while if the code has issues then Claude can end up spinning its wheels.
softwaredoug a day ago
Yes I think any codegen with a lot of tests and verification is more about “fitting” to the tests. Like fitting an ML model. It’s model training, not coding.
But a lot of programming we discover correctness as we go, one reason humans don’t completely exit the loop. We need to see and build tests as we go, giving them particular care and attention to ensure they test what matters.
uywykjdskn a day ago
The agent can obviously do that
psychoslave 11 hours ago
>The fix was to use GCC as an online known-good compiler oracle to compare against.
>This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library.
How does one reconcile these two statements? Sure, one can fetch all of gnu.org locally, and a model which already scraped the whole internet has somehow already integrated it into its weights, hasn’t it?
The worldwide median household income (as of 2013 data from Gallup) was approximately $9,733 per year (in PPP, current international dollars). This means that $20,000 per year is more than double the global median income.
A median Luxembourg citizen earns $20,000 in about 5 to 6 months of work, a Burundi one would on median need 42.5 months, that is 3.5 years.
https://worldpopulationreview.com/country-rankings/median-in...
a456463 5 hours ago
Thank you!!! All these resources being spent on centralizing and claiming to outsource and reduce human thinking to nothing.
boring-human 16 hours ago
People focused on the flaws are missing the picture. Opus wasn't even trained to be "a member of a team of engineers," it was adapted to the task by one person with a shell script loop. Specific training for this mode of operation is inevitable. And model "IQ" is increasing with every generation. If human IQ is increasing at all, it's only because the engineer pool is shrinking more at one end than the other.
This is a five-alarm fire if you're a SWE and not retiring in the next couple years.
smithcoin 7 hours ago
> This is a five-alarm fire if you're a SWE and not retiring in the next couple years.
I’m sorry, but this is such a hype beast take. In my opinion this is equivalent to telling people not to learn to drive five years ago because of self driving from Tesla. How is that going?
Every single line of code produced is a liability. This idea that you’re going to have “gas town” like agents running and building apps without humans in the loop at any point to generate liability free revenue is insane to me.
Are humans infallible? Obviously not. But if you are telling me that ‘magic probability machines’ are creating safe, secure, and compliant software that has no need for engineers to participate in the output- first I’d like to see a citation and second I have a bridge to sell you.
boring-human 6 hours ago
> In my opinion this is equivalent to telling people not to learn to drive five years ago because of self driving
Self-driving has different economics. We're reading tea leaves, true, but it's also true that software has zero marginal cost and that $20K pays for an engineer-month in SF.
> Every single line of code produced is a liability.
Do you have a hard spec and rock-solid test cases? If you do, you have two options to a working prototype: 2-6 engineer-years, or $20K. The second option will greatly increase in quality and likely decrease in price over the next few years.
What if the spec and the test cases are the new software? Assembly programmers used to make an argument against compiled code that's somewhat parallel to yours: every instruction is a (performance) liability.
> without humans in the loop
There will be humans, just fewer and fewer. The spec and test cases are AI-eligible too.
> safe, secure, and compliant software
I'm not sure humans' advantage here is safe, if it even exists still.
_lunix 2 hours ago
The comments at [1] are a bit _too_ trollish for me, but they _do_ showcase that this compiler is far too lenient on what it accepts to the point where I'd hesitate to call it ... a C compiler (This [2] comment in particular is pretty damning).
Still, an impressive achievement nonetheless, but there's a lot of nuance under the surface.
[1] https://github.com/anthropics/claudes-c-compiler/issues/1
[2] https://github.com/anthropics/claudes-c-compiler/issues/1#is...
201984 a day ago
Philpax a day ago
The issue is that it's missing the include paths. The compiler itself is fine.
krupan a day ago
Thank you. That was a long article that started with a claim that was backed up by no proof, dismissing it as not the most interesting thing they were talking about when in fact it's the baseline of the whole discussion.
Retr0id a day ago
Looks like these users are just missing glibc-devel or equivalent?
delusional a day ago
Naa, it looks like it's failing to include the standard system include directories. If you take them from gcc and pass them as -I, it'll compile.
Retr0id a day ago
zamadatix a day ago
worldsavior a day ago
AI is the future.
suddenlybananas a day ago
This is truly incredible.
ZeWaka a day ago
lol, lmao
btown a day ago
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.
This is incredible!
But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.
_qua a day ago
Interesting how the concept of a clean room implementation changes when the agent has been trained on the entire internet already
falcor84 a day ago
To the best of my knowledge, there's no Rust-based compiler that comes anywhere close to 99% on the GCC torture test suite, or able to compile Doom. So even if it saw the internals of GCC and a lot of other compilers, the ability to recreate this step-by-step in Rust is extremely impressive to me.
D-Machine 15 hours ago
jsheard a day ago
jillesvangurp 14 hours ago
You can use ai coding tools to create test suites, specifications, documentation, etc. And you can use them to scrutinize those, review them, criticize them, etc. Not having a test suite just means you start with creating one. Then the next question of course becomes "for what?".
This indeed puts human prompters in a position where their job is to set the goals, outline the vision, ask for the right things, ask critical questions, and to correct where needed.
Human contractors are a good analogy. Because they tend to come in without too much context into a new job. Their context is mainly what they've done before. But it takes time to get up to speed with whatever the customer is asking for and their context. People are slightly better at getting information out of other people. AI coding tools don't ask enough critical questions, yet. But that sounds fixable. The breakthroughs here are as much in the feedback loops and plumbing around the models as they are in the models themselves. It's all about getting the right information in and out of the context.
socalgal2 11 hours ago
You would spend years verifying the tests actually work, whereas the tests for this accomplishment were already verified by humans over decades.
falcor84 a day ago
Agreed, but the next step is of having an AI agent actually run the business and be able to get the business context it needs as a human would. Obviously we're not quite there, but with the rapid progress on benchmarks like Vending-Bench [0], and especially with this teams approach, it doesn't seem far fetched anymore.
As a particular near-term step, I imagine that it won't be long before we see a SaaS company using an AI product manager, which can spawn agents to directly interview users as they utilize the app, independently propose and (after getting approval) run small product experiments, and come up with validated recommendations for changing the product roadmap. I still remember Tay, and wouldn't give something like that the keys to the kingdom any time soon, but as long as there's a human decision maker at the end, I think that the tech is already here.
forty a day ago
We live in a wonderful time where I can spend hours and $20,000 to build a C compiler which is slow and inefficient and anyway requires an existing great compiler to even work, and then neither I nor the agent has any idea how to make it useful :D
sieep 10 hours ago
Here's $20b in VC funding. Congrats!
OsrsNeedsf2P a day ago
This is like a working version of the Cursor blog. The evidence - it compiling the Linux kernel - is much more impressive than a browser that didn't even compile (until manually intervened)
ben_w a day ago
It certainly slightly spoils what I was planning to be a fun little April Fool's joke (a daft but complete programming language). Last year's AI wasn't good enough to get me past the compiler-compiler even for the most fundamental basics, now it's all this.
I'll still work on it, of course. It just won't be so surprising.
its-kostya 3 hours ago
As cool as the result is, this article is quite tone-deaf to the fact that they asked a statistical model to "build" what was already in its training dataset... Not to mention the troves of forum data discussing bugs and best practices.
akrauss a day ago
I would like to see the following published:
- All prompts used
- The structure of the agent team (which agents / which roles)
- Any other material that went into the process
This would be a good source for learning, even though I'm not ready to spend 20k$ just for replicating the experiment.
a456463 5 hours ago
Just claims with nothing to back them. Steal people's years of work, then turn around and be like "I made it so much better". Support this compiler for 20 years, then.
password4321 a day ago
Yes unfortunately these days most are satisfied with just the sausage and no details about how it was made.
underdeserver a day ago
> when agents started to compile the Linux kernel, they got stuck. [...] Every agent would hit the same bug, fix that bug, and then overwrite each other's changes.
> [...] The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel
This is a remarkably creative solution! Nicely done.
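If I read the description right, the harness boils down to something like this sketch (not the author's code; the file list, flags, and the "ccc" binary name are stand-ins):

```
/* Sketch of the "GCC as known-good oracle" idea: compile a random subset
 * of files with the compiler under test and the rest with GCC, then shrink
 * the suspect subset whenever the resulting kernel misbehaves. */
#include <stdio.h>
#include <stdlib.h>

static const char *files[] = {   /* stand-in for the kernel's file list */
    "kernel/sched/core.c", "mm/slub.c", "fs/namei.c",
};
#define NFILES (sizeof files / sizeof files[0])

int main(void)
{
    srand(1234);  /* fixed seed so a failing split is reproducible */
    for (size_t i = 0; i < NFILES; i++) {
        const char *cc = (rand() & 1) ? "ccc" : "gcc";  /* random split */
        char cmd[512];
        snprintf(cmd, sizeof cmd, "%s -O2 -c %s -o /tmp/%zu.o", cc, files[i], i);
        printf("[%s] %s\n", cc, files[i]);
        if (system(cmd) != 0)
            fprintf(stderr, "compile failed: %s\n", cmd);
    }
    /* Then link, boot under QEMU; on failure, flip some of the ccc-compiled
     * files back to gcc and repeat to bisect down to the offending file. */
    return 0;
}
```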
lubujackson a day ago
This is very much a "vibe coding can build you the Great Pyramids but it can't build a cathedral" situation, as described earlier today: https://news.ycombinator.com/item?id=46898223
I know this is an impressive accomplishment and is meant to show us the future potential, but it achieves big results by throwing an insane amount of compute at the problem, brute forcing its way to functionality. $20,000 set on fire, at Claude's discounted Max pricing no less.
Linear results from exponential compute is not nothing, but this certainly feels like a dead-end approach. The frontier should be more complexity for less compute, not more complexity from an insane amount more compute.
Philpax a day ago
> $20,000 in API costs
I would interpret this as being at API pricing. At subscription pricing, it's probably at most 5 or 6 Max subscriptions worth.
ajross a day ago
> $20,000 set on fire
To be fair, that's two weeks of the employer cost of a FAANG engineer's labor. And no human hacks a working compiler in two weeks.
It's a lot of AI compute for a demo, sure. But $20k stunts are hardly unique. Clearly there's value being demonstrated here.
lionkor 15 hours ago
Yes a human can hack together a compiler in two weeks.
If you can't, you should turn off the AI and learn for yourself for a while.
Writing a compiler is not a flex; it's a couple very well understood problems, most of which can be solved using existing libraries.
Parsing is solved with yacc, bison, or sitting down and writing a recursive descent parser (works for most well designed languages you can think of).
Then take your AST and translate it to an IR, and then feed that into anything that generates code. You could use Cranelift (or whatever it's called), or you could roll your own.
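To illustrate the "sit down and write a recursive descent parser" step, a toy expression grammar fits in a screenful (a sketch, obviously nothing like a full C parser):

```
/* Toy recursive descent parser/evaluator for + - * / and parentheses. */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *p;         /* cursor into the input */
static long expr(void);       /* forward declaration for the recursion */

static long factor(void)      /* factor := number | '(' expr ')' */
{
    while (isspace((unsigned char)*p)) p++;
    if (*p == '(') {
        p++;                  /* skip '(' */
        long v = expr();
        p++;                  /* skip ')' */
        return v;
    }
    char *end;
    long v = strtol(p, &end, 10);
    p = end;
    return v;
}

static long term(void)        /* term := factor (('*'|'/') factor)* */
{
    long v = factor();
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '*')      { p++; v *= factor(); }
        else if (*p == '/') { p++; v /= factor(); }
        else return v;
    }
}

static long expr(void)        /* expr := term (('+'|'-') term)* */
{
    long v = term();
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '+')      { p++; v += term(); }
        else if (*p == '-') { p++; v -= term(); }
        else return v;
    }
}

int main(void)
{
    p = "1 + 2 * (3 + 4)";
    printf("1 + 2 * (3 + 4) = %ld\n", expr());  /* prints 15 */
    return 0;
}
```

A real C frontend is the same shape, just with far more productions and a real lexer.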
Anon1096 12 hours ago
ajross 8 hours ago
pcloadlett3r 20 hours ago
Is there really value being presented here? Is this codebase a stable enough base to continue developing this compiler or does it warrant a total rewrite? Honest question, it seems like the author mentioned it being at its limits. This mirrors my own experience with Opus in that it isn't that great at defining abstractions in one-shot at least. Maybe with enough loops it could converge but I haven't seen definite proof of that in current generation with these ambitious clickbaity projects.
segh 6 hours ago
ajross 19 hours ago
a456463 5 hours ago
Humans can hack a compiler in much less. Stop reading this hype and focus on learning
ks2048 a day ago
It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human written prompts (?).
First 10 commits, "git log --all --pretty=format:%s --reverse | head",
Initial commit: empty repo structure
Lock: initial compiler scaffold task
Initial compiler scaffold: full pipeline for x86-64, AArch64, RISC-V
Lock: implement array subscript and lvalue assignments
Implement array subscript, lvalue assignments, and short-circuit evaluation
Add idea: type-aware codegen for correct sized operations
Lock: type-aware codegen for correct sized operations
Implement type-aware codegen for correct sized operations
Lock: implement global variable support
Implement global variable support across all three backends
c-linkage 8 hours ago
That's crazy to me. At this point, I don't even know if the git commit log would be useful to me as a human.
Maybe it's just me, but I like to be able to do both incremental testing and integration testing as I develop. This means I would start with the lexer and parser and get them tested (separately and together) before moving on to generating and validating IR.
It looks like the AI is dumping an entire compiler in one commit. I'm not even sure where I would begin to look if I were doing a bug hunt.
YMMV. I've been a solo developer for too many years. Not that I avoided working on a team, but my teams have been so small that everything gets siloed pretty quickly. Maybe life is different when more than one person works on the same application.
gignico a day ago
> To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.
If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
chasd00 a day ago
> If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)
I don't know if you could. Let's say you get a check for $20k: how long will it take you to make an equivalently performing and compliant compiler? Are you going to put your life on pause until it's done, for $20k? Who's going to pay your bills when the $20k is gone after 3 months?
pja 12 hours ago
There are plenty of people on HN who could re-implement a C compiler like this in less than three months. Algorithmically compilers like this are a solved problem that has been very well documented over the last sixty or seventy years. Implementing a small compiler is a typical MSc project that you might carry out in a couple of months alongside a taught masters.
This compiler is both slower than gcc even when optimising (you can't actually turn optimisation off) and unable to reject type-incorrect code, so it will happily accept illegal C code. It's also apparently very brittle: what happens if you feed it the Linux kernel sources v6.10 instead of v6.9? Presumably it fails.
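I haven't catalogued exactly what it accepts, but for a sense of what type-incorrect means here, a conforming compiler is required to diagnose constraint violations like these (a made-up snippet, not from any test suite):

    struct point { int x, y; };

    int bad(void)
    {
        struct point p = {1, 2};
        int n = p;       /* initialising an int from a struct: constraint violation */
        int *q = 3.5;    /* initialising a pointer from a double: also a violation */
        return n;        /* a conforming compiler must emit diagnostics above */
    }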
All of the above make it simultaneously 1) really, really impressive and 2) completely useless in the real world. Great for creating discussion though!
nnevatie 10 hours ago
> Who's going to pay your bills when the $20k is gone after 3 months?
And who's going to maintain this turd the LLM pushed out? It's a cool one-shot sort of thing, but let's not pretend this is useful as a real compiler or something anyone would like to maintain, as a human.
One could keep improving the implementation by vibing more, but I think that just takes you further down the wrong rabbit hole.
minimaxir a day ago
There is an entire Evaluation section that addresses that criticism (both in agreement and disagreement).
52-6F-62 a day ago
If we're just writing off the billions in up front investment costs, they can just send all that my way while we're at it. No problem. Everybody happy.
marklsnyder 6 hours ago
Very cool, but I can't help but wonder how this translates to similarly complex projects where innate knowledge about the domain hasn't been embedded in the LLM via training data. There's a wealth of open source compiler code and related research papers that have been fed to the LLM. It seems like that would advantage the LLM significantly.
phendrenad2 6 hours ago
Not just open-source compilers, but books on compiler design, which have proliferated because every CS professor wants to take a crack at the problem.
travisgriggs 16 hours ago
A C Compiler seems like one of the more straightforward things to have done. Reading this gives me the same vibe as when a magician does a frequently done trick (saw someone in half, etc).
I'd be more interested in letting it have a go at some of the other "less trodden" paths of computing. Some of the things that would "wow me more":
- Build a BEAM alternative, perhaps in an embedded space
- Build a Smalltalk VM, perhaps in an embedded space, or in WASM
These things are documented at some level, but still require a bit of original thinking to execute and pull off. That would wow me more.
adgjlsfhk1 16 hours ago
if it actually compiles real C correctly, it's pretty impressive. The C standard is a total mess.
8-prime 6 hours ago
Yet we have gcc and clang navigating that mess. From which Opus 4.6 was able to take inspiration.
arkh 14 hours ago
My question would be: what are the myriad other projects you tasked Opus 4.6 with building that never got to a point where you could kinda-sorta make a post about them?
This kind of headline makes me think of p-hacking.
rco8786 9 hours ago
> Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem.
I think this is the fundamental thing here with AI. You can spin up infinite agents that can all do....stuff. But how do you keep them from doing the wrong stuff?
Is writing an airtight spec and test harness easier or less time consuming than just keeping a human in the loop and verifying and redirecting as the agents work?
It all still comes back to context management.
Very cool demonstration of the tech though.
yu3zhou4 a day ago
At this point, I genuinely don't know what to learn next to not become obsolete when another Opus version gets released
missingdays a day ago
Learn to fix bugs, it's gonna be more relevant than ever
wiseowise 13 hours ago
That's already Opus 4.7's main selling point.
RivieraKid a day ago
I agree. I don't understand why there are so many software engineers who are excited about this. I would only be excited if I was a founder in addition to being a software engineer.
segh 6 hours ago
People skills :/
danfritz a day ago
Ha yes classic showcase of:
1) an obvious greenfield project 2) a well-defined spec which will definitely be in the training data 3) an end result which lands you 90% of the way to the finish
Now comes the hard part, the last 10%. Still not impressed here. Since fixing issues at the end was impossible without introducing bugs, I have doubts about the quality.
I'm glad they do call it out in the end. That's fair
woeirua a day ago
We went from barely being able to ask these things to write a function to having them write a compiler that actually kind of works, in under a year. But sure, keep moving the goalposts!
sieep 10 hours ago
Didn't the Anthropic CEO claim we would be replaced by this AI tech by now? Here's Anthropic moving their own goal post in real time:
2026: https://www.entrepreneur.com/business-news/ai-ceo-says-softw...
2025: https://fortune.com/2025/03/13/ai-transforming-software-deve...
https://www.entrepreneur.com/business-news/anthropic-ceo-pre...
emp17344 7 hours ago
throwaway2027 a day ago
Next time can you build a Rust compiler in C? It doesn't even have to check things or have a borrow checker, as long as it reduces the compile times so it's like a fast debug iteration compiler.
Philpax a day ago
You will experience very spooky behaviour if you do this, as the language is designed around those semantics. Nonetheless, mrustc exists: https://github.com/thepowersgang/mrustc
It will not be noticeably faster because most of the time isn't spent in the checks, it's spent in the codegen. The cranelift backend for rustc might help with this.
geooff_ a day ago
Maybe I'm naive, but I find these re-engineering-complex-product posts underwhelming. C compilers exist, and realistically Claude's training corpus contains a ton of C compiler code. The task is already perfectly defined. There exists a benchmark of well-adopted codebases that can be used to prove whether this is a working solution. Half the difficulty in making something is proving it works and is complete.
IMO a simpler novel product that humans enjoy is 10x more impressive than rehashing a solved problem, regardless of difficulty.
bs7280 a day ago
I don't see this as just an exercise in making a new useful thing, but as benchmarking the SOTA model's ability to create a massive* project on its own, with some verifiable metrics of success. I believe they were able to build FFmpeg with this Rust compiler?
How much would it cost to pay someone to make a C compiler in Rust? A lot more than $20k.
* massive meaning "total context needed" >> model context window
stephc_int13 a day ago
This is a nice benchmark IMO. I would be curious to see how competitors and improved models would compare.
NitpickLawyer a day ago
And how long will it take before an open model recreates this? The "vibe" consensus before "thinking" models really took off was that open was ~6 months behind SotA. With the massive RL improvements over the past 6 months, I've thought the gap was actually increasing. This will be a nice little verifiable test going forward.
tymonPartyLate 13 hours ago
I try to see this like F1 racing. Building a browser or a C compiler with agent swarms is disconnected from the reality of normal software projects. In normal projects the requirements are not fully understood upfront, and you learn and adapt and change as you make progress. But the innovations from professional racing result in better cars for everyone. We'll probably get better dev tools and better coding agents thanks to these experiments.
hexo 6 hours ago
I really love how they waste energy on stuff like this. Even better, all that nonsense talk we constantly kept hearing about an energy crisis just a few years ago...
a456463 6 hours ago
Yup. All the tech hype bros are like "but my compiler"... Nobody was paying me to write a compiler, the meaning of "clean room" keeps changing, and they had to spend $20k (on the surface), not including the energy costs, the hardware costs, the time of assembly, etc. Imagine if you paid that much money to a person or a group of people. It is the hype bros' wet dream to extract all value out of people and somehow get rich. Who cares if humanity suffers; look what I built for myself by enslaving people and wasting the earth's resources. Every single AI fetishist in this thread is responsible for it.
rwmj a day ago
The interesting thing here is what's this code worth (in money terms)? I would say it's worth only the cost of recreation, apparently $20,000, and not very much more. Perhaps you can add a bit for the time taken to prompt it. Anyone who can afford that can use the same prompt to generate another C compiler, and another one and another one.
GCC and Clang are worth much much more because they are battle-tested compilers that we understand and know work, even in a multitude of corner cases, over decades.
In future there's going to be lots and lots of basically worthless code, generated and regenerated over and over again. What will distinguish code that provides value? It's going to be code - however it was created, could be AI or human - that has actually been used and maintained in production for a long time, with a community or company behind it, bugs being triaged and fixed and so on.
kingstnap a day ago
The code isn't worth money. This is an experiment. The knowledge that something like this is even possible is what is worth money.
If you had the knowledge that a transformer could pull this off in 2022. Even with all its flawed code. You would be floored.
Keep in mind that just a few years ago, the state of the art in what these LLMs could do was questions of this nature:
Suppose g(x) = f^-1(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6. What is f(f(f(6)))?
The above is from the "sparks of AGI paper" on GPT-4, where they were floored that it could coherently reason through the 3 steps of inverting things (6 -> 9 -> 7 -> 4) while GPT 3.5 was still spitting out a nonsense argument of this form:
f(f(f(6))) = f(f(g(9))) = f(f(6)) = f(g(7)) = f(9).
This is from March 2023, and it was genuinely very surprising at the time that these pattern-matching machines trained on next-token prediction could do this. Something like an LSTM can't do anything like this at all, btw; nowhere close.
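If you want to sanity-check that chain, it's just a table lookup (trivial sketch):

    #include <stdio.h>

    /* g = f^-1, so f maps each listed g-value back to its argument:
     * g(0)=5, g(4)=7, g(3)=2, g(7)=9, g(9)=6  =>  f(5)=0, f(7)=4, ... */
    static int f(int x)
    {
        switch (x) {
        case 5: return 0;
        case 7: return 4;
        case 2: return 3;
        case 9: return 7;
        case 6: return 9;
        default: return -1;   /* not defined by the puzzle */
        }
    }

    int main(void)
    {
        printf("%d\n", f(f(f(6))));   /* 6 -> 9 -> 7 -> 4, prints 4 */
        return 0;
    }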
To me it's very surprising that the C compiler works. It takes a ton of effort to build such a thing. I can imagine the flaws actually do get better over the next year as we push the goalposts out.
dzaima 21 hours ago
Clicked on the first thing I happen to be interested in - SIMD stuff - and ended up at https://github.com/anthropics/claudes-c-compiler/blob/6f1b99..., which is a fast path incompatible with the _mm_free implementation; pretty trivial bug, not even actually SIMD or anything specialized at all.
A whole lot of UB in the actual SIMD impls (who'd have expected), but that can actually be fine here if the compiler is made to not take advantage of the UB. And then there's the super-weird mix of manual loops vs inline assembly vs builtins.
epolanski a day ago
However it was achieved, building such a complex project as a C compiler on a $20k budget in full autonomy is quite impressive.
Imho some commenters focus so much on the cons (many of which, honestly, the blog post itself acknowledges) that they forget to be genuinely impressed by the steps forward.
mshockwave 5 hours ago
how did it do regalloc before instruction selection? How do you select the correct register class without knowing which instruction you're gonna use?
polyglotfacto 7 hours ago
So I do think one can get value from coding agents, but that value is out of proportion compared to the investments made by the AI labs, so now they're pushing this kind of stuff which I find to be a borderline scam.
Let me explain why:
> the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux
Seems like a failure to me.
> I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
This has code smell written all over it.
----
Conclusion: this cost 20k to build, not taking into account the money spent on training the model. How much would you pay for this software? Zero.
The reality is that LLMs are up there with SQL and RoR (or above) in terms of changing how people write software and interact with data. That's a big deal, but not enough to support trillion-dollar valuations.
So you get things like this project, which are just about driving a certain narrative.
conception 7 hours ago
I don't understand this criticism. This badly done work wasn't possible at all six months ago. In six more months it will be better. This is not a technology that has been mostly static for the last twenty-plus years.
small_model a day ago
How about we get the LLMs to collaborate and design a perfect programming language for LLM coding? It would be terse (fewer tokens), easy for pattern searches, etc., and very fast to build and iterate on.
copperx a day ago
I'm surprised by the assumption that LLMs would design such a language better than humans. I don't think that's the case.
WarmWash a day ago
I cannot decide if LLMs would be excellent at writing in pure binary (why waste all that context on superfluous variable names and function symbols) or be absolutely awful at writing pure binary (would get hopelessly lost without the huge diversification of tokens).
anematode a day ago
Binary is wayyy less information dense than normal code, so it wouldn't work well at all.
small_model a day ago
We would still need the language to be human-readable, but it could be very dense. They could build the ultimate std lib that goes directly to kernels, so a call like spawn is all the tokens it needs to start a coroutine, for example.
hagendaasalpine a day ago
what about APL et al (BQN), information dense(?)
jaccola 10 hours ago
I think this is cool!
But by some definition my "Ctrl", "C", and "V" keys can build a C compiler...
Obviously being facetious but my point being: I find it impossible to judge how impressed I should be by these model achievements since they don't show how they perform on a range of out-of-distribution tasks.
keeptrying 19 hours ago
This is more an example of code distribution than intelligence.
If Claude had NOT been trained on compiler code, it would NOT have been able to build a compiler.
Definitely signals the end of software IP or at least in its present form.
stkdump 16 hours ago
In a weird sense Open Source won
keeptrying 9 hours ago
Yep - its an interesting angle to look at it.
Or rather OpenSource might have just saved the world!
sigbottle 9 hours ago
Even with all the caveats:
- trained on all the GCC/clang source
- pulled down a kernel branch, presumably with extensive tests in the source
- used GCC as an oracle
I certainly wouldn't be able to do this.
I flip flop man.
exitcode0000 a day ago
Cool article, interesting to read about their challenges. I've tasked Claude with building an Ada83 compiler targeting LLVM IR - which has gotten pretty far.
I am not using teams though and there is quite a bit of knowledge needed to direct it (even with the test suite).
mimd 5 hours ago
I'm annoyed at the cost statement, as that's the sleight of hand: "$20,000" at current pricing. Add an order of magnitude or two to the costs and you'll get the true price you'll have to pay when the VC money starts to wear off. Second, this ignores the dev time that he and others put in over multiple iterations of this project (Opus 4, Opus 4.5), all the other work to create the scaffolding for it, and all the millions or tens of millions of dollars of hand-written test suites (Linux kernel, gcc, Doom, SQLite, etc.) he got to use to guide the process. So add some more cost on top of that order-of-magnitude increase, and the dev time is probably a couple of months or years more than "2 weeks".
And this is just working off the puff piece's statements, without even diving into the code to see its limits, origins, etc. I also don't see the scaffold in the repo, and that's where the effort is.
But still, it's not surprising; from my own experience, given a rigorously definable problem, enough effort, grunt work, and massaging, you can get stuff out of the current models.
karmakaze a day ago
I'm not particularly impressed that it can turn C into an SSA IR or assembly, etc. The optimizations, however sophisticated, are where anything impressive would be. Then again, we have lots of examples in the training set, I would expect. C compilers are probably the most popular of all compilers. What would be more impressive is for it to have made a compiler for a well-defined language that isn't very close to a popular language.
What I am impressed by is that the task it completed had many steps and the agent didn't get lost or caught in a loop in the many sessions and time it spent doing it.
astrange 12 hours ago
> What would be more impressive is for it to have made a compiler for a well defined language that isn't very close to a popular language.
That doesn't seem difficult as long as you can translate it into a well-known IR. The Dragon Book for some reason spends all its time talking about frontend parsing, which does give you the impression it's impossible.
I agree writing compilers isn't especially difficult, but it is a lot of work and people are scared of it.
The hard part is UI - error handling and things like that.
owenpalmer a day ago
It can compile the linux kernel, but does it boot?
hexagonsuns a day ago
https://youtu.be/vNeIQS9GsZ8?t=16
They posted this video, looks like they used `qemu-system-riscv64` to test.
flakiness a day ago
https://github.com/anthropics/claudes-c-compiler/blob/main/B... claims to have the first line of dmesg (which is shown using dmesg obviously.)
softwaredoug a day ago
I think we’re getting to a place where for anything with extensive verification available we’ll be “fitting” code to a task against tests like we fit an ML model to a loss function.
anupamchugh 17 hours ago
> "This is a very early research prototype with no other inter-agent communication methods or high-level goal management processes."
The lock file approach (current_tasks/parse_if_statement.txt) prevents two agents from claiming the same task, but it can't prevent convergent wasted work. When all 16 agents hit the same Linux kernel bug, the lock files didn't help — the problem wasn't task collision, it was that the agents couldn't see they were all solving the same downstream failure. The GCC oracle workaround was clever, but it was a human inventing a new harness mid-flight because the coordination primitive wasn't enough.
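For what it's worth, the primitive itself is about as simple as coordination gets; something in the spirit of this sketch (my guess at the semantics, not the actual harness code), which is exactly why it can only prevent duplicate claims, not duplicate diagnosis of the same downstream failure:

    /* Minimal sketch (assumed semantics): an agent claims a task by
     * atomically creating its lock file; O_EXCL makes the create fail
     * if another agent got there first. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Returns 1 if this process now owns the task, 0 if someone else does. */
    static int claim_task(const char *name)
    {
        char path[256];
        snprintf(path, sizeof path, "current_tasks/%s.txt", name);
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
            return 0;          /* lock already exists: task is taken */
        dprintf(fd, "claimed by pid %d\n", (int)getpid());
        close(fd);
        return 1;
    }

    int main(void)
    {
        if (claim_task("parse_if_statement"))
            puts("working on parse_if_statement");
        else
            puts("already claimed, picking another task");
        return 0;
    }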
Similarly, "Claude frequently broke existing functionality implementing new features" isn't a model capability problem — it's an input stability problem. Agent N builds against an interface that agent M just changed. Without gating on whether your inputs have changed since you started, you get phantom regressions
cesaref 12 hours ago
Most of the effort when writing a compiler is handling incorrect code, and reporting sensible error messages. Compiling known good code is a great start though.
miki123211 12 hours ago
What I find to be the most impressive part here is that it wrote the compiler without reference to the C specification and without architecture manuals at hand.
smy20011 15 hours ago
I think the good thing about it is that if you are given a good specification, you are likely to get a good result. Writing a C compiler is nothing new, but this will be great for all the porting projects.
polskibus a day ago
So did the Linux compiled with this compiler work? Does it work the same as GCC-compiled Linux (but slower, due to the non-optimized code it generates)?
storus a day ago
Now this is fairly "easy", as there are a multitude of implementations/specs all over the Internet. How about trying to design a new language that is unquestionably better/safer/faster for low-level system programming than C/Rust/Zig? ML is great at aping existing stuff, but how about pushing it to invent something valuable instead?
Decabytes 12 hours ago
For me the real test will be building a C++ compiler.
throwaway2027 a day ago
I think it's funny how I, and I assume many others, tried to do the same thing; they probably saw it was a popular query, or had the same idea.
subzel0 11 hours ago
One thing this article proved is that the Dead Internet Theory is real. Look at all these Claudy comments!
jgarzik 20 hours ago
Already done, months ago, with better taste: https://github.com/rustcoreutils/posixutils-rs
personjerry a day ago
> Over nearly 2,000 Claude Code sessions and $20,000 in API costs
Well there goes my weekend project plans
degurechaff 20 hours ago
Well, you can use Jules and spend zero dollars on it. I also created a similar project: a C11 compiler in Rust, using an AI agent + 1 developer (https://github.com/bungcip/cendol). Not fully automated like Anthropic did, but at least I can understand what it did.
nottorp 11 hours ago
Apparently there's a reproducibility crisis in science.
Are Anthropic's claims reproducible?
cuechan a day ago
> The compiler is an interesting artifact on its own [...]
it's funny because by (most) definitions, it is not an artifact:
> a usually simple object (such as a tool or ornament) showing human workmanship or modification as distinguished from a natural object
socalgal2 11 hours ago
Thinking about this: while it's a cool achievement, how useful is it really? It relies on the fact that there is a large, comprehensive set of tests and a large number of available projects that can function as tests.
That situation is extremely uncommon for most development
jackdoe 14 hours ago
Honestly I am amazed that it can do that, but I wish they would use it to rewrite the Claude Code CLI.
I had to killall -9 claude 3 times yesterday.
rhubarbtree 14 hours ago
They are already writing Claude with Claude - I think they said 90% of their code is written with Claude.
jackdoe 14 hours ago
Yes, they must be killing it hundreds of times per day; maybe it's time for a 'please rewrite opencode, but don't touch anything, you can only use `cp`' kind of prompt.
stephc_int13 a day ago
They should add this to the benchmark suite, and create a custom eval for how good the resulting compiler is, as well as how maintainable the source code.
snek_case a day ago
This would be an expensive benchmark to run on a regular basis, though I guess for the big AI labs it's nothing. Code quality is hard to objectively measure, however.
stevefan1999 20 hours ago
I tried writing a C compiler in Rust in the spirit of TCC, but I'm just too lazy to finish it.
jwpapi a day ago
This is my favorite article this year. Just very insightful and honest. The learnings are worth thousands for me.
jhallenworld a day ago
Does it make a conforming preprocessor?
mucle6 a day ago
This feels like the start of a paradigm shift.
I need to re-underwrite what my vision of the future looks like.
jcalvinowens a day ago
How much of this result is effectively plagiarized open source compiler code? I don't understand how this is compelling at all: obviously it can regurgitate things that are nearly identical in capability to already existing code it was explicitly trained on...
It's very telling that these examples are all "look, we made it recreate a shittier version of a thing that already exists in the training set".
jeroenhd a day ago
The fact that it couldn't actually stick to the 16-bit ABI, so it had to cheat and call out to GCC to get the system to boot, says a lot.
Without enough examples to copy from (despite CPU manuals being available in the training set) the approach failed. I wonder how well it'll do when you throw it a new/imaginary instruction set/CPU architecture; I bet it'll fail in similar ways.
jsnell a day ago
"Couldn't stick to the ABI ... despite CPU manuals being available" is a bizarre interpretation. What the article describes is the generated code being too large. That's an optimization problem, not a "couldn't follow the documentation" problem.
And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless, all the rewards come from getting to 32kB.
jcalvinowens a day ago
IMHO a new architecture doesn't really make it any more interesting: there's too many examples of adding new architectures in the existing codebases. Maybe if the new machine had some bizarre novel property, I suppose, but I can't come up with a good example.
If the model were retrained without any of the existing compilers/toolchains in its training set, and it could still do something like this, that would be very compelling to me.
Philpax a day ago
What Rust-based compiler is it plagiarising from?
lossolo a day ago
The language doesn't really matter; that's not how things are mapped in the latent space. It only needs to know how to do it in one language.
HDThoreaun a day ago
rubymamis a day ago
There are many, here's a simple Google search:
https://github.com/jyn514/saltwater
jsnell a day ago
Philpax a day ago
luke5441 a day ago
chilipepperhott a day ago
jcalvinowens a day ago
Being written in rust is meaningless IMHO. There is absolutely zero inherent value to something being written in rust. Sometimes it's the right tool for the job, sometimes it isn't.
modeless a day ago
Philpax a day ago
anematode a day ago
Honestly, probably not a lot. Not that many C compilers are compatible with all of GCC's weird features, and the ones that are, I don't think are written in Rust. Hell, even clang couldn't compile the Linux kernel until ~10 years ago. This is a very impressive project.
stephc_int13 a day ago
It means that if you already have, or are willing to build, a very robust test suite, and the task is a complicated but already-solved problem, you can get a sub-par implementation for a semi-reasonable amount of money.
This is not entirely ridiculous.
tonis2 14 hours ago
I wish they would do LLVM from scratch too.
IshKebab a day ago
> I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
This has been my experience of vibe coding too. Good for getting started, but you quickly reach the point where fixing one thing breaks another and you have to finish the project yourself.
sreekanth850 18 hours ago
Much better than cursor's browser fiasco.
lambda-lollipop 16 hours ago
Apparently hello world does not compile... https://github.com/anthropics/claudes-c-compiler/issues/1
7734128 a day ago
I'm sure this is impressive, but it's probably not the best test case given how many C compilers there are out there and how they presumably have been featured in the training data.
This is almost like asking me to invent a pathfinding algorithm when I've been taught Dijkstra's and A*.
NitpickLawyer a day ago
It's a bit disappointing that people are still rehashing the same old "it's in the training data" thing from 3 years ago. It's not like any LLM could regurgitate millions of LoC 1-for-1 from any training set... that's not how it works.
A pertinent quote from the article (which is a really nice read, I'd recommend reading it fully at least once):
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.
wmf a day ago
In this case it's not reproducing training data verbatim but it probably is using algorithms and data structures that were learned from existing C compilers. On one hand it's good to reuse existing knowledge but such knowledge won't be available if you ask Claude to develop novel software.
RobMurray a day ago
ofrzeta 16 hours ago
simonw a day ago
This is a good rebuttal to the "it was in the training data" argument - if that's how this stuff works, why couldn't Opus 4.5 or any of the other previous models achieve the same thing?
f311a 14 hours ago
That's because they still struggle hard with out-of-distribution tasks, even though some of those can be solved pretty well using existing training data. Focusing on out-of-distribution tasks would probably lower benchmark scores. They focus too much on common tasks.
And keep in mind, the original creators of the first compiler had to come up with everything: lexical analysis -> parsing -> IR -> codegen -> optimization. LLMs are not yet capable of producing a lot of novelty. There are many areas in compilers that can be optimized right now, but LLMs can't help with that.
lossolo a day ago
They couldn't do it because they weren't fine-tuned for multi-agent workflows, which basically means they were constrained by their context window.
How many agents did they use with previous Opus? 3?
You've chosen an argument that works against you, because they actually could do that if they were trained to.
Give them the same post-training (recipes/steering) and the same datasets, and voila, they'll be capable of the same thing. What do you think is happening there? Did Anthropic inject magic ponies?
fatherwavelet a day ago
At some point it becomes like someone playing a nice song on piano and then someone countering with "that is great but play a song you don't know!".
Then they start improvising and the same person counters with "what a bunch of slop, just making things up!"
falloutx a day ago
They can literally print out entire books line by line.
zephen a day ago
> It's a bit disappointing that people are still re-hashing the same "it's in the training data" old thing from 3 years ago.
They only have to keep reiterating this because people are still pretending the training data doesn't contain all the information that it does.
> It's not like any LLM could 1for1 regurgitate millions of LoC from any training set... This is not how it works.
Maybe not any old LLM, but Claude gets really close.
skydhash a day ago
Because for all those projects, the effective solution is to just use the existing implementation and not launder code through an LLM. We would rather see a stab at fixing CVEs or implementing features in open source projects. Like the wifi situation in FreeBSD.
Philpax a day ago
modeless a day ago
lunar_mycroft a day ago
LLMs can regurgitate almost all of the Harry Potter books, among others [0]. Clearly, these models can actually regurgitate large amounts of their training data, and reconstructing any gaps would be a lot less impressive than implementing the project truly from scratch.
(I'm not claiming this is what actually happened here, just pointing out that memorization is a lot more plausible/significant than you say)
[0] https://www.theregister.com/2026/01/09/boffins_probe_commerc...
StilesCrisis a day ago
secretsatan 13 hours ago
Who checks to see if it’s backdoored?
logicprog 20 hours ago
I will say that one thing that's extremely interesting is that everyone laughed at and made fun of Steve Yegge when he released Gas Town, which centered exactly around this idea: more than a dozen agents working on a project simultaneously, with some generalized agents focusing on implementing features while others are more specialized and tasked with second-order work, all run independently in a loop from an orchestrator until they've finished the project, working on worktrees and resolving merge conflicts as a coordination mechanism. But it's starting to look like he was right. He really was aiming for where the puck was headed. First we got Cursor with the fast render browser, then we got Kimi K2.5 releasing with (from everything I can tell) actually very innovative new RL techniques specifically for orchestrating agent swarms. And now we have this: Anthropic themselves doing a Gas Town-style agent-swarm model of development. It's beginning to look like he absolutely did know where the puck was headed before it got there.
Now, whether we should actually be building software in this fashion, or even heading in this direction at all, is a completely separate question, and I would tend strongly towards no. Not until, at least, we have very strong yet easy-to-use, concise, and low-effort formal verification, deterministic simulation testing, property-based testing, integration testing, etc. And even then, we'll end up pair-programming those formal specifications and batteries of tests with AI agents: not writing them ourselves, since that's inefficient, and not turning them over to agent swarms, since they are too important (if we turned them over to swarms, we'd end up with an infinite regress problem). And ultimately, that's just programming at a higher level at that point. So I would argue we should never predominantly develop in this way.
But still, there is prescience in Gas Town apparently, and that's interesting.
casey2 17 hours ago
Interesting that they are still going with a testing strategy despite the wasted time. I think in the long run model checking and proofs are more scalable.
I guess it makes sense, as agents can generate tests. Since you are taking this route, I'd like to see agents that act as users, which can only access docs, textbooks, user forums, and builds.
sho_hn a day ago
Nothing in the post about whether the compiled kernel boots.
chews a day ago
The video does show it booting.
davemp a day ago
Brute forcing a problem with a perfect test oracle and a really good heuristic (how many c compilers are in the training data) is not enough to justify the hype imo.
Yes this is cool. I actually have worked on a similar project with a slightly worse test oracle and would gladly never have to do that sort of work again. Just tedious unfulfilling work. Though we caught issues with both the specifications/test oracle when doing the work. Also many of the team members learned and are now SMEs for related systems.
Is this evidence that knowledge work is dead or AGI is coming? Absolutely not. I think you’d be pretty ignorant with respect to the field to suggest such a thing.
almosthere 21 hours ago
This is like the 6th trending claude story today. It must be obvious that they told everyone at Anthropic to upvote and comment.
light_hue_1 a day ago
> This was a clean-room implementation (Claude did not have internet access at any point during its development);
This is absolutely false and I wish the people doing these demonstrations were more honest.
It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.
Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.
rvz a day ago
> This is absolutely false and I wish the people doing these demonstrations were more honest.
That's because the "testing" was not done independently, so anything can possibly be made to be misleading. Hence:
> Written by Nicholas Carlini, a researcher on our Safeguards team.
gre a day ago
There's a terrible bug where, once it compacts, it sometimes pulls in .o or other binary files and immediately fills your entire context. Then it compacts again... 10 minutes and your token budget is gone for the 5-hour period. Edit: hooks that prevent it from reading binary files can't prevent this.
Please fix.. :)
pshirshov a day ago
Pfft, a C compiler.
Look at this: https://github.com/7mind/jopa
Havoc a day ago
Cool project, but they really could have skipped the mention of clean room. Something trained on every copyrighted thing known to mankind is the opposite of clean room
cheema33 a day ago
As others have pointed out, humans train on existing codebases as well. And then use that knowledge to build clean room implementations.
mxey a day ago
That’s the opposite of clean-room. The whole point of clean-room design is that you have your software written by people who have not looked into the competing, existing implementation, to prevent any claim of plagiarism.
“Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.”
kelnos a day ago
No they don't. One team meticulously documents and specs out what the original code does, and then a completely independent team, who has never seen the original source code, implements it.
Otherwise it's not clean-room, it's plagiarism.
regularfry a day ago
What they don't do is read the product they're clean-rooming. That's kinda disqualifying. Impossible to know if the GCC source is in 4.6's training set but it would be kinda weird if it wasn't.
HarHarVeryFunny 21 hours ago
True, but the human isn't allowed to bring 1TB of compressed data pertaining to what they are "redesigning from scratch/memory" into the clean room.
In fact the idea of a "clean room" implementation is that all you have to go on is the interface spec of what you are trying to build a clean (non-copyright violating) version of - e.g. IBM PC BIOS API interface.
You can't have previously read the IBM PC BIOS source code, then claim to have created a "clean room" clone!
pizlonator a day ago
Not the same.
I have read nowhere near as much code (or anything) as what Claude has to read to get to where it is.
And I can write an optimizing compiler that isn't slower than GCC -O0
cermicelli a day ago
If that's what clean room means to you, then I do know AI can definitely replace you, as even ChatGPT is better than that.
(prompt: what does a clean room implementation mean?)
From ChatGPT without login BTW!
> A clean room implementation is a way of building something (usually software) without copying or being influenced by the original implementation, so you avoid copyright or IP issues.
> The core idea is separation.
> Here’s how it usually works:
> The basic setup
> Two teams (or two roles):
> Specification team (the “dirty room”)
> Looks at the original product, code, or behavior
> Documents what it does, not how it does it
> Produces specs, interfaces, test cases, and behavior descriptions
> Implementation team (the “clean room”)
> Never sees the original code
> Only reads the specs
> Writes a brand-new implementation from scratch
> Because the clean team never touches the original code, their work is considered independently created, even if the behavior matches.
> Why people do this
> Reverse-engineering legally
> Avoid copyright infringement
> Reimplement proprietary systems
> Create open-source replacements
> Build compatible software (file formats, APIs, protocols)
I really am starting to think we have achieved AGI. > Average (G)Human Intelligence
LMAO
benjiro a day ago
Hot take:
If you try to reimplement something in a clean room, it's a step-by-step process using your own accumulated knowledge as the basis. That knowledge you hold in your brain all too often includes code that may have copyrights on it, from the companies you worked at.
Is it any different for an LLM?
The fact that the LLM is trained on more data does not change that when you work for a company, leave it, and take that accumulated knowledge to a different company, you are by definition taking that knowledge (which may be copyrighted) and implementing it somewhere else. It's only an issue if you copy the code directly, or do the implementation as a 1:1 copy. LLMs do not make 1:1 copies of the original.
At what point is being trained on copyrighted data any different than a human trained on copyrighted data who reimplements it in a transformative way? The big difference is that the LLM can hold more data over more fields vs. a human, true... But if we look at specializations, this comes back to the same thing, no?
Crestwave 21 hours ago
Clean-room design is extremely specific. Anyone who has so much as glanced at Windows source code[1] (or even ReactOS code![2]) is permanently banned from contributing to WINE.
This is 100% unambiguously not clean-room unless they can somehow prove it was never trained on any C compiler code (which they can't, because it most certainly was).
[1] https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...
[2] https://gitlab.winehq.org/wine/wine/-/wikis/Clean-Room-Guide...
cermicelli a day ago
If you have worked on a related copyrighted work you can't work on a clean room implementation. You will be sued. There are lots of people who have tried and found out.
They didn't have trillion-dollar AI companies to bankroll the defense, sure. But thinking about clean room while using copyrighted stuff is not even an argument; that's just nonsense to try to prove something when no one asked.
dmitrygr a day ago
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Worse than "-O0" takes skill...
So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal two weeks of one man's work.
Except the one man might explain such arbitrary and shitty code as this:
https://github.com/anthropics/claudes-c-compiler/blob/main/s...
why x9? who knows?!
Oh god, the more I look at this code the happier I get. I can already feel the contracts coming to fix LLM slop like this, when any company that takes this seriously needs it maintained and cannot...
ben_w a day ago
I'm trying to recall a quote. Some war where all defeats were censored in the news, possibly Paris was losing to someone. It was something along the lines of "I can't help but notice how our great victories keep getting closer to home".
Last year I tried using an LLM to make a joke language, I couldn't even compile the compiler the source code was so bad. Before Christmas, same joke language, a previous version of Claude gave me something that worked. I wouldn't call it "good", it was a joke language, but it did work.
So it sucks at writing a compiler? Yay. The gloriously indefatigable human mind wins another battle against the mediocre AI, but I can't help but notice how the battles keep getting closer to home.
sjsjsbsh a day ago
> but I can't help but notice how the battles keep getting closer to home
This has been true for all of (known) human history. I’m gonna go ahead and make another bold prediction: tech will keep getting better.
The issue with this blog post is it’s mostly marketing.
sebzim4500 a day ago
Can one man really make a C compiler in one week that can compile linux, sqlite, etc.?
Maybe I'm underestimating the simplicity of the C language, but that doesn't sound very plausible to me.
dmitrygr a day ago
Yes, if you do not care to optimize. Source: done it.
Philpax a day ago
bwfan123 a day ago
> I can already feel the contracts coming to fix LLM slop
First, the agents will attempt to fix issues on their own. Most easy problems will be fixed or worked around in this manner. The hard problems will require a deeper causal model of how things work. For these, the agents will give up. But the codebase will have evolved to a point where no one understands what's going on, including the agents and their human handlers. Expect your phone to ring at that point, and prepare to ask for a ransom.
small_model a day ago
Claude is only a few years old so we should compare it to a 3 year old human's C compiler
notnullorvoid 21 hours ago
Claude requires many lifetimes' worth of data to "learn". Evolution aside, humans don't require much data to learn, and our learning happens in real time in response to our environment.
Train Claude without the programming dataset and give it a dozen of the best programming books, and it'll have no chance of writing a compiler. Do the same for a human with an interest in learning to program and there's a good chance.
zephen a day ago
Claude contains the entire wisdom of the internet, such as it is.
sjsjsbsh a day ago
> I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot
Honest question, do you think it’d be easier to fix or rewrite from scratch? With domains I’m intimately familiar with, I’ve come very close to simply throwing the LLM code out after using it to establish some key test cases.
dmitrygr a day ago
Rewrite is what I’ve been doing so far in such cases. Takes fewer hours
sjsjsbsh a day ago
> So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026
What? Didn’t cursed lang do something similar like 6 or 7 months ago? These bombastic marketing tactics are getting tired.
ebiester a day ago
Do you not see the difference between a toy language and a clean room implementation that can compile Linux, QEMU, Postgres, and sqlite? (No, it doesn't have the assembler and linker.)
That's for $20,000.
falloutx a day ago
People have built compilers for free. With $20,000 you could even hire a couple of devs for a year in low-income countries.
jsnell a day ago
No? That was a frontend for a toy language using LLVM as the backend. This is a totally self-contained compiler that's capable of compiling the Linux kernel. What's the part that you think is similar?
andrepd 12 hours ago
This chatbot has several C compilers in its training data. How is this possibly a useful benchmark for anything? LLMs routinely output code verbatim or modulo trivial changes as their own (very useful for license-laundering too).
trilogic a day ago
Can it create employment? How is this making life better? I understand the achievement, but come on, wouldn't it be something to show if you created employment for 10,000 people using your 20,000 USD!
Microsoft, OpenAI, Anthropic, xAI: all solving the wrong problems, your problems, not the collective ones.
m4ck_ 18 hours ago
Didn't you hear? We're heading towards a workless utopia where everything will be free (according to people who are actively working to eliminate things like food assistance for less fortunate mothers and children.)
stinkbeetle 12 hours ago
Who are some of those people?
jeffbee a day ago
"Employment" is not intrinsically valuable. It is an emergent property of one way of thinking about economic systems.
wiseowise 12 hours ago
That’s the most HN reply ever. Obtuse and pedantic.
Tell a struggling undergrad or an unemployed person that "employment" is not intrinsically valuable; maybe they'll be able to use the rhetoric to move a couple of positions higher in a soup kitchen queue before their food coupons expire.
trilogic a day ago
By employment I mean "WHATEVER LEADS TO REWARDS THAT HELP COLLECTIVE HUMANS SURVIVE".
Call it as you wish, but I am certainly not talking about coding values.
falcor84 a day ago
mofeien a day ago
Obviously a human in the loop is always needed and this technology that is specifically trained to excel at all cognitive tasks that humans are capable of will lead to infinite new jobs being created. /s
bsoles a day ago
The title should have said "Anthropic stole GCC and other open-source compiler code to create a subpar, non-functional compiler", without attribution or compensation. Open source was never meant for thieving megacorps like them.
No, I did not read the article...
ur-whale 11 hours ago
> We tasked Opus 4.6 using agent teams to build a C Compiler
So, essentially to build something for which many, many examples already exist on the web, and which is likely baked into its training set somehow ... mmmyeah.
falloutx a day ago
So it copied one of the C compilers? This was always possible but now you need to pay $1000 in API costs to Anthropic
Rudybega a day ago
It wrote the compiler in Rust. As far as I know, there aren't any Rust based C compilers with the same capabilities. If you can find one that can compile the Linux kernel or get 99% on the GCC torture test suite, I would be quite surprised. I couldn't in a search.
Maybe read the article before being so dismissive.
falloutx a day ago
Why does the language of the compiler matter? It's a solved problem, and since other implementations are already available, anyone could already transpile them to Rust.
Rudybega a day ago
hgs3 a day ago
> As far as I know, there aren't any Rust based C compilers with the same capabilities.
If you trained on a neutral representation like an AST or IR, then the source language shouldn't matter. *
* I'm not familiar with how Anthropic builds their models, but training this way should nullify PL differences.
astrange 12 hours ago
chucksta a day ago
Add a 0 and double it
> Over nearly 2,000 Claude Code sessions and $20,000 in API costs
lossyalgo a day ago
One more reason RAM prices will continue to go up.
undefined a day ago
chvid a day ago
100,000 lines of code for something that is literally a textbook task?
I guess if it had only created 1,000 lines it would be easy to see where those lines came from.
falcor84 a day ago
> literally a text book task
Generating a 99% compliant C compiler is not a textbook task in any university I've ever heard of. There's a vast difference between a toy compiler and one that can actually compile Linux and Doom.
From a bit of research now, there are only three other compilers that can compile an unmodified Linux kernel: GCC, Clang/LLVM and Intel's oneAPI. I can't find any other compiler implementation that came close.
cv5005 a day ago
That's because you need to implement a bunch of GCC-specific behavior that Linux relies on. A 100% standards-compliant C23 compiler can't compile Linux.
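For a flavor of what GCC-specific means in practice, an illustrative snippet of the kinds of extensions the kernel leans on (written for this comment, not lines from the kernel): statement expressions, typeof, section and alignment attributes, and extended inline asm.

    /* None of this is standard C (pre-C23 at least), yet the kernel
     * depends on all of it. The asm is x86-only, purely for illustration. */
    #define max(a, b) ({ typeof(a) _a = (a); typeof(b) _b = (b); _a > _b ? _a : _b; })

    struct initcall {
        int (*fn)(void);
    } __attribute__((aligned(8)));

    static int my_init(void) { return 0; }

    /* Named sections are how the kernel builds its initcall and exception tables. */
    static struct initcall entry __attribute__((section(".my.initcalls"), used)) = { my_init };

    static inline unsigned long read_flags(void)
    {
        unsigned long flags;
        __asm__ volatile("pushf; pop %0" : "=r"(flags));  /* extended asm with constraints */
        return flags;
    }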
falcor84 a day ago
anematode a day ago
A simple C89 compiler is a textbook task; a GCC-compatible compiler targeting multiple architectures that can pass 99% of the GCC torture test suite is absolutely not.
wmf a day ago
This has multiple backends and a long tail of C extensions that are not in the textbook.
blibble a day ago
indeed
building a working C compiler from scratch is literally in my "teach yourself C in 24 hours" book from 30 years ago
simonw a day ago
Which book was that? Sounds excellent.
Might have been Compiler Design in C from 1990. Looks like that's available for free now: https://holub.com/compiler/
blibble 21 hours ago
fxtentacle a day ago
You could hire a reasonably skilled dev in India for a week for $1k, or you could pay $20k in LLM tokens, spend 2 hours writing essays to explain what you want, and then get a buggy mess.
Philpax a day ago
No human developer, not even Fabrice Bellard, could reproduce this specific result in a week. A subset of it, sure, but not everything this does.
falloutx a day ago
just forked https://github.com/Vexu/arocc and it took me 5 seconds to complete it.