How LLMs work (0xkato.xyz)
673 points by 0xkato 3 days ago
malwrar 13 hours ago
Back when ChatGPT came out, I was so shocked by how _good_ it was for an “AI” product that I simply had to know how it worked. Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step in the blackboard. I’d puzzle about each step along the way, and the triumph of completing the drawing was also that of this sense of deep understanding. I kept that drawing up for many months after, and would gaze at it often during meetings and idle moments in wonder.
This is to say: the autoregressive decoder-only transformer llm architecture as pioneered by openai is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (uses video + handcrafted math to produce 3d mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is because the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.
This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. I’ve been meanwhile watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.
ekunazanu 9 hours ago
> This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities
Basically, the bitter lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
williamstein an hour ago
This interview https://youtu.be/oWOz2htozfI?si=qdQ0uZRoZOYeThOn from 2 days ago with a top researcher from OpenAI directly addresses the bitter lesson argument and the importance of scaling for the history of their models.
xnx an hour ago
Isn't the bitter lesson basically the same as "The Unreasonable Effectiveness of Data" from 2009?
jfim 12 hours ago
Indeed. It's pretty interesting to realize after implementing GPT-2 that the frontier models are scaled up versions of that, with various tweaks to improve performance, model-wise.
The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.
gobdovan 10 hours ago
The secret sauce is also having the necessary 'creativity' to not get ceased and desisted into oblivion and jail from all the copyrighted material you trained your model on. Btw, not making a moral judgement, [0] shows Michael and Dalton from YC discussing why Ilya Sutskever had to leave Google to pursue what's now ChatGPT
root-parent 3 hours ago
someguyiguess 3 hours ago
miltonlost 2 hours ago
achrono 11 hours ago
How do we know that today's frontier models are merely scaled up versions of that? Genuine question, since the labs have narrowed what they share over the years to now almost nothing, in terms of how the model was trained and how it works under the hood.
HarHarVeryFunny 3 hours ago
gobdovan 10 hours ago
matusp 10 hours ago
ai_slop_hater 11 hours ago
locknitpicker 2 hours ago
> The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.
ReAct loops and tool-calling are the critical development feature. They turn a model from something that generates text into something that can independently influence the world around them.
Without agent features, you have just a chatbot.
forestsitter an hour ago
Same. I recall reading a paper by Stephen Wolfram after ChatGPT came out where he goes over how it works and what it does. Such a good piece and really got me going with this stuff. https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
antirez 9 hours ago
There is a different way to look at this: the Transformer is actually a minimal complication of what the base model is. In theory the neural network could be just a huge FFN, which is anyway the part of the Transformer that does the heavy lifting. But this would be impossible to train both numerically and computationally, so the Transformer encodes enough priors for it to work: the causal attention, and the math tricks like the residuals and so forth. But the bottom line of all this is that the Transformer works because of the incredible semantical power of simple/huge FFNs.
dist-epoch 6 hours ago
Isn't that over-simplifying it a bit too much?
You can go another step - a FFN can be simulated on a Turing machine, thus it just exemplifies the incredible semantical power of the Turing machine model of computation. (in fact you don't even need a Turing machine, since there is no looping in one forward pass).
In theory you can run a huge FFN on the tiniest Turing machine, in practice it's much better to run a Transformer on the latest NVIDIA hardware. Or as they say "quantity (performance) has a quality all its own"
musebox35 5 hours ago
zbendefy 6 hours ago
root-parent 4 hours ago
I had the same reaction as you, when I learned in detail, how all this works. But then I also learned about superposition and compressed sensing, and now...I am not so sure anymore...
"Beating Nyquist with Compressed Sensing" - https://youtu.be/A8W1I3mtjp8
crossroadsguy 9 hours ago
What hopes/paths does a mere CS bachelor (not deep into stats/maths), and mid level dev (native mobile only; 10-15 years exp.), have about not only understanding it (maybe not fully) but getting possibly into this as a career? Not expecting churning out models and AI systems from the first weeks/months but entry/employment into this field?
(If I can be honest, and I am not being disparaging about anything lest it might seem so, I am looking at it from a career breakthrough/move perspective rather than an intellectual pursuit.)
malwrar 21 minutes ago
I’m also a mere mortal, and after putting a few years into it I’d say people make it much more complicated than it actually is. I failed most of my math courses for lack of interest, but found passion later with the aforementioned SLAM stuff. I have no doubt you or any other programmer could learn this stuff, especially since you can ask ChatGPT clarifying questions.
I have no idea about careers at this point. I’m still doing fancy IT work as my day job and I look away from the future with dread. I also haven’t been looking for new roles on the open job market, so who knows, maybe there are multimillion pay packages for anyone who can articulate how attention works in an interview.
2muchcoffeeman 7 hours ago
I think you need to ask what you actually want to do with the AI.
If you want to be a researcher and come out with the next breakthrough, get ready to go back to school and learn some math.
If you just need to learn how to use it well and build things with it, then you probably just need to have a high level understanding.
Same as programming. I’d bet most programmers have no idea about the physics that makes computers work.
bluerooibos 5 hours ago
sirsinsalot 7 hours ago
LatencyKills 4 hours ago
I have a BS in CS (and have been in the field for 25 years). I couldn't understand the transformer architecture until I built a few myself. Here are the books I worked through. I now feel I have a very good understanding of modern LLMs.
https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...
https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...
wuschel 11 hours ago
Could you perhaps cite the core papers for LLMs beyond „Attention is all you need“?
sigmoid10 11 hours ago
"Attention is all you need" is actually a bad paper if you want to learn about autoregressive LLMs specifically, because it describes a more complicated encoder-decoder architecture while modern LLMs are decoder only. So it's an unnecessarily hard way to get into the subject. "Language Models are Unsupervised Multitask Learners" is probably what you are looking for (aka the GPT-2 paper). This was the first time LLMs really showed what is possible, i.e. they can learn to generalize very well from unstructured data. So no more human labelling necessary, which until then was the primary bottleneck in ML. The paper also lists several key ingredients beyond transformers that are mostly still in place today. This also highlights that there was more to it than just "scaling the transformer algorithm" like many people claim. Most developments since then were about improving training data, until "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" drastically changed the architecture landscape again. Later big developments like thinking/reasoning/chain of thought/inference time compute (whatever you want to call it nowadays) are actually all about training again. They work using the exact same architecture.
redox99 8 hours ago
blackbear_ 10 hours ago
The GPT3 paper is a good starting point
Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165
I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things work
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471
sharma-arjun 10 hours ago
Not a core paper, but I found Formal Algorithms for Transformers [1] (a Google paper from 2022) to have a great pedagogical style.
barrenko 10 hours ago
10GBps 12 hours ago
Yep. It's nearly identical to the neural nets we were using in the 90s. Back then even a supercomputer wasn't big enough or fast enough to do what we do today.
I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.
The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.
ctolsen 11 hours ago
No, it’s definitely not what a human brain is. That makes very little sense. The ways we interact with language (and thus conceptual memory) are completely and fundamentally different.
rfv6723 10 hours ago
zaphirplane 4 hours ago
redox99 8 hours ago
In the 90s you didn't have norm layers, residuals, attention, and some more.
So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.
sirsinsalot 7 hours ago
bonoboTP 11 hours ago
Attention layers were not used in the 90s.
spacebacon 10 hours ago
LLMs are semiotic infrastructure. You won’t find a better analogy. The cognitive frame won’t hold.
otabdeveloper4 11 hours ago
> I mean a brain is not just neurons with simple connections to each other.
No, it's not. There are many animals that have extremely complex and even learned behaviour that have literally zero neurons.
Clearly "neurons" is an oversimplification just-so story, not a scientific theory.
adammarples 8 hours ago
formerly_proven 9 hours ago
foxes 12 hours ago
Probably better to not simply reduce it by just saying X is Y then if it has all that extra complexity and capacity.
sesm 7 hours ago
I would argue that those are not emergent properties of the model, but a property of how humans find insights in a plausible guess.
GardenLetter27 9 hours ago
It's not just the architecture but also the data: the decoder-only approach lets you train in parallel over blocks of text (no RNN serial waiting), which allows you to train on much, much more data.
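A minimal sketch of why the causal mask makes this possible, assuming a single attention head with no learned weights (the vectors act as their own queries, keys and values; the function name and the numbers are made up for illustration). Because position i can only look at positions 0..i, every prefix of a block gets its output from one pass over the block, which is what lets training use every position as a prediction target at once. A real implementation would do this as one masked matrix multiply rather than a Python loop.

```python
import math

def causal_attention(xs):
    """Single-head self-attention over a list of vectors, using the vectors
    themselves as queries, keys and values (no learned weights, for brevity)."""
    outs = []
    for i, q in enumerate(xs):
        # Causal mask: position i may only attend to positions 0..i.
        visible = xs[: i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in visible]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outs.append([
            sum(w / total * v[d] for w, v in zip(weights, visible))
            for d in range(len(q))
        ])
    return outs

block = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full = causal_attention(block)

# Replacing the last token leaves every earlier output untouched, so the
# outputs for all prefixes come out of a single pass over the block.
changed = causal_attention(block[:3] + [[9.0, 9.0]])
prefix_outputs_unchanged = full[:3] == changed[:3]
```

An RNN, by contrast, has to finish step i before it can start step i+1, which is the serial waiting mentioned above.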
bluerooibos 5 hours ago
Since you spent a month digging into this, can you recommend any materials/projects to look into to get a decent grasp of how they work?
malwrar 11 minutes ago
I’d recommend my method of just drawing out the block diagram and drawing out + digging into the math at each step! I’m the kind of person who needs to take time to ask lots of questions before stuff clicks, and if you are too I strongly recommend it.
I picked it up from trying to teach myself that SLAM stuff. The papers are very short, but highly information dense and at the time there was no ChatGPT to help me. I got through them by just creeping my way through the math with a whiteboard, and something about drawing it out and having it there in my office made it all click. Trying to watch piecemeal lectures on YouTube or grind through foundational books like MVG just didn’t work for me, I used them instead as references for my drawings.
Same happened when I tried learning this GPT stuff. karpathy’s videos were out at the time, but I couldn’t really stay focused on them or connect the math with the code. Most other descriptions I could find were focused on getting you to use their inference library or harness. Assembling the picture together on my whiteboard by focusing on drawing out the block diagram continues to be my personal favorite method for deep understanding of complex systems.
LatencyKills 4 hours ago
Not OP but I worked through Sebastian Raschka's "Build a Large Language Model (From Scratch)" [0] and Raj Abhijit Dandekar's "Build a DeepSeek Model (From Scratch)" [1] books.
I don't think there is anything in a transformer I couldn't explain in the smallest detail now.
[0]: https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...
[1]: https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...
hackinthebochs 4 hours ago
darksim905 12 hours ago
For anyone who is curious about the first paragraph here, this is actually a great video overview of how LLM works and the tokenization part.
Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.
cloche 42 minutes ago
> this is actually a great video overview of how LLM works and the tokenization part
Did you mean to link to the video? I would be interested.
pkoird 12 hours ago
aka "the bitter lesson"
Gmolomo 9 hours ago
Sooooo just because you are able to understand it, it's not worth anything?
It doesn't have any impact?
Ah wait it does. Mh weird.
Why are you not creating a startup and getting rich?
sarjann 9 hours ago
I mean there is a little something called compute. And other complexity that comes like writing code to efficiently distribute a model across machines.
dominotw 4 hours ago
> Over the next month I ended up drawing out a block diagram on a whiteboard I have in my office, with the math involved next to each step in the blackboard. I’d puzzle about each step along the way, and the triumph of completing the drawing was also that of this sense of deep understanding. I kept that drawing up for many months after, and would gaze at it often during meetings and idle moments in wonder.
How did you know what the steps were and that there was math involved? I am curious about your process and how you worked out exactly what to learn to unravel the mystery.
coliveira 4 hours ago
Don't forget the stolen data from books and papers. You'll never get anything intelligent without using the stolen data they had access to.
golergka 8 hours ago
After building some toy LLMs on my own I came to realise that architecture is not the hard part. Training is.
dist-epoch 6 hours ago
That's easy to say AFTER you know the architecture.
Einstein's special relativity is taught these days in high schools. That doesn't mean it wasn't the very hard part at some point in time.
As they say, shoulders of giants.
faurroar 12 hours ago
Architectures have evolved significantly since then. DeepSeek v4 =/= GPT-3. Even then, a great deal of complexity lies in everything surrounding the architectures e.g. how do you implement them performantly on modern accelerators, how do you distribute the model across a set of accelerators, how do you post-train, etc. And pre-training itself is a dark art. If you legitimately think that frontier labs are doing something equivalent to whatever you wrote on your whiteboard, you’re clueless.
jumploops 12 hours ago
Those are all just optimizations.
We still don’t really know why they work, we just know how to build them.
trollbridge 12 hours ago
slopinthebag 11 hours ago
otabdeveloper4 11 hours ago
firemelt 9 hours ago
fucking well said
robwwilliams 3 hours ago
Great, and won’t we all be just as surprised when human self-attentional control turns out to be just as simple or just as complex! Our minds as a strange fabric built of threads of recursions without the benefit of any explicit clock.
miki123211 3 hours ago
There's one thing I wish people understood about LLMs, and it doesn't really have anything to do with what's inside the neural network part. It's the fact that LLMs can only write in one direction — forward.
When you are writing an essay and realize midway through a sentence that what you've written doesn't make sense, you go back and edit. An LLM can't do that, the only thing it can do is keep on generating. Because training data typically contains full essays and not half-finished sentences which were then edited, LLMs have a strong preference for "saving face" and producing grammatically correct, internally coherent outputs. They will often do so even if the only way to write themselves out of the corner they wrote themselves into is to lie. To maintain internal coherence, they'll then repeat that lie for the rest of the response.
This is also why changing response structure used to affect LLM performance so dramatically. If you asked an LLM to solve a math problem and all-but-forced it to start with the answer, it would have had to calculate that answer before emitting any tokens, something which it very often wasn't able to do. If it was told to follow up the answer with an explanation, it would produce a plausible-sounding explanation to maintain coherence.
If, on the other hand, it was told to start by "thinking step by step", it would often be able to solve the first step, and then the next one given the results of the first, and so on, until it was able to reach the answer. Because the answer came last, it wasn't committing to anything, so had no reason to "save face" and lie.
This part of the problem is basically solved now with reasoning; reasoning is where all the step-by-step stuff happens, even if users aren't always able to see it. In the process of RLVR, models even train themselves into outputting phrases like "let me check my answer once again" in the chain-of-thought; those serve as their "life rafts" which they can use to both save face and change their answer.
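The "forward only" point can be sketched with a toy generator. This assumes a made-up bigram table standing in for the model (the table, the token names and the function are hypothetical, nothing like a real LLM); the only thing it illustrates is that the loop can append to the sequence and never edits what is already there.

```python
import random

# Hypothetical bigram "model": maps the last token to possible next tokens.
BIGRAMS = {
    "<s>": ["the"],
    "the": ["cat", "dog"],
    "cat": ["sat", "ran"],
    "dog": ["sat", "ran"],
    "sat": ["</s>"],
    "ran": ["</s>"],
}

def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<s>"]
    history = []  # snapshot of the sequence after each step
    while len(tokens) < max_tokens and tokens[-1] != "</s>":
        next_token = rng.choice(BIGRAMS[tokens[-1]])
        tokens.append(next_token)  # the only operation available: append
        history.append(list(tokens))
    return tokens, history

tokens, history = generate()

# Every snapshot is a prefix of the final output: nothing already emitted
# is ever revised, only extended.
append_only = all(tokens[: len(step)] == step for step in history)
```

Any "going back" has to be expressed as more appended tokens, which is what the chain-of-thought phrases described above amount to.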
chris_money202 an hour ago
In terms of our brains though we can only think forward as well (if forward is time). Our brain in the future says something we did in the past was wrong (part of the sentence we wrote) and that informs our body (the agent) to go back and fix it
helloplanets 9 hours ago
The part about positional encoding is not correct.
> The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its position
You can't rotate the token's entire vector (or all three vectors, whatever is being implied is unclear). You rotate each token's Query and Key vectors only, so dot product can be used to tell how far apart the tokens are when comparing token 1's Query vector to token 2's Key vector.
Positional embedding should just be explained after explaining the Query, Key and Value vectors. When the article explains those only after that, the reader is building up on a wrong intuition and it gets confusing.
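A small sketch of the Query/Key rotation described above, assuming the common RoPE formulation where consecutive pairs of dimensions are rotated by position times a per-pair frequency (the vectors and positions are made up). It shows the property that matters: the dot product of a rotated Query and a rotated Key depends on how far apart the two positions are, not on where they sit in the sequence.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by an angle proportional to position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # made-up query vector
k = [1.1, 0.4, -0.7, 0.9]   # made-up key vector

# Same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

# A different offset (1 position apart) gives a different score.
score_other = dot(rope(q, 5), rope(k, 4))
```

Value vectors are left unrotated in this formulation, since position only needs to influence the attention scores.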
oceansky an hour ago
Out of curiosity, I wondered if you could break a tokenizer by introducing weird characters not mapped to an id.
But apparently, they either just emit a [UNK] token or translate the unrecognized character into raw UTF-8 bytes.
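A toy sketch of the byte-fallback behaviour, assuming a hypothetical vocabulary of lowercase letters and space plus 256 reserved byte tokens (real tokenizers use learned subword vocabularies; the ids and names here are invented). Any character outside the vocabulary is emitted as its raw UTF-8 bytes, so nothing is unencodable and the text round-trips.

```python
# Hypothetical toy vocabulary: a few known characters plus 256 byte tokens.
CHAR_VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
BYTE_OFFSET = len(CHAR_VOCAB)  # byte value b maps to id BYTE_OFFSET + b

def encode(text):
    ids = []
    for ch in text:
        if ch in CHAR_VOCAB:
            ids.append(CHAR_VOCAB[ch])
        else:
            # Fallback: one token per UTF-8 byte of the unknown character.
            ids.extend(BYTE_OFFSET + b for b in ch.encode("utf-8"))
    return ids

def decode(ids):
    id_to_char = {i: ch for ch, i in CHAR_VOCAB.items()}
    out = bytearray()
    for i in ids:
        if i >= BYTE_OFFSET:
            out.append(i - BYTE_OFFSET)
        else:
            out.extend(id_to_char[i].encode("utf-8"))
    return out.decode("utf-8")

ids = encode("hi \u2603")   # U+2603 (snowman) is not in the toy vocabulary
round_trip = decode(ids)
```

The cost is sequence length: the one unknown character above becomes three tokens.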
10GBps 14 hours ago
I learned TCP/IP by watching and reading raw packets over packet radio at 1200 baud.
I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. input tokens = output tokens, it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.
trollbridge 12 hours ago
Comment <-> username synergy.
helloplanets 8 hours ago
It's basically possible to build an LLM using just routers+packets, and then hook them up to Wireshark to see it compute!
Maledictus 11 hours ago
How would I set this up?
barrenko 10 hours ago
I'd recommend also watching Karpathy's videos and focusing on the early parts where he deals with tokenization / embeddings generation (which gets really overlooked); he does this in most of his videos.
fragmede 11 hours ago
https://distill.pub/2019/activation-atlas/
I can only imagine what sort of visualizations are going on today inside of the AI labs.
alecco 6 hours ago
A better blog on Transformers: https://www.aleksagordic.com/blog/transformer
vocram 10 hours ago
Saying an article is of inferior quality just because editing was AI-assisted is like saying a book is lower quality just because it was printed rather than written by hand
Ampersander 8 hours ago
You are exactly right! People do not find the writing obnoxious, they are backwards technophobes getting brought down by their superstitions.
lateral_cloud 9 hours ago
AI assisted is a stretch. And that analogy isn't even close to being relevant
bspammer 9 hours ago
No? One affects the actual text and the other doesn’t.
possibleworlds 4 hours ago
This analogy makes absolutely no sense.
Laurel1234 9 hours ago
Rather interesting that clanker slop defenders downplay the clanker aspect and highlight the human by calling it "ai-assisted", which defeats their entire point.
I hope you do some introspection and start consciously recognizing that the human input and the clanker slop is just debasing it.
janalsncm 9 hours ago
Not just that, I think a lot of people are going to waste their time losing the battle (and make no mistake, they will lose) fighting against AI writing without ever asking themselves what makes writing good in the first place.
There’s good AI writing and bad organic writing. But it’s easier to point out a few LLM-isms than to actually identify the problems with text.
blharr 3 hours ago
> There's good AI writing
Sure, but the LLM-isms in AI writing are mentally exhausting to see in every way at this point.
The whole point of reading, frankly, is to understand the voice of other people. When you pass that through a distorted filter that makes everyone sound the same... it's bad, lossy, frustrating communication
It's also dishonest when you publish something that is direct output without your wording. Digital catfishing at best.
The only good AI writing is providing the prompt, because the question is way more interesting, and way more constructive to learning than the answer
zenfoxai 2 hours ago
Nice article but chain of thought is what makes frontier LLMs smart, not really the token loop
brcmthrowaway 30 minutes ago
Is chain of thought the same as test-time compute?
agumonkey an hour ago
Nice intro, gonna help me dig further a lot now. Thanks a ton.
andai 14 hours ago
I couldn't load the article directly due to an SSL issue, so here's the archive link:
whyage 4 hours ago
Style nit: the transitions between dark-mode text and large diagrams with a snow white background are jarring.
AltruisticGapHN 9 hours ago
I don't like how most LLM explainer articles and videos say that an LLM essentially "predicts the next word".
I'm a developer but not very good at maths and I still don't understand any of it.
An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Recently I wanted a checkbox with a gradient flowing around the edge. It figured out it could use a radial gradient from the center of the checkbox and overlay that with a small inner div, so you only see the edge and the gradient looks like it is circling around the checkbox.
How is that "predicting the next word"?
Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".
What I mean, is the LLM is able to represent things in space . That part I don't understand.
I also still don't understand the relationship between the chat-based LLM and the multi-modal stuff. I think I read somewhere that when an image is generated it is also tokens?
dev_hugepages 8 hours ago
Predicting a word is the final objective, in that the output of the model is a probability distribution over the next token. However, choosing the right token is more complicated than just regurgitating the training data (you won't encounter an exact example in the training data, so you need to interpolate). This makes the model learn abstract representations of things, which it is able to manipulate before turning the result back into tokens. RL also complicates this because the "fitness" is now some arbitrary metric computed over an entire sequence of tokens.
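To make "a probability distribution over the next token" concrete, here is a minimal sketch assuming a made-up three-token vocabulary and made-up scores (logits). The model's last layer produces one score per vocabulary entry; softmax turns the scores into probabilities, and the next token is either the most likely one or a sample from the distribution.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["radial", "linear", "banana"]  # made-up vocabulary
logits = [2.0, 1.0, -3.0]               # made-up model scores

probs = softmax(logits)

# Greedy decoding picks the single most likely token.
greedy = vocab[probs.index(max(probs))]

# Sampling draws from the distribution, so less likely tokens can appear.
sampled = random.Random(0).choices(vocab, weights=probs, k=1)[0]
```

Everything interesting happens in how the logits are computed; this last step is the same whether the model is tiny or frontier-scale.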
Borealid 9 hours ago
Your casual understanding is imprecise.
At all times the LLM is, indeed, predicting the next token. Anything it does emerges from that.
It did not "figure anything out". It predicted that text describing the use of a radial gradient was likely to follow text describing your problem.
hackinthebochs 4 hours ago
>At all times the LLM is, indeed, predicting the next token
The point is that saying they're just "predicting the next token" is not at all explanatory nor providing insight. Saying the brain is just firing action potentials gives you no understanding about how the brain does what it does or what the space of its capabilities are. Similarly, predicting the next token tells you nothing about the capabilities of LLMs.
layla5alive 7 hours ago
Lol, the bird did not 'fly' - it just flapped its wings and generated lift!
Borealid 4 hours ago
qsera 7 hours ago
raincole 3 hours ago
> What I mean, is the LLM is able to represent things in space . That part I don't understand.
Why do you think this is mutually exclusive to "LLM predicts the next token"?
If you tell someone from 19th century that bytes (just 0s and 1s!) can represent an opera, a song, or even a whole interactive experience, they might be really confused. But there is no reason they can't.
If you tell someone without a math background that sums of smaller and smaller sine waves can represent pretty much anything in our universe, they might be really confused. But there is no reason they can't.
There is simply no reason that a next-token predictor can't generate a nice-looking checkbox.
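The sine-wave remark presumably refers to Fourier series. As a small sketch under that assumption, the partial sums below add up ever smaller odd harmonics of a unit square wave; at the midpoint of the wave's high half the sum gets closer to the true value 1 as terms are added (the function name is made up for illustration).

```python
import math

def square_wave_partial_sum(t, n_terms):
    """Sum of the first n_terms odd sine harmonics of a unit square wave."""
    return (4 / math.pi) * sum(
        math.sin((2 * k + 1) * t) / (2 * k + 1) for k in range(n_terms)
    )

t = math.pi / 2  # the square wave equals 1 here

# Absolute error against the true value for 1, 10 and 100 terms.
errors = [abs(1 - square_wave_partial_sum(t, n)) for n in (1, 10, 100)]
```

Simple building blocks, enough of them, and a target to fit: the analogy to next-token prediction is loose but that is the shape of it.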
nchie 8 hours ago
I understand that to be the "emergent abilities" which are spoken about. There are correlations in the dataset that are strong enough for it to seem to have an understanding which wasn't obvious it would have from simply "predicting the next word".
antran22 9 hours ago
It's still predicting the next word. Somewhere in the gigantic dataset that the LLM was trained on, the phrase "gradient border" appears in the vicinity of CSS code that renders the effect. Therefore, when you run it in an inference loop, there's a good chance it outputs that CSS code when you tell it to render a "gradient border".
Multi-modal models that can understand visual input do exist, but no such visual reasoning process happened in the example you mentioned. Not unless you have a visual feedback loop in the coding harness.
I'm not dismissing the capability of "predicting the next word", however. The vast amount of training data enables the extremely complex and useful behavior you just described.
360MustangScope 7 hours ago
What about things it wasn’t trained on?
For instance I’ve written a few custom languages to learn how to write a VM and the lexer/parser/compiler/etc. that it had never seen before and then just gave it the syntax which is different than what it had ever seen before. Simply due to the fact I made it and it had never been trained on it.
After giving it my documentation, it was able to write the language just like a language that it had been trained on. I’ve also seen this behavior at work where there are weird quirks to do things and definitely not standard and it can handle it.
qsera 7 hours ago
skydhash 5 hours ago
YeGoblynQueenne 3 hours ago
Marha01 7 hours ago
LLMs fundamentally work by predicting the next word (token). But that should not be used to diminish their potential capabilities. It's like saying that human brains "just predict (or produce) the next electrical impulse". Fundamentally correct, but says nothing about the potential emergent capabilities of scaled-up systems that work like that.
Emergent properties of complex systems should not be diminished just because the underlying operating principle is simple.
YeGoblynQueenne 4 hours ago
Sorry you're being downvoted for asking a very reasonable question. I don't think any of the replies here address your question either.
If I can do my best to answer, Gemini is a multi-modal system. That means it's trained not only on text but also still images, video and also sound. The training happens in parallel and the representation of each modality is usually different, so the image recognition part is not trained on text tokens but pixels, the video part (probably) on video frames etc. There is some kind of integrated training that goes on so that text can be generated that is correlated to an image and so on, but I don't know the specifics about Gemini in particular. This kind of thing is not exactly new either, you can find systems that captioned images before the rise of LLMs simply by training on examples of images coupled to their textual descriptions.
In that sense it's not entirely correct to call Gemini an "LLM" because it's not only a "language" (or, more precisely, text) model. But LLM I guess becomes a bit of a shorthand for everything based on, or combined with, an LLM.
Anyway that's what's going on: it's not just predicting the next word. It's also predicting the next image frame or the next set of pixels etc associated with the next word.
qsera 7 hours ago
>is the LLM is able to represent things in space
It is imitating the text written by humans who can represent things in space.
MagicMoonlight 6 hours ago
It can’t. It’s like a Redditor, it just repeats what it has seen other people say.
It has read all of stackoverflow, so it has seen your kind of problem before. Try asking it something really unusual and it will shit the bed.
mjmsmith 3 hours ago
> It’s like a Redditor, it just repeats what it has seen other people say.
Can stochastic parrots understand irony?
Ampersander 8 hours ago
I do agree bigly. Calling what is basically a superhuman brain inside a computer just a "token predictor" is peak thinkslop.
otabdeveloper4 3 hours ago
Inside the magic AI box is literally nothing but this loop:
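// Pseudocode: decode(), token_to_text(), print() and context stand in for the real runtime.
int n_tokens = 0;
int position = 0;  // assumed starting point: nothing decoded yet
while (n_tokens < TOKENS_MAX) {
    // One forward pass over the context yields the next token.
    int next_token = decode(context, ++position);
    print(token_to_text(next_token));
    ++n_tokens;
}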
If you don't believe me then just download llama.cpp and see for yourself.
locallost 7 hours ago
I don't want to pretend I can explain LLMs, but the same "math" can be applied to visual and non-visual things. The dot product of two vectors, divided by the product of their lengths, gives you the cosine of the angle between them. This is true in 2 or 3 dimensions. But it's also true in 4, 5, 6...n dimensions even though we cannot visualize a 4d space. That it's an angle is relevant for you in the space you can comprehend, but for math or a machine it works in any number of dimensions. So it doesn't need to understand anything visually if the math checks out.
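To make the dot-product point concrete: dividing the dot product by the two vector lengths gives the cosine of the angle, and the formula does not care how many dimensions there are. A minimal sketch, standard library only (the function name `angle_between` is my own, not from any library):

```python
import math

def angle_between(u, v):
    """Angle in radians between two vectors of any (equal) dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

# Perpendicular vectors in 2 dimensions and in 5 dimensions.
print(math.degrees(angle_between([1, 0], [0, 1])))
print(math.degrees(angle_between([1, 0, 0, 0, 0], [0, 0, 0, 1, 0])))
```

The same function handles the 2-d and 5-d cases without any change, which is the whole point: nothing in the arithmetic depends on being able to picture the space.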
throw310822 8 hours ago
LLMs are modelled to predict the next token, and are indeed trained to do so on enormous bodies of text. But to be really good at predicting the next token (word) at the end of a long string of text, you must understand what the text means. If I give you the entire text of a long novel and at the end ask you a single "yes/no" question about the plot, you only need to emit a single token, but emitting the correct one implies having understood the plot of the novel. This is what LLMs do. They're generating meaningful, coherent text, which implies understanding and cognition at a level that is much deeper than that of the single token they generate at each forward pass. Internally, the LLM has learned to represent the meaning of the entire prompt text, the concepts it implies and its possible continuations far beyond the horizon of simply outputting the next token.
otabdeveloper4 3 hours ago
> This is what LLMs do. They're generating meaningful, coherent text
No, they generate grammatically coherent text. That is because human language grammars are fundamentally mathematical structures that can be approximated with matrix operations.
They don't generate meaningful text because they have no inherent knowledge of the world.
If you've used LLMs for any amount of time you've already noticed how often they get confused about numeric quantities - like confusing notions of "bigger than" and "less than" or being unable to count letters in words.
This is because any meaning in their output is only accidental.
yukIttEft 8 hours ago
> so the model figures out during training what each token should look for and what it should offer
But how does it learn this token-relationship?
All it has is many text samples, but nowhere do they say how the tokens relate to each other, so where does this information come from?
HarHarVeryFunny 2 hours ago
The model is just trying to map from sequence to next token. You could say that it doesn't really care about the relationships between words/tokens - it is just being trained to learn the best attention/etc weights to make this mapping as accurate as possible.
The model could just as well learn to predict next token from gibberish text as long as there were some statistical gibberish regularities to learn. However, if you train it on real meaningful text then the statistical regularities it needs to learn (and will, thanks to gradient descent, and the capable architecture) will be those reflecting "token relationships" - grammar, semantics, etc.
So, you can say the "token relationships" (incl word meanings) are reflected in the statistical regularities of the training data, and the model architecture and training algorithm are just very capable of learning those regularities whatever they may be.
You can consider it related to Word2Vec word embeddings, which are based on the idea that the meaning of words comes from how they are used. To a first approximation, that can be implemented by treating the meaning of a word as defined by the words it appears next to(!), which is what the Word2Vec embedding training algorithm does. Famous examples such as "(king - man) + woman = queen" suggest that this does in fact capture the meanings of words.
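The king/queen arithmetic can be shown with a toy sketch. Big caveat: the vectors below are hand-picked 2-d values I made up so the example is easy to follow (one axis for "royalty", one for "gender"). They are not trained Word2Vec embeddings, and real embeddings only satisfy the analogy approximately, via a nearest-neighbour search like the one here.

```python
import math

# Hand-picked 2-d vectors: (royalty, gender). Purely illustrative,
# NOT trained embeddings.
EMB = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c]))
    candidates = [w for w in EMB if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMB[w], target))

print(analogy("king", "man", "woman"))
```

Excluding the three input words from the candidates mirrors how the analogy is usually evaluated; without that step the nearest vector is often one of the inputs.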
inkysigma 8 hours ago
At a high level, the text samples are how the relationships are derived. If we treat text samples as sequences of tokens, then those sequences describe the joint distribution of which tokens occur together, and that distribution is what encodes the relationships between them. Iirc, this is related to the distributional hypothesis in NLP: the idea that the semantics of words should be similar if they occur in similar situations.
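A rough sketch of the distributional idea, assuming a made-up six-sentence corpus and a context window of one word on each side (both are my choices for illustration, not anything from the article). Each word is represented by counts of its neighbours, and words used in similar contexts end up with similar count vectors.

```python
import math
from collections import Counter, defaultdict

# Tiny invented corpus, just enough to show the effect.
CORPUS = [
    "the cat eats food",
    "the dog eats food",
    "the cat sleeps",
    "the dog sleeps",
    "the car needs fuel",
    "the truck needs fuel",
]

def context_counts(corpus):
    """For each word, count the words immediately to its left and right."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            for j in (i - 1, i + 1):
                if 0 <= j < len(tokens):
                    counts[tok][tokens[j]] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)  # Counter returns 0 if missing
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

COUNTS = context_counts(CORPUS)
print(round(cosine(COUNTS["cat"], COUNTS["dog"]), 3))
print(round(cosine(COUNTS["cat"], COUNTS["car"]), 3))
```

Nothing in the corpus says that cats and dogs are alike, yet "cat" scores closer to "dog" than to "car" purely because of which words surround them.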
MagicMoonlight 6 hours ago
If I handed you thousands of documents which said “Jan-Michael Vincent” all over them, would you need to understand who that is in order to notice the relationship there?
dist-epoch 6 hours ago
How does evolution learn the form-fitness relationship?
It's the same thing here, you randomly try various token-relationship values and the ones which are slightly better will be favoured.
melvinroest 10 hours ago
I thought Karpathy’s microgpt explains how LLMs work
disgruntledphd2 10 hours ago
Microgpt is really good, if you want to understand exactly what happens. I still thought that this article was a good, higher-level complement to that article though.
stalfie 9 hours ago
This article describes how Transformers work, but not really how LLMs work. Explaining the underlying architecture gives you about as much insight into how a modern LLM behaves as a breakdown of neuronal biochemistry and a few pathways does for the brain. Meaning, almost no insight at all.
rishbz 7 hours ago
Great insights. RL training is the key
lionkor 10 hours ago
It sucks that this article is clearly LLM edited, with common phrases like "same shape as", "the intuition: ", and the "tiny explainer" which clearly generalized from a prompt accidentally.
Good article, but when sharing it I will have to preface "yes it's slop, but it's a good explanation".
Absolutely embarrassing that the author didn't catch that these LLM-isms are a (and here I'll use one) bad signal.
In fact, I would go so far as to say that publishing in this style stems from a lack of reading experience and writing experience, which does not bode well for someone pretending to be an expert. I gave this article to someone highly intelligent who doesn't know the first thing about how LLMs work internally, and she immediately called out that it reads like AI text.
Ampersander 8 hours ago
You're not supposed to read it, just like you're not supposed to write anything anymore. Claude can read and write more than any human. We just lean back and relax now.
janalsncm 9 hours ago
I don’t think it’s absolutely embarrassing. First of all, the point of the author writing at all is to aid understanding, not produce prose. So from that standpoint, what would be embarrassing would be to include incorrect facts that suggest a fundamental misunderstanding of the topic.
From my read, it is fine. The brief history of LLMs is complicated since every single component has papers introducing enhancements. So it’s easy to ignore them or get bogged down with details.
The author appears to be a security researcher learning about LLMs for the purpose of defending against common attacks. So this piece is that person giving themselves a crash course on the topic. The fact that they cleaned up their notes with an LLM is frankly completely irrelevant.
spacebacon 10 hours ago
But how do they “think”? This is the only repo that can tell you that.
mathisdev7 3 hours ago
very interesting and useful!
aabdi 11 hours ago
this is hard to read...
it goes all over the place.
i'm not actually sure who your target audience is.
there's too many side tangents.
just like, structure it plz.
1. customer feels bad cuz they don't understand how llms work
2. provide high level abstracted explanation (don't dive into concepts yet)
3. provide breakdown guide of overall set of components.
4. walk through each component. don't side track. no need to explain RoPE, GQA etc... it just distracts.
i.e. customers don't know how llms work, leading them to feel bad about their own intelligence.
at a high level llms take in words, do some math on them, and then produce words, one by one.
inside llms have these different components. we walk through them step by step.
1. tokenizer
2. embedding
3. attention
4. heads
5. ffn
6. sampling
## tokenizer
barrenko 10 hours ago
It's just slop.
lhd1 13 hours ago
I find it difficult to engage with AI-generated text. What am I getting here that I couldn't get from a chatbot?
blackoil 13 hours ago
Hopefully someone has asked the right questions and removed confusing answers/hallucinations.
dialsMavis 13 hours ago
Is this text generated by AI? I couldn't tell but I'd believe it if it was.
I imagine if resources were spent writing this text then one benefit of using it is not using more resources or causing the pollution that comes from a chatbot.
zemo 12 hours ago
normal people talk and write with some notion of meter, the cadence of communicating where pauses are inserted at places that naturally suit the speaker (and listener) to pause for thought. LLMs don't really do that, they just write a bunch of sentences.
> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. One neuron might activate strongly on Eiffel-Tower-related text. Another on programming languages. Another on past-tense verbs.
People don't really write like this and they don't really talk like this (and no, people don't necessarily write exactly how they talk because they don't read exactly how they listen; the written word can be backtracked while the heard cannot, and speakers/writers know this, either consciously or unconsciously). A person would probably structure this more like:
> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. For example, there could be one neuron that activates strongly on Eiffel-Tower-related text, another that activates strongly on programming languages, a third neuron activating on past-tense verbs, and so on.
Usually people wouldn't write "Another on programming languages." as a standalone sentence like that because the periods introduce an unnatural pause like they're giving a TED talk, unless of course they were punctuating that way for effect, but you'd essentially never communicate with that effect full time.
mattnewton 11 hours ago
rippeltippel 13 hours ago
The voice of several passages resembles ChatGPT very closely.
cubefox 10 hours ago
We are living in a crazy science fiction world where on the top of the HN frontpage there is an article on how LLMs work which is likely itself LLM generated, and the only way to tell is its writing style rather than its factual accuracy.
lateral_cloud 10 hours ago
I don't understand how these AI written articles get so many votes.
alansaber 3 hours ago
There is a very high volume of them being posted every day, and they are a significant % of the total. Also, writing is hard, LLM articles can be slop whilst also being better written than average.
singpolyma3 14 hours ago
Next do "why LLMs work"
inkysigma 8 hours ago
This is essentially an open research question. ML theory is unfortunately very weak relative to where the empirics are. I think there's a relatively optimistic paper that was posted a while back here but I would also take it with a grain of salt.
https://arxiv.org/abs/2604.21691
There are of course empirical results and relatively weak theoretical results like the UAT, but I also don't think that answers your question fully, especially since it seems impossible to definitively answer questions that the industry seems to be betting on, like whether or not there is a lower bound to their error rate or whether hallucination as a problem can be solved. We have much stronger ideas of what linear regression is doing relative to what LLMs are doing.
krackers 12 hours ago
See Tegmark's "Why does deep and cheap learning work so well?" (well, not so cheap anymore...)
https://www.youtube.com/watch?v=5MdSE-N0bxs is remarkably prescient given that it was written before LLMs
sheeshkebab 14 hours ago
considering they work with any architecture/configuration given enough compute, just more or less efficiently - then maybe it's fundamental, in the same sense as why electricity works...
soupspaces 14 hours ago
Universal approximation theorem, embeddings, self-attention, gradient descent. And empirically, scaling laws.
qsera 6 hours ago
Because there are patterns everywhere!
skydhash 13 hours ago
Why does linear regression work? Why do computers work? Because it's about math and encoding information. If we can encode words as numbers, then why can't we encode their order as a relation? It's just that neural networks are very apt at finding that relation even if it's noisy.
whateveracct 12 hours ago
accidentally quadratic
codeakki 12 hours ago
What's the point of this? I'm not here to engage with AI bots