From tokens to thoughts: How LLMs and humans trade compression for meaning (arxiv.org)

117 points by ggirelli a day ago

valine a day ago

>> For each LLM, we extract static, token-level embeddings from its input embedding layer (the 'E' matrix). This choice aligns our analysis with the context-free nature of stimuli typical in human categorization experiments, ensuring a comparable representational basis.

They're analyzing input embedding layers, not LLMs. I'm not sure how the authors justify making claims about the inner workings of LLMs when they haven't actually computed a forward pass. The E matrix is not an LLM, it's a lookup table.

Just to highlight the ridiculousness of this research: no attention was computed! Not a single dot product between keys and queries. All of their conclusions are drawn from the output of an embedding lookup table.

The figure correlating their alignment score with model size is particularly egregious. Model size is meaningless when you never activate any model parameters. If BERT is outperforming Qwen and Gemma, something is wrong with your methodology.
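
To make the objection concrete, here is a minimal sketch (assuming the Hugging Face transformers API and GPT-2 as a stand-in model; the paper's exact models and pipeline may differ) of the difference between a static lookup in the E matrix and a real forward pass through the transformer:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")
    ids = tok("robin", return_tensors="pt").input_ids

    # Static, context-free lookup: indexing one row of the E matrix, no attention.
    E = model.get_input_embeddings().weight      # shape: (vocab_size, hidden_dim)
    static_vec = E[ids[0, 0]]

    # Contextual representation: an actual forward pass, attention included.
    with torch.no_grad():
        contextual_vec = model(ids).last_hidden_state[0, 0]

    print(torch.cosine_similarity(static_vec, contextual_vec, dim=0))

The critique is that only the first of these two vectors was ever computed.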

blackbear_ a day ago

Note that the token embeddings are also trained, so their values do give some hints about how a model organizes information.

They used token embeddings directly, rather than intermediate representations, because the latter depend on the specific sentence the model is processing. The human judgment data, however, was collected without any context surrounding each word, so using the token embeddings seems to be the fairest comparison.

Otherwise, what sentence(s) would you have used to compute the intermediate representations? And how would you make sure that the results aren't biased by these sentences?

navar a day ago

You can process a single word through a transformer and get the corresponding intermediate representations.

Though it sounds odd, there is no problem with it; it would indeed return the model's representation of that single word, without any additional context.
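
For instance, a rough sketch (assuming bert-base-uncased from Hugging Face transformers; the model choice is illustrative, not the paper's): a single word run through the full network yields a representation at every layer, with no surrounding sentence.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tok("robin", return_tensors="pt")   # just [CLS] robin [SEP]
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states[0] is the embedding layer; the rest are the transformer layers.
    for layer, h in enumerate(out.hidden_states):
        word_vec = h[0, 1]                       # the position of "robin"
        print(layer, word_vec.shape)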

valine a day ago

Embedding layers are not always trained with the rest of the model. That's the whole idea behind vision-language models: first-layer embeddings are so interchangeable that you can literally feed in the output of other models through linear projection layers.

And, as the other commenter said, you can absolutely feed single tokens through the model. Regardless, your point doesn't really hold up: how about priming the model with "You're a helpful assistant", just like everyone else does?
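
A toy sketch of the interchangeability point (modeled loosely on the adapter design of vision-language models such as LLaVA; the dimensions here are made up for illustration): embeddings from one model are mapped into another model's input embedding space by a learned linear projection.

    import torch
    import torch.nn as nn

    vision_dim, llm_dim = 1024, 4096     # hypothetical encoder / LLM widths
    projector = nn.Linear(vision_dim, llm_dim)

    # e.g. 256 patch embeddings from a frozen vision encoder
    image_features = torch.randn(1, 256, vision_dim)

    # The projected vectors now have the shape of LLM token embeddings and can be
    # concatenated with the LLM's own embeddings and fed through the transformer.
    soft_tokens = projector(image_features)
    print(soft_tokens.shape)             # torch.Size([1, 256, 4096])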

boroboro4 a day ago

It’s mind-blowing that LeCun is listed as one of the authors.

I would expect model size to correlate with alignment score because model size usually correlates with hidden dimension. But the opposite can also be true: bigger models might shift more of the basic token-classification logic into later layers, and hence embedding alignment can go down. Regardless, this feels like pretty useless research…

danielbln a day ago

Leaves a bit of a bad taste, considering LeCun's famously critical stance on auto-regressive transformer LLMs.

throwawaymaths a day ago

The LLM is also a lookup table! But your point is correct: they should have looked at subsequent layers that aggregate information over distance.

johnnyApplePRNG a day ago

This paper is interesting, but ultimately it's just restating that LLMs are statistical tools and not cognitive systems. The information-theoretic framing doesn’t really change that.

Nevermark a day ago

> LLMs are statistical tools and not cognitive systems

I have never understood broad statements that models are just (or mostly) statistical tools.

Certainly statistics apply: minimizing mismatches yields predictions that approximate the mean (or a similar measure) of the targets.

But a model's architecture is the difference between compressed statistics and forcing the model to transform information in a highly organized way, one that reflects the actual shape of the problem, in order to get any accuracy at all.

In both cases statistics are relevant, but in the latter they are not a particularly insightful way to talk about what the model has learned.

Statistical accuracy, prediction, etc. are the basic problems to solve, the training criteria being optimized. But they don't limit the nature of the solutions: they leave both problem difficulty and solution sophistication unbounded.

andoando a day ago

Am I the only one who is lost on how the calculations are made?

From what I can tell, this is limited in scope to categorizing nouns (robin is a bird).
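
For a rough feel of the setup (this is not the paper's actual information-theoretic metric, just a hedged sketch of the kind of comparison at stake, with bert-base-uncased as an arbitrary model choice): compare how close noun embeddings sit to a category label in the static input-embedding space.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    E = model.get_input_embeddings().weight

    def static_vec(word):
        # Average the subword rows of E if the word splits into several pieces.
        ids = tok(word, add_special_tokens=False).input_ids
        return E[ids].mean(dim=0)

    bird = static_vec("bird")
    for noun in ["robin", "penguin", "hammer"]:
        sim = torch.cosine_similarity(static_vec(noun), bird, dim=0)
        print(f"{noun:8s} vs bird: {sim.item():.3f}")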

fusionadvocate a day ago

Open a bank account. Open your heart. Open a can. Open to new experiences.

Words are a tricky thing to handle.

bluefirebrand a day ago

And that is just in English

Other languages have similar but fundamentally different oddities which do not translate cleanly

suddenlybananas a day ago

Not sure how they're fundamentally different. What do you mean?

falcor84 a day ago

I agree in general, but I think that "open" is actually a pretty straightforward word.

As I see it, "Open your heart", "Open a can" and "Open to new experiences" have very similar meanings for "Open", being essentially "make a container available for external I/O", similar to the definition of an "open system" in thermodynamics. "Open a bank account" is a bit different, as it creates an entity that didn't exist before, but even then the focus is on having something that allows for external I/O - in this case deposits and withdrawals.

esafak a day ago

And models since BERT and ELMo capture polysemy!

https://aclanthology.org/2020.blackboxnlp-1.15/
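
A small sketch of the polysemy point (assuming bert-base-uncased; the linked paper evaluates this much more carefully): the static embedding of "open" is one fixed vector, but its contextual embedding shifts with the sentence.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def open_vec(sentence):
        enc = tok(sentence, return_tensors="pt")
        # Find the position of the "open" token in this sentence.
        idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("open"))
        with torch.no_grad():
            return model(**enc).last_hidden_state[0, idx]

    a = open_vec("open a bank account")
    b = open_vec("open a can of soup")
    print(torch.cosine_similarity(a, b, dim=0).item())  # similar, but not identical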

an0malous a day ago

OpenAI agrees

catchnear4321 21 hours ago

incomplete inaccurate off misleading meandering not quite generation prediction removal of superfluous fast but spiky

this isn’t talking about that.

xwat 17 hours ago

Stochastic parrots