Chomsky and the Two Cultures of Statistical Learning (norvig.com)

50 points by atomicnature 5 days ago

intalentive 5 hours ago

This essay is missing the words “cause” and “causal”. There is a difference between discovering causes and fitting curves. The search for causes guides the design of experiments, and with luck, the derivation of formulae that describe the causes. Norvig seems to be mistaking the map (data, models) for the territory (causal reality).

gsf_emergency_6 4 hours ago

A related* essay (2010) by a statistician on the goals of statistical modelling that I've been procrastinating on:

https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf

To Explain Or To Predict?

Nice quote

We note that the practice in applied research of concluding that a model with a higher predictive validity is “truer,” is not a valid inference. This paper shows that a parsimonious but less true model can have a higher predictive validity than a truer but less parsimonious model.

Hagerty+Srinivasan (1991)

*like TFA it's a sorta review of Breiman

tripletao 2 hours ago

This essay frequently uses the word "insight", and its primary topic is whether an empirically fitted statistical model can provide that (with Norvig arguing for yes, in my opinion convincingly). How does that differ from your concept of a "cause"?

musicale 2 hours ago

> I agree that it can be difficult to make sense of a model containing billions of parameters. Certainly a human can't understand such a model by inspecting the values of each parameter individually. But one can gain insight by examing (sic) the properties of the model—where it succeeds and fails, how well it learns as a function of data, etc.

Unfortunately, studying the behavior of a system doesn't necessarily provide insight into why it behaves that way; it may not even provide a good predictive model.

bo1024 4 hours ago

Is this essay from 2011?

barrenko 3 hours ago

Is this bayesian vs. frequentist?

tgv an hour ago

In one word: no.

In more detail: Chomsky is/was not concerned with the models themselves, but with the distinction between statistical modelling in general (and "clean slate" models in particular) on the one hand, and structural models discovered through human insight on the other.

By "clean slate" I mean models that start with as little linguistically informed structure as possible. E.g., Norvig mentions hybrid models: these can start out as classical rule-based models whose probabilities are then learnt. A randomly initialized neural network would be as clean as possible.

tripletao 3 hours ago

Here's Chomsky quoted in the article, from 1969:

> But it must be recognized that the notion of "probability of a sentence" is an entirely useless one, under any known interpretation of this term.

He was impressively early to the concept, but I think even those skeptical of the ultimate value of LLMs must agree that his position has aged terribly. If he couldn't imagine any framework in which a novel sentence had a probability other than zero, that seems to have been a fundamental theoretical failing rather than a limit of the computation available at the time.

I guess that position hasn't aged worse than his judgment of the Khmer Rouge (or Hugo Chavez, or Epstein, or ...) though. There's a cult of personality around Chomsky that's in no way justified by any scientific, political, or other achievements that I can see.

thomassmith65 2 hours ago

I agree that Chomsky's influence, especially in this century, has done more harm than good.

There's no point minimizing his intelligence and achievements, though.

His linguistics work (eg: grammars) is still relevant in computer science, and his cynical view of the West has merit in moderation.

tripletao an hour ago

If Chomsky were known only as a mathematician and computer scientist, then my view of him would be favorable for the reasons you note. His formal grammars are good models for languages that machines can easily use, and that many humans can use with modest effort (i.e., computer programming languages).

The problem is that they're weak models for the languages that humans prefer to use with each other (i.e., natural languages). He seems to have convinced enough academic linguists otherwise to doom most of that field to uselessness for his entire working life, while the useful approach moved to the CS department as NLP.

As to politics, I don't think it's hard to find critics of the West's atrocities with less history of denying or excusing the West's enemies' atrocities. He's certainly not always wrong, but he's a net unfortunate choice of figurehead.

techsystems 43 minutes ago

He did say 'any known' back in 1969, though, so judging the claim against what's known today isn't a fair measure of how the idea has aged.

tripletao 18 minutes ago

Shannon first proposed Markov processes to generate natural language in 1948. That's inadequate for the reasons discussed extensively in this essay, but it seems like a pretty significant hint that methods beyond simply counting n-grams in the corpus could output useful probabilities.
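To make the reference concrete: a minimal sketch (not from the essay, and far simpler than what Shannon described) of a bigram Markov generator trained by counting word successors in a toy corpus. The corpus and function names here are illustrative, not anything Shannon or Norvig used.

```python
import random
from collections import defaultdict

def train_bigrams(words):
    """Count bigram successors: a Shannon-style Markov model of text.
    Repeated successors are kept, so sampling reflects their frequency."""
    successors = defaultdict(list)
    for a, b in zip(words, words[1:]):
        successors[a].append(b)
    return successors

def generate(successors, start, n=10, seed=0):
    """Walk the chain, sampling each next word from observed successors.
    Stops early if a word with no observed successor is reached."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n - 1):
        nxt = successors.get(out[-1])
        if not nxt:
            break
        out.append(rng.choice(nxt))
    return out

# Toy corpus for illustration only.
corpus = "the cat sat on the mat and the dog sat on the log".split()
model = train_bigrams(corpus)
print(" ".join(generate(model, "the")))
```

The point of the sketch is only that counting alone already yields a (crude) probability distribution over word sequences; the essay's discussion is about why smoothing and richer models are needed beyond this.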

In any case, do you see evidence that Chomsky changed his view? The quote from 2011 ("some successes, but a lot of failures") is softer but still quite negative.

dleeftink 2 hours ago

> novel sentence

The question then becomes one of actual novelty versus the learned joint probabilities of internalised sentences/phrases/etc.

Generation or regurgitation? Is there a difference to begin with..?

tripletao 38 minutes ago

I'm not sure what you mean? As the length of a sequence increases (from word to n-gram to sentence to paragraph to ...), the probability that it actually ever appeared (in any corpus, whether that's a training set on disk, or every word ever spoken by any human even if not recorded, or anything else) quickly goes to exactly zero. That makes it computationally useless.

If we define perplexity in the usual way in NLP, then the sequence probability still approaches zero as the length increases, but it does so smoothly and never reaches exactly zero; the per-token perplexity stays bounded, which makes it useful for sequences of arbitrary length. This latter metric seems so obviously better that it seems ridiculous to me to reject all statistical approaches based on the former. I say that with the benefit of hindsight; but enough of Chomsky's less famous contemporaries judged correctly that I get that benefit, that LLMs exist, etc.
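The contrast the comment draws can be shown numerically. A minimal sketch, assuming a toy model that assigns every token probability 0.1: the joint probability underflows to zero as the sequence grows, while perplexity (exp of the negative average log-probability per token) stays constant.

```python
import math

def sequence_log_prob(token_log_probs):
    """Joint log-probability of a sequence: the sum of
    per-token (conditional) log-probabilities."""
    return sum(token_log_probs)

def perplexity(token_log_probs):
    """Perplexity = exp(-average log-prob per token).
    Bounded as the sequence grows, unlike the raw joint probability."""
    n = len(token_log_probs)
    return math.exp(-sequence_log_prob(token_log_probs) / n)

# Toy model (an assumption for illustration): every token gets p = 0.1.
lp = math.log(0.1)
for n in (10, 100, 1000):
    joint = math.exp(sequence_log_prob([lp] * n))
    print(n, joint, perplexity([lp] * n))
```

By n = 1000 the joint probability has underflowed to 0.0 in double precision, while the perplexity remains 10 throughout: the "probability of a sentence" is useless as a raw number but perfectly meaningful per token.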

agumonkey 2 hours ago

wasn't his grammar classification revolutionary at the time? it seems it influenced parsing theory later on

eru 2 hours ago

His grammar classification is really useful for formal grammars of formal languages. Like what computers and programming languages do.

It's of rather limited use for natural languages.
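For a concrete sense of why the hierarchy matters for formal languages: a minimal sketch (my own illustration, not from the thread) of a recursive-descent recognizer for the context-free grammar S -> 'a' S 'b' | 'a' 'b', which generates a^n b^n, the textbook language that no regular (finite-state) grammar can capture.

```python
def match_S(s, i=0):
    """Recursive descent for the CFG  S -> 'a' S 'b' | 'a' 'b'.
    Returns the index just past the matched span, or None on failure."""
    if i < len(s) and s[i] == "a":
        # Try the recursive production S -> a S b.
        j = match_S(s, i + 1)
        if j is not None and j < len(s) and s[j] == "b":
            return j + 1
        # Fall back to the base production S -> a b.
        if i + 1 < len(s) and s[i + 1] == "b":
            return i + 2
    return None

def accepts(s):
    """True iff the whole string is derivable from S."""
    return match_S(s) == len(s)
```

The recursion depth tracks the nesting, which is exactly the unbounded "memory" a finite-state machine lacks; this is why the classification works so well for programming languages and parser design.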

templar_snow 5 hours ago

Dude is literally in the Epstein Files.

pmkary an hour ago

Dude would talk about manufacturing consent, elitist circles, and what Israel is doing to poor Palestinians, and then board the private jet of Epstein, the Israeli-spy, super-elitist, consent-manufacturing, sex-trafficking rapist. What a total insult to everyone who ever read his things.

Epa095 5 hours ago

And?

The article by Peter Norvig is still interesting.

edm0nd 5 hours ago

it's just kinda weird and sus.

honestly, I'm surprised Noam is even still alive (aged 97), he is not long for this world and will be gone very soon.

poaching 5 hours ago

Who else in tech/AI did they whale?

mmooss 4 hours ago

Are you implying Norvig is a victim or otherwise not responsible for their choices and actions?