Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal (blog.jcz.dev)
272 points by alphabetting 5 days ago
orbital-decay 13 hours ago
The baked-in assumptions observation is basically the opposite of the impression I get after watching Gemini 3's CoT. With the maximum reasoning effort it's able to break out of the wrong route by rethinking the strategy. For example I gave it an onion address without the .onion part, and told it to figure out what this string means. All reasoning models including Gemini 2.5 and 3 assume it's a puzzle or a cipher (because they're trained on those) and start endlessly applying different algorithms to no avail. Gemini 3 Pro is the only model that can break the initial assumption after running out of ideas ("Wait, the user said it's just a string, what if it's NOT obfuscated"), and correctly identify the string as an onion address. My guess is they trained it on simulations to enforce the anti-jailbreaking commands injected by the Model Armor, as its CoT is incredibly paranoid at times. I could be wrong, of course.
jug 10 hours ago
I've had some weird "thinking outside the box" behavior like this. I once asked 3 Pro what Ozzy Osbourne is up to. The CoT was a journey, I can tell you! It's not in its training data that he actually passed away. It did know he was planning a tour though. It had a real struggle trying to reconcile "suspicious search results" and even questioned whether it was fake news, or whether it was running inside a simulation (!), before determining it wasn't going to fall for my "test".
It did ultimately decide Ozzy was alive. I pushed back on that, and it instantly corrected itself and partially blamed my query "what is he up to" for being formulated as if he was alive.
Wowfunhappy 6 hours ago
Odd, mine didn't do anything interesting.
bbondo 16 hours ago
1.88 billion tokens * $12 / 1M tokens (output) suggests a total cost of $22,560 to solve the game with Gemini 3 Pro?
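A quick sanity check of that arithmetic, as a minimal Python sketch. The 1.88 billion token count and the $12 per 1M output tokens rate are the figures quoted in this comment; billing every token at the output rate is the comment's own simplification, and the function name is made up for illustration.

```python
def estimated_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Flat-rate estimate: every token is billed at the same per-million price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Figures taken from the comment above; the flat output rate is an assumption.
total_tokens = 1_880_000_000
output_rate_usd = 12.0

cost = estimated_cost_usd(total_tokens, output_rate_usd)
print(f"${cost:,.0f}")  # prints "$22,560"
```

A real bill would presumably split input and output tokens at different rates, so this is best read as a rough estimate rather than an exact figure.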
elephanlemon 16 hours ago
“Gemini 3 Pro was often overloaded, which produced long spans of downtime that 2.5 Pro experienced much less often”
I was unclear if this meant that the API was overloaded or if he was on a subscription plan and had hit his limit for the moment. Although I think that the Gemini plans just use weekly limits, so I guess it must be API.
mkoubaa 16 hours ago
I can't believe how massively underpaid I was when I was 11
re-thc 16 hours ago
Do you hallucinate as a kid?
foundddit 15 hours ago
nomel 14 hours ago
mkoubaa 13 hours ago
anal_reactor 10 hours ago
brianwawok 16 hours ago
True, though I bet the $200 a month plan could do it, maybe with a few extra days of downtime when the quota was maxed.
AstroBen 15 hours ago
For how long would it stay $200 if you can rack up 5 figures of usage...
manmal 15 hours ago
jchw 6 hours ago
This is exactly why I upgraded to the Pixel 10 Pro. On Black Friday, you could get a Pixel 10 Pro for about $450 on the U.S. Google Fi store (which sells unlocked phones)... which is also about how much a Pixel 9 Pro goes for on eBay; minus eBay fees and accounting for shipping, that's an upgrade for < $100. But it's even better: the Pixel 10 Pro comes with a year of their "AI Pro" plan (which I believe costs around $240/year). There is really, really no point in upgrading to a Pixel 10 Pro from a Pixel 9 Pro, and environmentally it pains me to be the person upgrading my phone on an annual basis (this is the fastest I've ever upgraded a phone, ever), but it's hard to turn down when Google is selling $800~ish for $400~ish.
And yeah, it's not the insanely priced AI Ultra plan, but if there are any hard limits on Gemini Pro usage I haven't found them. I have played a lot with really long Antigravity sessions to try to figure out what this thing is good for, and it seems like it will pretty much sit there and run all day. (And I can't really blame anyone for still remaining mad about AI to be completely honest, but the technology is too neat by this point to just completely ignore it.)
Seeing as Google is still giving away a bunch of free access, I'm guessing they're still in the ultra-cash-burning phase of things. My hope (hopium, realistically) is that by the time all of the cash burning is over, there will be open-weight local models that are striking near where Gemini 3 Pro strikes today. It doesn't have to be as good, getting nearby on hardware consumers can afford would be awesome.
But I'm not holding my breath, so let's hope the cash burning continues for a few years.
(There is, of course, the other way to look at it, which is that looking at the pricing per token may not tell the whole story. Given that Google is running their own data centers, it's possible the economic proposition isn't as bad as it looks. OTOH, it's also possible it is worse than it looks, if they happen to be selling tokens at a loss... but I quite doubt it, given they are currently SOTA and can charge a premium.)
addaon 13 hours ago
To beat it, not to solve it. Solving means something very specific in the context of games — deriving and proving a GTO strategy.
echelon 5 hours ago
Who is paying for this?
Did the streamer get subsidized by Google?
(The stream isn't run by Google themselves, is it?)
emp17344 2 hours ago
If you go to the X page linked on the blog, the page owner mentions a “collaboration” with Google Deepmind on this project. It wouldn’t shock me if this is just an elaborate advertisement for Gemini.
ogogmad 16 hours ago
:/ Damn. That needs to cost 1000x less before people can try it on their own games.
someperson 15 hours ago
That's an extrapolation to finish the entire game.
If you limit your token count to a fraction of 2 billion tokens, you can try it on your own game, and of course have it complete a correspondingly smaller fraction of the game.
oceansky 17 hours ago
"Crucially, it tells the agent not to rely on its internal training data (which might be hallucinated or refer to a different version of the game) but to ground its knowledge in what it observes. "
Does this even have any effect?
ragibson 17 hours ago
Yes, at least to some extent. The author mentions that the base model knows the answer to the switch puzzle but does not execute it properly here.
"It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery."
hypron 16 hours ago
My issue with this is that the LLM could just be roleplaying that it doesn't know.
jdiff 16 hours ago
stavros 7 hours ago
brianwawok 16 hours ago
tootyskooty 17 hours ago
I'm wondering about this too. Would be nice to see an ablation here, or at least see some analysis on the reasoning traces.
It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.
Workaccount2 16 hours ago
The model probably recognizes the need for a grassroots effort to solve the problem, to "show its work".
raincole 16 hours ago
It will definitely have some effect. Why won't it? Even adding noise into prompts (like saying you will be rewarded $1000 for each correct answer) has some effect.
Whether the 'effect' is something implied by the prompt, or even something we can understand, is a totally different question.
blibble 17 hours ago
I very much doubt it
baby 16 hours ago
Do we have examples of this in prompts in other contexts?
elif 14 hours ago
I would imagine that prompting anything like this will have an excessively ironic effect like convincing it to suppress patterns which it would consider to be pre-knowledge.
If you looked inside they would be spinning on something like "oh I know this is the tile to walk on, but I have to only rely on what I observe! I will do another task instead to satisfy my conditions and not reveal that I have pre-knowledge."
LLMs are literal douche genies. The less you say, generally, the better
astrange 16 hours ago
If they trained the model to respond to that, then it can respond to that, otherwise it can't necessarily.
oceansky 16 hours ago
I think you've got a point here. These companies are injecting a lot of datasets every day into it.
astrange 6 hours ago
mkoubaa 16 hours ago
It might get things wrong on purpose, but deep down it knows what it's doing
soulofmischief 17 hours ago
Nice writeup! I need to start blogging about my antics. I rigged up several cutting edge small local models to an emulator all in-browser and unsuccessfully tried to get them to play different Pokémon games. They just weren't as sharp as the frontier models.
This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.
giancarlostoro 17 hours ago
I have to think they need to know enough of the guides for the game for it to work out. How do they know what's on screen?
soulofmischief 16 hours ago
In my project I rigged up an in-browser emulator and directly fed captured images of the screen to local multimodal models.
So it just looks right at what's going on, writes a description for refinement, and uses all of that to create and manage goals, write to a scratchpad and submit input. It's minimal scaffolding because I wanted to see what these raw models are capable of. Kind of a benchmark.
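Roughly, that loop can be sketched in Python as below. This is only an illustration of the shape described in this comment, not the actual project code: every name here (`AgentState`, `run_agent`, and the `capture_screen`, `describe`, `decide`, and `press` callables) is hypothetical, and the model calls are replaced by trivial stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """Everything the agent carries between steps (hypothetical layout)."""
    goals: List[str] = field(default_factory=list)
    scratchpad: List[str] = field(default_factory=list)


def run_agent(
    capture_screen: Callable[[], str],
    describe: Callable[[str], str],
    decide: Callable[[str, AgentState], str],
    press: Callable[[str], None],
    steps: int,
) -> AgentState:
    """Screen capture -> description -> goal/scratchpad update -> button press."""
    state = AgentState()
    for _ in range(steps):
        frame = capture_screen()             # emulator frame (a string stand-in here)
        description = describe(frame)        # multimodal model would describe the frame
        button = decide(description, state)  # model picks an input, may update goals
        state.scratchpad.append(f"{description} -> {button}")
        press(button)                        # submit the input to the emulator
    return state


# Stand-ins for the emulator and the models, purely for demonstration.
frames = iter(["title screen", "dialogue box", "overworld"])
pressed: List[str] = []


def fake_decide(description: str, state: AgentState) -> str:
    if not state.goals:
        state.goals.append("start the game")
    return "START" if description == "title screen" else "A"


final_state = run_agent(
    capture_screen=lambda: next(frames),
    describe=lambda frame: frame,
    decide=fake_decide,
    press=pressed.append,
    steps=3,
)
print(pressed)  # ['START', 'A', 'A']
```

Passing the emulator and model pieces in as plain callables keeps the scaffolding minimal, which matches the stated aim of seeing what the raw models can do.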
giancarlostoro 14 hours ago
krige 2 hours ago
As a fun comparison, Gemini 3 Pro took 17 days to beat the game. Twitch Plays Pokemon, which was frequently random, chaotic, even malicious, took 13 days to clear Crystal.
cg5280 15 hours ago
I like the inclusion of the graph at the end to compare progress. It would be cool to compare this directly to competing models (Claude, GPT, etc).
kqr 15 hours ago
It would unfortunately also need several runs of each to be reliable. There's nothing in TFA to indicate the results shown aren't to a large degree affected by random chance!
(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
casey2 15 hours ago
TFA says multiple times that the results are affected by random chance
kqr 15 hours ago
squimmy26 16 hours ago
How certain can we be that these improvements aren't just a result of Gemini 3 Pro pre-training on endless internet writeups of where 2.5 has struggled (and almost certainly what a human would have done instead)?
In other words, how much of this improvement is true generalization vs memorization?
zurfer 16 hours ago
You're too kind. Even the CEO of Google retweeted how well Gemini 2.5 did on Pokemon. There is a high chance that now it's explicitly part of the training regime. We kind of need a different kind of game to know how well it generalizes.
kqr 14 hours ago
I have a draft doing this with text adventures: https://entropicthoughts.com/updated-llm-benchmark
prmoustache 13 hours ago
Isn't that the point of a new model anyway?
DANmode 2 hours ago
Yes. Sort of.
Just don’t confuse it with a random benchmark!
dash2 6 hours ago
> it often makes early assumptions and fails to validate them, which can waste a lot of time
Is this baked into how the models are built? A model outputs a bunch of tokens, then reads them back and treats them as the existing "state" which has to be built on. So if the model has earlier said (or acted like) a given assumption is true, then it is going to assume "oh, I said that, it must be the case". Presumably one reason that hacks like "Wait..." exist is to work around this problem.
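As a toy illustration of the mechanism being asked about (not a claim about how Gemini behaves internally), here is a minimal Python sketch of an autoregressive loop in which each output is appended to the context and conditions the next step. The `generate` and `toy_next_token` names and the rule table are invented for this example.

```python
from typing import Callable, List


def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int,
) -> List[str]:
    """Autoregressive loop: every emitted token becomes part of the input."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        context.append(next_token(context))
    return context


def toy_next_token(context: List[str]) -> str:
    # Hypothetical rule table: an assumption already in the context keeps
    # being built on unless an explicit "wait" token interrupts it.
    if "wait" in context:
        return "plain_string"
    if "assume_cipher" in context:
        return "try_decoding"
    return "assume_cipher"


stuck = generate(toy_next_token, ["mystery_string"], 3)
print(stuck)
# ['mystery_string', 'assume_cipher', 'try_decoding', 'try_decoding']

unstuck = generate(toy_next_token, ["mystery_string", "assume_cipher", "wait"], 1)
print(unstuck)
# ['mystery_string', 'assume_cipher', 'wait', 'plain_string']
```

In this toy, the early `assume_cipher` token is never revisited on its own; only the injected `wait` token changes course, which is the kind of workaround the comment is speculating about.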
topaz0 5 hours ago
Who do I have to talk to to get somebody to pay me thousands of dollars to beat a game from the 90s?
sussmannbaka 14 hours ago
So after years of being gleefully told that AI will replace all jobs, an omniscient state-of-the-art model, with heavy assistance, takes more than two weeks and thousands of dollars in tokens to do what child me did in a few days? Huh.
rybosome 13 hours ago
“And, because AI never got any better or any cheaper after that point, sussmannbaka’s wry observation remained true in perpetuity, forever.”
- History, most likely
ehnto 4 hours ago
We will only be able to see where the economic chips lie once all the current money games shake out. It's all a bit obfuscated at the moment.
mchusma 11 hours ago
Cost per intelligence is shrinking by something like 100x per year. Even the Gemini flash release would potentially do as well for 1/5th of the cost already.
dwaltrip 13 hours ago
Children are incredibly smart. All of this was fantasy 15 years ago. Comments like yours are amazing to me…
TulliusCicero a few seconds ago
True AI is whatever hasn't been invented yet.
murukesh_s 14 hours ago
I used to think the same until the latest agents started adding perfectly fine features to a large existing React app with just basic input (in English). Most of the jobs require levels of intelligence below that. It's just a matter of time before agents get to that.
blauditore 14 hours ago
It's about the complexity of the task. Front end apps tend to be much less complex and more boilerplate-y than backends, hence AI tends to work better.
murukesh_s 13 hours ago
ehnto 4 hours ago
etse 14 hours ago
ribosometronome 13 hours ago
jwrallie 17 hours ago
Having been through the game recently, I am not surprised Goldenrod Underground was a challenge. It is very confusing, and even though I solved it through trial and error, I still don't know what I did. Olivine Lighthouse is the real surprise, as it felt quite obvious to me.
wild_pointer 17 hours ago
I wonder how much of it is due to the model being familiar with the game or parts of it, be it due to training on the game itself, or reading/watching walkthroughs online.
andrepd 17 hours ago
There was a well-publicised "Claude plays Pokémon" stream where Claude failed to complete Pokemon Blue in spectacular fashion, despite weeks of trying. I think only a very gullible person would assume that future LLMs didn't specifically bake this into their training, as they do for popular benchmarks or for penguins riding a bike.
dwaltrip 12 hours ago
If they game the pelican benchmark, it’d be pretty obvious.
Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.
If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.
ctoth 14 hours ago
> as they do for popular benchmarks or for penguins riding a bike.
Citation?
criley2 16 hours ago
While it is true that model makers are increasingly trying to game benchmarks, it's also true that benchmark-chasing is lowering model quality. GPT 5, 5.1 and 5.2 have been nearly universally panned by almost every class of user, despite being benchmark monsters. In fact, the more OpenAI tries to benchmark-max, the worse their models seem to get.
astrange 16 hours ago
malnourish 8 hours ago
reilly3000 12 hours ago
I’d love to see how the new flash-3 model would fare.
elif 14 hours ago
Give it the gameFAQ next time