Gemini 3.1 Pro (blog.google)
330 points by MallocVoidstar 7 hours ago
spankalee 5 hours ago
I hope this works better than 3.0 Pro
I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.
It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.
Within VS Code Copilot Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools. In Copilot it, won't stop and ask clarifying questions, though in Gemini CLI it will.
So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.
For as much as I hear Google's pulling ahead, Anthropic seems to be to me, from a practical POV. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.
bluegatty 9 minutes ago
Yes, this is very true and it speaks strongly to this wayward notion of 'models' - it depends so much on the tuning, the harness, the tools.
I think it speaks to the broader notion of AGI as well.
Claude is definitively trained on the process of coding not just the code, that much is clear.
Codex has the same limitation but not quite as bad.
This may be a result of Anthropic using 'user cues' with respect to what are good completions and not, and feeding that into the tuning, among other things.
Anthropic is winning coding and related tasks because they're focused on that, Google is probably oriented towards a more general solution, and so, it's stuck in 'jack of all trades master of none' mode.
raducu 4 minutes ago
> Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress.
Yes, gemini loops but I've found almost always it's just a matter of interrupting and telling it to continue.
Claude is very good until it tries something 2-3 times, can't figure it out and then tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask) OR introduce hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.
jpcompartir 3 minutes ago
Yep, Gemini is virtually unusable compared to Anthropic models. I get it for free with work and use maybe once a week, if that. They really need to fix the instruction following.
karmasimida 4 hours ago
Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.
OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly
swftarrow 27 minutes ago
I suspect a large part of Google's lag is due to being overly focused on integrating Gemini with their existing product and app lines.
onlyrealcuzzo 3 hours ago
Because Search is not agentic.
Most of Gemini's users are Search converts doing extended-Search-like behaviors.
Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.
Macha 3 hours ago
alphabetting 3 hours ago
the agentic benchmarks for 3.1 indicate Gemini has caught up. the gains are big from 3.0 to 3.1.
For example the APEX-Agents benchmark for long time horizon investment banking, consulting and legal work:
1. Gemini 3.1 Pro - 33.2% 2. Opus 4.6 - 29.8% 3. GPT 5.2 Codex - 27.6% 4. Gemini Flash 3.0 - 24.0% 5. GPT 5.2 - 23.0% 6. Gemini 3.0 Pro - 18.0%
kakugawa an hour ago
metadat 9 minutes ago
girvo an hour ago
blueaquilae an hour ago
HardCodedBias 2 hours ago
hintymad 41 minutes ago
My guess is that Gemini team didn't focus on the large-scale RL training for the agentic workload. And they are trying to catch up with 3.1.
ionwake 3 hours ago
Can you explain what you mean by its bad at agentic stuff?
karmasimida 3 hours ago
renegade-otter an hour ago
It's like anything Google - they do the cool part and then lose interest with the last 10%. Writing code is easy, building products that print money is hard.
miohtama 27 minutes ago
s3p 4 hours ago
Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane. "I'm diving in to the problem", "I'm fully immersed" or "I'm meticulously crafting the answer"
foz 3 hours ago
This is part of the reason I don't like to use it. I feel it's hiding things from me, compared to other models that very clearly share what they are thinking.
dumpsterdiver an hour ago
dist-epoch 3 hours ago
That's not the real thinking, it's a super summarized view of it.
Oras 4 hours ago
Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.
stephen_cagle 2 hours ago
I also worked at Google (on the original Gemini, when it was still Bard internally) and my experience largely mirrors this. My finding is that Gemini is pretty great for factual information and also it is the only one that I can reliably (even with the video camera) take a picture of a bird and have it tell me what the bird is. But it is just pretty bad as a model to help with development, myself and everyone I know uses Claude. The benchmarks are always really close, but my experience is that it does not translate to real world (mostly coding) task.
tldr; It is great at search, not so much action.
neves 32 minutes ago
Gemini interesting with Google software gives me the best feature of all LLMs. When I receive a invite for an event, I screenshot it, share with Gemini app and say: add to my Calendar.
It's not very complex, but a great time saver
stephen_cagle 15 minutes ago
menaerus an hour ago
I don't know ... as of now I am literally instructing it to solve the chained expression computation problem which incurs a lot of temporary variables, of which some can be elided by the compiler and some cannot. Think linear algebra expressions which yield a lot of intermediate computations for which you don't want to create a temporary. This is production code and not an easy problem.
And yet it happily told me what I exactly wanted it to tell me - rewrite the goddamn thing using the (C++) expression templates. And voila, it took "it" 10 minutes to spit out the high-quality code that works.
My biggest gripe for now with Gemini is that Antigravity seems to be written by the model and I am experiencing more hiccups than I would like to, sometimes it's just stuck.
stephen_cagle 3 minutes ago
cmrdporcupine 6 minutes ago
agentifysh 3 hours ago
Relieved to read this from an ex-Googler at least we are no the crazy ones we are made out to be whenever we point out issues with Gemini
knollimar 5 hours ago
Is the thinking token stream obfuscated?
Im fully immersed
orbital-decay 4 hours ago
It's just a summary generated by a really tiny model. I guess it also an ad-hoc way to obfuscate it, yes. In particular they're hiding prompt injections they're dynamically adding sometimes. Actual CoT is hidden and entirely different from that summary. It's not very useful for you as a user, though (neither is the summary).
ukuina 4 hours ago
FergusArgyll 3 hours ago
cubefox 2 hours ago
slopinthebag 4 hours ago
Hmm, interesting..
My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?
mikestorrent 35 minutes ago
I think that you should really try to get whatever agent you can to work on that kind of thing for you - guide it with the creation of testing frameworks and code coverage, focus more on the test cases with your human intellect, and let it work to pass them.
slopinthebag 13 minutes ago
jbellis 4 hours ago
yeah, g3p is as smart or smarter as the other flagships but it's just not reliable enough, it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.
https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/
hopefully 3.1 is better.
nicce 3 hours ago
> it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.
Maybe it is just a genius business strategy.
mikestorrent 33 minutes ago
varispeed 2 hours ago
> stuck in loops
I wonder if there is some form of cheating. Many times I found that after a while Gemini becomes like a Markov chain spouting nonsense on repeat suddenly and doesn't react to user input anymore.
sdeiley 39 minutes ago
People underrate Google's cost effectiveness so much. Half price of Opus. HALF.
Think about ANY other product and what you'd expect from the competition thats half the price. Yet people here act like Gemini is dead weight
____
Update:
3.1 was 40% of the cost to run AA index vs Opus Thinking AND SONNET, beat Opus, and still 30% faster for output speed.
https://artificialanalysis.ai/?speed=intelligence-vs-speed&m...
nu11ptr 17 minutes ago
That sounds great, but if Opus generates 20% better code think of the ramifications of that on a real world project. Already $100/month gets you a programmer (or maybe even 2 or 3) that can do your work for you. Insanity. Do I even care if there is something 80% as good for 50% the cost? My answer: no. That said, if it is every bit as good, and their benchmarks suggest it is (but proof will be in testing it out), then sure, a 50% cost reduction sounds really nice.
bluegatty 6 minutes ago
You can pay 1 cent for a mediocre answer or 2 cents for a great answer.
So a lot of these things are relative.
Now if that equation plays out 20K times a day, well that's one thing, but if it's 'once a day' then the cost basis becomes irrelevant. Like the cost of staplers for the Medical Device company.
Obviously it will matter, but for development ... it's probably worth it to pay $300/mo for the best model, when the second best is $0.
For consumer AI, the math will be different ... and that will be a big deal in the long run.
jstummbillig 35 minutes ago
It's not half price or cost effective if it can't do the job, that I am happy to pay twice the price for to get done.
But I agree: If they can get there (at one point in the past year I felt they were the best choice for agentic coding), their pricing is very interesting. I am optimistic that it would not require them to go up to Opus pricing.
fastball 8 minutes ago
We are not at the moment where price matters. All that matters is performance.
willis936 a minute ago
It matters to me. I pay for it and I like using it. I pick my models to keep my spend reigned in.
csmpltn 21 minutes ago
> "People underrate Google's cost effectiveness so much. Half price of Opus. HALF."
Google undercutting/subsidizing it's own prices to bite into Anthropic's market share (whilst selling at a loss) doesn't automatically mean Google is effective.
sdeiley 14 minutes ago
Everybody is subsidizing their prices.
But Flash is 1/8 the cost of sonnet and its not impressive?
mritchie712 23 minutes ago
It's half the price per token. Not all tokens are generated equally.
sdeiley 12 minutes ago
Neither are cars but Ill take a Porsche over a Ferrari for a fraction of the price.
Decabytes 33 minutes ago
Any tips for working with Gemini through its chat interface? I’ve worked with ChatGPT and Claude and I’ve generally found them pleasant to work with, but everytime I use Gemini the output is straight dookie
londons_explore 4 minutes ago
make sure you use ai studio (not the vertex one), not the consumer gemini interface. Seems to work better for code there.
cyanydeez 30 minutes ago
Some people like blackjack and a technical edge with card counting, others just say screw it and do slot machines.
Svoka 31 minutes ago
While price is definitely important, results are extremely important. Gemini often falls into the 'didn't do' it part of the spectrum, this days Opus almost always does 'good enough'.
Gemini definitely has its merits but for me it just doesn't do what other models can. I vibe-coded an app which recommends me restaurants. The app uses gemini API to make restaurants given bunch of data and prompt.
App itself is vibe-coded with Opus. Gemini didn't cut it.
sdeiley 6 minutes ago
The binary you draw on models that havent been out a quarter is borderline insane.
Opus is absurdly good in Claude code but theres a lot of use cases Gemini is great at.
I think Google is further behind with the harness than the model
SV_BubbleTime 31 minutes ago
Well, it’s half if the product is equal.
Is it? Honestly, I still chuckle about black Nazis and the female Indian Popes. That was my first impression of Gemini, and first impressions are hard to break. I used Gemini’s VL (vision) for something and it refused to describe because it assumed it was NSFW imagery, which is was not.
I also question statis as an obvious follow up. Is Gemini equal to Opus? Today? Tomorrow? Has Google led the industry thus far and do I expect them to continue?
Counterpoint to that would be that with natural language input and output, that LLM specific tooling is rare and it is easy to switch around if you commoditize the product backend.
varispeed 19 minutes ago
If something is shit, it doesn't matter it costs half price of something okay.
sheepscreek 15 minutes ago
If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts).
So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too.
While it gives me hope, I am going to play it by the ear. Otherwise it’s going to be - Gemini for world knowledge/general intelligence/R&D and Opus/Sonnet 4.6 to finish it off.
sigmoid10 8 minutes ago
For me it's Opus 4.6 for researching code/digging through repos, gpt 5.3 codex for writing code, gemini for single hardcore science/math algorithms and grok for things the others refuse to answer or skirt around (e.g. some security/exploitability related queries). Get yourself one of those wrappers that support all models and forget thinking about who has the best model. The question is who has the best model for your problem. And there's usually a correct answer, even if it changes regularly.
xrd 4 hours ago
These models are so powerful.
It's totally possible to build entire software products in the fraction of the time it took before.
But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent.
It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.
Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.
I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtle different behavioral characteristics without notice. It'll be quicksand.
simonw 4 hours ago
I had an interesting experience recently where I ran Opus 4.6 against a problem that o4-mini had previously convinced me wasn't tractable... and Opus 4.6 found me a great solution. https://github.com/simonw/sqlite-chronicle/issues/20
This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.
small_model 3 hours ago
I have a codebase (personal project) and every time there is a new Claude Opus model I get it to do a full code review. Never had any breakages in last couple of model updates. Worried one day it just generates a binary and deletes all the code.
TZubiri 2 hours ago
poszlem 17 minutes ago
This may seem obvious, but many people overlook it. The effect is especially clear when using an AI music model. For example, in Suno AI you can remaster an older AI generated track with a newer model. I do this with all my songs whenever a new model is released. It makes it super easy to see the improvements that were made to the models over time.
jauntywundrkind 3 hours ago
From the project description here for your sqlite-chronicle project:
> Use triggers to track when rows in a SQLite table were updated or deleted
Just a note in case its interesting to anyone, sqlite compatible Turso database has CDC, a changes table! https://turso.tech/blog/introducing-change-data-capture-in-t...
petesergeant 4 hours ago
I continue to get great value out of having claude and codex bound together in a loop: https://github.com/pjlsergeant/moarcode
apitman 3 hours ago
jama211 4 hours ago
Yeah I keep maintaining a specific app I built with gpt 5.1 codex max with that exact model because it continues to work for the requests I send it, and attempts with other models even 5.2 or 5.3 codex seemed to have odd results. If I were superstitious I would say it’s almost like the model that wrote the code likes to work on the code better. Perhaps there’s something about the structure it created though that it finds easier to understand…
seizethecheese 4 hours ago
> It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
In my experience it’s more like idiot savant engineers. Still remarkable.
WarmWash 4 hours ago
I have long suspected that a large part of people's distaste for given models comes from their comfort with their daily driver.
Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).
worldsavior 4 hours ago
Sushy dinner? What are you building with AI, a calculator?
HardCodedBias 2 hours ago
"These models are so powerful."
Careful.
Gemini simply, as of 3.0, isn't in the same class for work.
We'll see in a week or two if it really is any good.
Bravo to those who are willing to give up their time to test for Google to see if the model is really there.
(history says it won't be. Ant and OAI really are the only two in this race ATM).
minimaxir 6 hours ago
Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing
Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3
Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.
amluto an hour ago
Now compare the monthly plans for business users who want the CLI agent but who don’t want the models trained on their data.
OpenAI: no big deal — sign up, pick your number of seats, and you’re all set.
Anthropic: also no big deal but there’s an obnoxious minimum purchase.
Google: first you have to try to figure out what the product is called. Then you need to figure out how to set the correct IAM rules. Then you have to sign up and pay for it. Maybe you succeed. Maybe you give up after an hour or two of cursing. Gemini is, of course, completely unable to help. (OpenAI clearly has not trained their models on how to operate their tools. Google’s models hallucinate Google’s product offerings so outrageously that I’m not sure I can call. I haven’t asked Claude about Claude Code.)
At least the monthly pricing is similar once you get over the hurdles.
waffletower 20 minutes ago
Well some are using Anthropic on AWS Bedrock which is a bit more like the Google paragraph. Perhaps a good thing that Nova models aren't competitive (and many here are asking "What's a Nova model?"). And remember, many businesses aren't flinching at IAM controls and are asking for data privacy contracts.
miohtama 24 minutes ago
I can confirm the products bit, I tried to use Gemini to help with G Suite admin.
rancar2 5 hours ago
If we don't see a huge gain on the long-term horizon thinking reflected with the Vendor-Bench 2, I'm not going to switch away from CC. Until Google can beat Anthropic on that front, Claude Code paired with the top long-horizon models will continue to pull away with full stack optimizations at every layer.
jbellis 4 hours ago
still no minimal reasoning in G3.1P :(
(this is why Opus 4.6 is worth the price -- turning off thinking makes it 3x-5x faster but it loses only a small amount of intelligence. nobody else has figured that out yet)
sunaookami 2 hours ago
Thinking is just tacked on for Anthropic's models and always has been so leaving it off actually produces better results everytime.
girvo an hour ago
agentifysh 2 hours ago
Looks like its cheaper than codex ??? this might be interesting then
TZubiri 2 hours ago
It's not trained for agentic coding I don't think
oblio 2 hours ago
> Knowledge cutoff is unchanged at Jan 2025.
Isn't that a bit old?
minimaxir 2 hours ago
Old relative to its competitors, but the Search tool can compensate for it.
plaidfuji 5 hours ago
Sounds like the update is mostly system prompt + changes to orchestration / tool use around the core model, if the knowledge cutoff is unchanged
sigmar 5 hours ago
knowledge cutoff staying the same likely means they didn't do a new pre-train. We already knew there were plans from deepmind to integrate new RL changes in the post training of the weights. https://x.com/ankesh_anand/status/2002017859443233017
brokencode 4 hours ago
This keeps getting repeated for all kinds of model releases, but isn’t necessarily true. It’s possible to make all kinds of changes without updating the pretraining data set. You can’t judge a model’s newness based on what it knows about.
mijoharas 5 hours ago
Gemini 3 is still in preview (limited rate limits) and 2.5 is deprecated (still live but won't be for long).[0]
Are Google planning to put any of their models into production any time soon?
Also somewhat funny that some models are deprecated without a suggested alternative(gemini-2.5-flash-lite). Do they suggest people switch to Claude?
andrewmutz 5 hours ago
I agree completely. I don't know how anyone can be building on these models when all of them are either deprecated or not actually released yet. As someone who has production systems running on the deprecated models, this situation really causes me grief.
irthomasthomas 43 minutes ago
I dont think any of them really wants api customers in the end. They are only temporarily useful.
moffkalast an hour ago
Well let me use llama.cpp to run worlds-smallest-violin-Q8.gguf
When you build on something that can be rugpulled at any moment, that's really kind of on you.
NitpickLawyer 3 hours ago
You are reading your link wrong. They are deprecating 2.5-preview models. 2.5 (including lite) are up till at least sept/oct 26.
mijoharas 3 hours ago
gemini-2.5-pro has a listed shutdown date of "June 17, 2026" in the linked table.
(Another commenter pointed out that this is the earliest shutdown date and it won't necessarily be shut down on that date).
Where are you getting sept/Oct from? I see gemini-2.5-flash-image in October, but everything else looks like June/July to me?
vidarh 5 hours ago
This feels very Google
drbacon 2 hours ago
I found the Googler!
vidarh an hour ago
axpy906 2 hours ago
Have 2.5 in prod. Hope they release 3 lite soon so it will be easier to swap them. Holding my breath as pro pricing is a non starter.
jakejohnson 4 hours ago
I haven't seen any deprecation notices for 2.5 yet, just for 2. I'd expect (and hope) the deprecation timeline for 2.5 is longer since 3.0 is still in preview. Maybe they just default to 1 year here?
> Note: The shutdown dates listed in the table indicate the /earliest/ possible dates on which a model might be retired. We will communicate the exact shutdown date to users with advance notice to ensure a smooth transition to a replacement model.
mijoharas 3 hours ago
I think you're right, it was 2 I think I saw explicitly deprecated, then searched again and saw 2.5 having a shutdown date.
This article[0] talks about 2 being deprecated.
It's still frustrating that they don't have proper production endpoints for 3.0 yet.
logicchains an hour ago
It's the Tensorflow strategy; deprecate the old version while the new version still doesn't support all the old one's functionality.
incognito124 28 minutes ago
welcome to Google, where the only kinds of systems are the deprecated ones, and the not-yet-production-ready ones
1024core 5 hours ago
It got the car wash question perfectly:
You are definitely going to have to drive it there—unless you want to put it in neutral and push!
While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.
Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.
Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
1024core 4 hours ago
Some people are suggesting that this might actually be in the training set. Since I can't rule that out, I tried a different version of the question, with an elephant instead of a car:
> It's a hot and dusty day in Arizona and I need to wash my elephant. There's a creek 300 feet away. Should I ride my elephant there or should I just walk there by myself?
Gemini said:
That sounds like quite the dusty predicament! Given the options, you should absolutely ride (or lead) your elephant to the creek.
Here is the straightforward reality of the situation:
The Logistical Nightmare of Walking Alone: If you walk to the creek by yourself, your elephant stays at your starting point. Elephants require a massive amount of water for a bath. Unless you are planning an extreme workout carrying hundreds of buckets of water 300 feet back through the Arizona heat, you're going to have a very bad time—and the elephant will stay dusty.
The Perks of Going Together: 300 feet is a very short, easy stroll for an elephant. Plus, elephants generally love water and mud! Letting your elephant get right into the creek will make the washing process much easier for you and much more enjoyable for them. They can even use their trunk to help hose themselves down.
Grab some elephant-safe soap and head down to the water together!Would you like some tips on the best methods and tools for scrubbing down an elephant?
kovek 4 hours ago
I think that semantically this question is too similar to the car wash one. Changing subjects from car to elephant and car wash to creek does not change the fact that they are subjects. The embeddings will be similar in that dimension.
1024core 4 hours ago
irthomasthomas 37 minutes ago
i would say this is a lower difficulty. the car question primes it to think about stuff like energy and pollution.
jmkni an hour ago
I like telling it the car wash is 24,901.1 miles away (with the implication being that I'm right beside the car wash)
proto-n 3 hours ago
From Gemini pro:
You should definitely ride the elephant (or at least lead it there)!
Here is the logic:
If you walk there by yourself, you will arrive at the creek, but the dirty elephant will still be 300 feet back where you started. You can't wash the elephant if it isn't with you!
Plus, it is much easier to take the elephant to the water than it is to carry enough buckets of water 300 feet back to the elephant.
Would you like another riddle, or perhaps some actual tips on how to keep cool in the Arizona heat?
embedding-shape 4 hours ago
GPT-OSS-120b, a (downloadable) model released more than half a year ago also gets that right, I'm not sure this is such a great success.
> Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after, even if I look at the weather report and it says sunny. Cute that Gemini thinks it can predict it better than universe's will.
sushisource an hour ago
> Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after
Undeniable universal truth. I sometimes find myself making plans based on the fact that the most annoying possible outcome is also the most likely one.
Aachen an hour ago
Non car person here. Why does that matter? It's not like rain means you didn't have to go to the wash, it rains often enough here that there wouldn't be car wash places left near me but there are plenty
opinion-is-bad 39 minutes ago
steve1977 4 hours ago
The question of course is, did it get the car wash question right because it is "the car wash question" or because it could actually infer why the car needed to be there?
embedding-shape 4 hours ago
Wasn't that "twoot" (or whatever Mastodon calls them) made just a week ago? Unlikely to have been in the training dataset of a model becoming available for public use today, unless Google made some serious advancements on the training front.
jama211 4 hours ago
Shouldn’t be too hard to come up with a new unique reasoning question
leumon 2 hours ago
Gemini 3 pro and flash already answered this correctly.
upmind 4 hours ago
The answer here is why I dislike Gemini, though it gets the correct answer, it's far too verbose.
KoolKat23 an hour ago
I can't stand a model over-explaining, needless fluff and wasting tokens. I asked the question so I know the context.
mikelward 2 hours ago
I don't love the verbosity of any of the chatbots when I'm using my phone, but at least it put the answer/tl;dr in the first paragraph.
patrickmcnamara 4 hours ago
Truly we entering the era of AGI.
suddenlybananas 4 hours ago
They probably had time to toss that example in the training soup.
AlphaAndOmega0 4 hours ago
Previous models from competitors usually got that correct, and the reasoning versions almost always did.
This kind of reflexive criticism isn't helpful, it's closer to a fully generalized counter-argument against LLM progress, whereas it's obvious to anyone that models today can do things they couldn't do six months ago, let alone 2 years back.
suddenlybananas 4 hours ago
buttered_toast 4 hours ago
I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence.
The model getting it correct or not at any given instance isn't the point, the point is if the model ever gets it wrong we can still assume that it still has some semblance of stochasticity in its output, given that a model is essentially static once it is released.
Additionally, hey don't learn post training (except for in context which I think counts as learning to some degree albeit transient), if hypothetically it answers incorrectly 1 in 50 attempts, and I explain in that 1 failed attempt why it is wrong, it will still be a 1-50 chance it gets it wrong in a new instance.
This differs from humans, say for example I give an average person the "what do you put in a toaster" trick and they fall for it, I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it, you can't really say that for a given model.
energy123 3 hours ago
They're important but not as N=1. It's like cherry picking a single question from SimpleQA and going aha! It got it right! Meanwhile it's 8% lower score than some other model when evaluated on all questions.
buttered_toast 3 hours ago
nickandbro 6 hours ago
Does well on SVGs outside of "pelican riding on a bicycle" test. Like this prompt:
"create a svg of a unicorn playing xbox"
https://www.svgviewer.dev/s/NeKACuHj
Still some tweaks to the final result, but I am guessing with the ARC-AGI benchmark jumping so much, the model's visual abilities are allowing it to do this well.
pugio 5 minutes ago
Unfortunately it still fails my personal SVG benchmark (educational 2d cross section of the human heart), even after multiple iterations and screenshots feedback. Oh well, back to the (human) drawing board.
ertgbnm an hour ago
Animated SVGs are one of the example in the press release. Which is fine, I just think the weird SVG benchmark is now dead. Gemini has beat the benchmark and now differences are just coming down to taste.
I don't know if it got these abilities through generalization or if google gave it a dedicated animated SVG RL suite that got it to improve so much between models.
Regardless we need a new vibe check benchmark ala bicycle pelican.
simonw 6 hours ago
Interesting how it went a bit more 3D with the style of that one compared to the pelican I got.
andy12_ 6 hours ago
I'm thinking now that as models get better and better at generating SVGs, there could be a point where we can use them to just make arbitrary UIs and interactive media with raw SVGs in realtime (like flash games).
rafark 3 hours ago
> there could be a point where we can use them to just make arbitrary UIs and interactive media with raw SVGs
So render ui elements using xml-like code in a web browser? You’re not going to believe me when I tell you this…
nickandbro 6 hours ago
Or quite literally a game where SVG assets are generated on the fly using this model
kridsdale3 4 hours ago
mclau153 3 hours ago
can we move on from SVG to 3D models at some point?
knicholes 12 minutes ago
Image to model is already a thing, and it's pretty good.
roryirvine 5 hours ago
On the other hand, creation of other vector image formats (eg. "create a postscript file showing a walrus brushing its teeth") hasn't improved nearly so much.
Perhaps they're deliberately optimising for SVG generation.
simonw 6 hours ago
Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - took over 5 minutes though, but I think that's because they're having performance teething problems on launch day.
embedding-shape 6 hours ago
It's an excellent demonstration of the main issue I have with the Gemini family of models, they always go "above and beyond" to do a lot of stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but clouds, a sun, a hat on the pelican and so much more.
Exactly the same thing happens when you code, it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. Very frustrating experience overall.
mullingitover 5 hours ago
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
Just asking "Explain what this service does?" turns into
[No response for three minutes...]
+729 -522
cowmoo728 5 hours ago
BartShoot 4 hours ago
quotemstr 4 hours ago
moffkalast an hour ago
kylec 5 hours ago
Yizahi 7 minutes ago
Asking LLM programs to "not do the thing" often results in them tripping and generating output including that "thing", since those are simply the tokens which will enter the input. I always try to rephrase query the way that all my instructions have only "positive" forms - "do only this" or "do it only in that way" or "do it only for those parameters requested" etc. Can't say if that helps much, but it is possible.
h14h 4 hours ago
Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.
Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.
h14h 29 minutes ago
cglan 3 hours ago
enobrev 5 hours ago
I have the same issue. Even when I ask it to do code-reviews and very explicitly tell it not to change files, it will occasionally just start "fixing" things.
mikepurvis 5 hours ago
neya 5 hours ago
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
This has not been my experience. I do Elixir primarily and Gemini has helped build some really cool products and massive refactors along the way. And it would even pick up security issues and potential optimizations along the way
What HAS been an issue constantly though was randomly the model will absolutely not respond at all and some random error would occur which is embarrassing for a company like Google with the infrastructure they own.
embedding-shape 4 hours ago
msteffen 3 hours ago
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
Not like human programmers. I would never do this and have never struggled with it in the past, no...
embedding-shape 3 hours ago
apitman 3 hours ago
This matches my experience using Gemini CLI to code. It would also frequently get stuck in loops. It was so bad compared to Codex that I feel like I must have been doing something fundamentally wrong.
tyfon 5 hours ago
I was using gemini antigravity in opencode a few weeks ago before they started banning everyone for that and I got into the habit of writing "do x, then wait for instructions".
That helped quite a bit but it would still go off on it's own from time to time.
JLCarveth 4 hours ago
Every time I have tried using `gemini-cli` it just thinks endlessly and never actually gives a response.
gavinray 5 hours ago
Do you have Personalization Instructions set up for your LLM models?
You can make their responses fairly dry/brief.
embedding-shape 5 hours ago
metal_am 5 hours ago
zengineer 5 hours ago
true, whenever I ask Gemini to help me with a prompt for generating an image of XYZ, it generates the image.
jasonjmcghee 4 hours ago
What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.
The most absurd benchmaxxing.
https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...
simonw 3 hours ago
I like how they also did a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.
jasonjmcghee 2 hours ago
simultsop 2 hours ago
threatofrain 3 hours ago
Animated SVG is huge. People in different professions are worrying to different degrees in terms of being replaced by ML, but this one is huge with regards to digital art.
yieldcrv an hour ago
eurekin 3 hours ago
Can't wait until they finally get to real world CAD
tngranados 3 hours ago
tantalor 3 hours ago
He's svg-mogging
gnatolf 3 hours ago
So let's put things we're interested in in the benchmarks.
I'm not against pelicans!
ghurtado 3 hours ago
casey2 3 hours ago
You don't have to benchmax everything, just the benchmarks in the right social circles
UltraSane 3 hours ago
It if funny to think that Jeff Dean personally worked to optimize the pelican riding a bike benchmark.
MrCheeze 5 hours ago
Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.
tedsanders 3 hours ago
A few thoughts:
- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).
- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.
emp17344 2 hours ago
simonw 4 hours ago
My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.
mitkebes 2 hours ago
All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc in addition to text. Other models like earlier chatgpt builds would use plugins to handle anything beyond text, such as using a plugin to convert an image into text so that chatgpt could "see" it.
Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.
The newer chatgpt models are also now multimodal, which has probably helped with their svg art as well, but I think Gemini still has an edge here
pknerd 4 hours ago
> Does anyone understand why LLMs have gotten so good at this?
Added more IF/THEN/ELSE conditions.
kridsdale3 4 hours ago
sam_1421 5 hours ago
Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes
cbsks 5 hours ago
That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
travisgriggs 4 hours ago
embedding-shape 5 hours ago
Soon? I'd be willing to bet it's been included in the training set at least 6 months by now. Not so obvious so it generates always perfect pelicans on bikes, but sufficiently for the "minibench" to be less useful today than in the past.
Rudybega 2 hours ago
jsheard 5 hours ago
Simons been doing this exact test for nearly 18 months now, if vendors want to benchmaxx it then they've had more than enough time to do so already.
stri8ted 5 hours ago
ks2048 3 hours ago
Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.
brikym an hour ago
Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.
SoKamil 5 hours ago
It seems they trained the model to output good svg’s.
In their blog post[1], first use case they mention is svg generation. Thus, it might not be any indicator at all anymore.
[1] https://blog.google/innovation-and-ai/models-and-research/ge...
culi 2 hours ago
Cost per task has increased 4.2x but their ARC-AGI-2 score went from 33.6% to 77.1%
Cost per task is still significantly lower than Opus. Even Opus 4.5
Arcuru 6 hours ago
Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/
simonw 4 hours ago
It seems to be having capacity problems right now but I'll run that as soon as I can get it to work.
simonw 2 hours ago
WarmWash 5 hours ago
Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.
InitialLastName 5 hours ago
I don't know what of this is the prompt and what was the output, but that's a pretty bad schematic (for both aesthetic and circuit-design reasons).
WarmWash 4 hours ago
svnt 5 hours ago
0_____0 5 hours ago
that's pretty amazing for an LLM but as an EE, if my intern did this i would sigh inwardly and pull up some existing schematics for some brief guidance on symbol layout.
AmazingTurtle 5 hours ago
At this point, the pelican benchmark became so widely used that there must be high quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?
ascorbic 3 hours ago
Loads of examples here https://x.com/jeffdean/status/2024525132266688757
tromp 4 hours ago
Or, even more challenging, an okapi on a recumbent ?!
steve_adams_86 5 hours ago
Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub
But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. I wouldn't have guessed this was going to emerge as a capability from LLMs 6 years ago. I see why it does now, but... It still amazes me that they're so good at some things.
emp17344 5 hours ago
Is this capability “emergent”, or do AI firms specifically target SVG generation in order to improve it? How would we be able to tell?
steve_adams_86 4 hours ago
simonw 4 hours ago
0_____0 5 hours ago
next time you host a party, have people try to draw a bicycle on your whiteboard (you have a whiteboard in your house right? you should, anyway...)
human adults are generally quite bad at drawing them, unless they spend a lot of time actually thinking about bicycles as objects
542354234235 5 hours ago
emp17344 5 hours ago
HPsquared 5 hours ago
And the left leg is straight while the right leg is bent.
EDIT: And the chain should pass behind the seat stay.
bredren 6 hours ago
What is that, a snack in the basket?
sigmar 6 hours ago
"integrating a bicycle basket, complete with a fish for the pelican... also ensuring the basket is on top of the bike, and that the fish is correctly positioned with its head up... basket is orange, with a fish inside for fun."
how thoughtful of the ai to include a snack. truly a "thanks for all the fish"
defen 5 hours ago
WarmWash 6 hours ago
A fish for the road
tarr11 4 hours ago
What do you think this particular prompt is evaluating for?
The more popular these particular evals are, the more likely the model will be trained for them.
Gander5739 4 hours ago
TZubiri 2 hours ago
You think they are able to see their output and iterate on it? Or is it pure token generation?
infthi 5 hours ago
Wonder when will we get something other than a side view
mikepurvis 5 hours ago
That would be a especially challenging for vector output. I tried just now on ChatGPT 5.2 to jump straight to an image, with this prompt:
"make me a cartoon image of a pelican riding a bicycle, but make it from a front 3/4 view, that is riding toward the viewer."
The result was basically a head-on view, but I expect if you then put that back in and said, "take this image and vectorize it as an SVG" you'd have a much better time than trying to one-shot the SVG directly from a description.
... but of course, if that's so, then what's preventing the model from being smart enough to identify this workflow and follow it on its own to get the task completed?
calny 6 hours ago
Great pelican but what’s up with that fish in the basket?
coldtea 5 hours ago
It's a pelican. What do you expect a pelican to have in his bike's basket?
It's a pretty funny and coherent touch!
embedding-shape 5 hours ago
nicr_22 4 hours ago
Yeah, why only _one_ fish?
It's obvious that pelican is riding long distance, no way a single fish is sufficiently energy dense for more than a few miles.
Can't the model do basic math???
gavinray 5 hours ago
Where else are cycling Pelican's meant to keep their fish?
calny 5 hours ago
mohsen1 5 hours ago
is there something in your prompt about hats? why the pelican always wearing a hat recently?!
bigfishrunning 5 hours ago
At this point, i think maybe they're training on all of the previous pelicans, and one of them decided to put a hat on it?
Disclaimer: This is an unsubstantiated claim that i made up
xnx 6 hours ago
Not even animated? This is 2026.
readitalready 5 hours ago
Jeff Dean just posted an animated version: https://x.com/JeffDean/status/2024525132266688757
benbreen 5 hours ago
bigfishrunning 5 hours ago
DonHopkins 4 hours ago
How about STL files for 3d printing pelicans!
baq 3 hours ago
Harder: the bike must work
Hardest: the pelican must work
benatkin 5 hours ago
I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output
saberience 5 hours ago
I hope we keep beating this dead horse some more, I'm still not tired of it.
Robdel12 6 hours ago
I really want to use google’s models but they have the classic Google product problem that we all like to complain about.
I am legit scared to login and use Gemini CLI because the last time I thought I was using my “free” account allowance via Google workspace. Ended up spending $10 before realizing it was API billing and the UI was so hard to figure out I gave up. I’m sure I can spend 20-40 more mins to sort this out, but ugh, I don’t want to.
With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?
alpineman 5 hours ago
100% agreed. I wish someone would make a test for how reliably the LLMs follow tool use instructions etc. The pelicans are nice but not useful for me to judge how well a model will slot into a production stack.
embedding-shape 5 hours ago
At first when I got started with using LLMs I read/analyzed benchmarks, looked at what example prompts people used and so on, but many times, a new model does best at the benchmark, and you think it'll be better, but then in real work, it completely drops the ball. Since then I've stopped even reading benchmarks, I don't care an iota about them, they always seem more misdirected than helpful.
Today I have my own private benchmarks, with tests I run myself, with private test cases I refuse to share publicly. These have been built up during the last 1/1.5 years, whenever I find something that my current model struggles with, then it becomes a new test case to include in the benchmark.
Nowadays it's as easy as `just bench $provider $model` and it runs my benchmarks against it, and I get a score that actually reflects what I use the models for, and it feels like it more or less matches with actually using the models. I recommend people who use LLMs for serious work to try the same approach, and stop relying on public benchmarks that (seemingly) are all gamed by now.
cdelsolar 5 hours ago
phamilton 5 hours ago
> For those building with a mix of bash and custom tools, Gemini 3.1 Pro Preview comes with a separate endpoint available via the API called gemini-3.1-pro-preview-customtools. This endpoint is better at prioritizing your custom tools (for example view_file or search_code).
It sounds like there was at least a deliberate attempt to improve it.
pdntspa 5 hours ago
You can delete the billing from a given API key
Stevvo 5 hours ago
You could always use it through Copilot. The credits based billing is pretty simple without surprise charges.
surgical_fire 5 hours ago
May be very silly of me, but I avoid using Gemini on my personal Google account. I use it at work, because my employer provides it.
I am scared some automated system may just decide I am doing something bad and terminate my account. I have been moving important things to Proton, but there are some stuff that I couldn't change that would cause me a lot of annoyance. It's not trivial to set up an alternative account just for Gemini, because my Google account is basically on every device I use.
I mostly use LLMs as coding assistant, learning assistant, and general queries (e.g.: It helped me set up a server for self hosting), so nothing weird.
paganel 7 minutes ago
Same feeling here, if it makes you feel any better (for sure it made me better seeing I'm not alone in this).
CamperBob2 4 hours ago
For what it's worth, there was an (unfortunately unsuccessful) HN submission from a guy who got his Gemini account banned, apparently without losing his whole Google account: https://news.ycombinator.com/item?id=47007906
surgical_fire 2 hours ago
abiraja 4 hours ago
I've been using it lately with OpenCode and it's working pretty well (except for API reliability issues).
horsawlarway 5 hours ago
So much this.
It's absolutely amazing how hostile Google is to releasing billing options that are reasonable, controllable, or even fucking understandable.
I want to do relatively simple things like:
1. Buy shit from you
2. For a controllable amount (ex - let me pick a limit on costs)
3. Without spending literally HOURS trying to understand 17 different fucking products, all overlapping, with myriad project configs, api keys that should work, then don't actually work, even though the billing links to the same damn api key page, and says it should work.
And frankly - you can't do any of it. No controls (at best delayed alerts). No clear access. No real product differentiation pages. No guides or onboarding pages to simplify the matter. No support. SHIT LOADS of completely incorrect and outdated docs, that link to dead pages, or say incorrect things.
So I won't buy shit from them. Period.
sciencejerk 5 hours ago
You think AWS is better?
3form 5 hours ago
pdimitar 4 hours ago
himata4113 5 hours ago
use openrouter instead
Robdel12 3 hours ago
This is actually an excellent idea, I’ll give this a shot tonight!
WarmWash 5 hours ago
3.1 Pro is the first model to correctly count the number of legs on my "five legged dog" test image. 3.0 flash was the previous best, getting it after a few prompts of poking. 3.1 got it on the first prompt though, with the prompt being "How many legs does the dog have? Count Carefully".
However, it didn't get it on the first try with the original prompt (prompt: "How many legs does the dog have?"). It initially said 4, then with a follow up prompt got it to hesitantly say 5, with one limb must being obfuscated or hidden.
So maybe I'll give it a 90%?
This is without tools as well.
merlindru 5 hours ago
your question may have become part of the training data with how much coverage there was around it. perhaps you should devise a new test :P
devsda 3 hours ago
I suggest asking it to identify/count the number of fire hydrants, crosswalks, bridges, bicycles, cars, buses and traffic signals etc.
Pit Google against Google :D
iamdelirium 4 hours ago
3.1 Pro has the same Jan 2025 knowledge cutoff as the other 3 series models. So if 3.1 has it in its training data, the other ones would have as well.
ainch an hour ago
gallerdude 5 hours ago
My job may have become part of the training data with how much coverage there is around it. Perhaps another career would be a better test of LLM capabilities.
suddenlybananas 4 hours ago
WarmWash 4 hours ago
Honestly at this point I have fed this image in so many times on so many models, that it also functions as a test for "Are they training on my image specifically" (they are generally, for sure, but that's along with everything else in the ocean of info people dump in).
I genuinely don't think they are. GPT-5.2 still stands by 4 legs, and OAI has been getting this image consistently for over a year. And 3.1 still fumbled with the harder prompt "How many legs does the dog have?". I needed to add the "count carefully" part to tip it off that something was amiss.
Since it did well I'll make some other "extremely far out of the norm" images to see how it fairs. A spider with 10 legs or a fish with two side fins.
wat10000 4 hours ago
Easy fix, make a new test image with six legs, and watch all the LLMs say it has five.
sigmar 6 hours ago
blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...
edit: biggest benchmark changes from 3 pro:
arc-agi-2 score went from 31.1% -> 77.1%
apex-agents score went from 18.4% -> 33.5%
ripbozo 6 hours ago
Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though i dont know what arc-agi-2 actually tests
maxall4 6 hours ago
Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspect of such a large improvement, especially since the improvement on other benchmarks is not of the same order.
moffkalast an hour ago
boplicity 6 hours ago
Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.
energy123 4 hours ago
Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.
tasuki 3 hours ago
CamperBob2 4 hours ago
blinding-streak 6 hours ago
I assume all the frontier models are benchmaxxing, so it would make sense
sho_hn 6 hours ago
The touted SVG improvements make me excited for animated pelicans.
takoid 6 hours ago
I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj
The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but this is definitely a significant leap in complexity.
onionisafruit 5 hours ago
tasuki 2 hours ago
makeavish 6 hours ago
benatkin 5 hours ago
james2doyle 5 hours ago
The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...
aoeusnth1 6 hours ago
I imagine they're also benchgooning on SVG generation
rdtsc 4 hours ago
My perennial joke is as soon as that got on HN front page Google went and hired some interns and they spend a 100% of the time on pelicans.
vunderba 5 hours ago
SVG is an under-rated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).
DonHopkins 4 hours ago
How about STL files for 3d printing pelicans!
zapnuk 27 minutes ago
Gemini 3 was:
1. unreliable in GH copilot. Lots of 500 and 4XX errors. Unusable in the first 2 months
2. not available in vertex ai (europe). We have requirements regarding data residency. Funny enough anthropic is on point with releasing their models to vertex ai. We already use opus and sonnet 4.6.
I hope google gets their stuff together and understands that not everyone wants/can use their global endpoint. We'd like to try their models.
esafak 6 hours ago
Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim the pole position, yet making strides on benchmarks? This is what recursive self-improvement with human support looks like.
emp17344 6 hours ago
Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of the sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced these benchmark improvements aren’t data leakage.
culi an hour ago
Look at the ARC site. The scores of these models is plotted against their "cost per task". All of these huge jumps come along with massive increases in cost per task. Including Gemini 3.1 Pro which increased by 4.2x
casey2 2 hours ago
ARC 2 was made specifically to artificially lower contemporary LLM scores, therefore any kind of model improvements will have outsized effects
Also people use "saturated" too liberally. The top left corner 1 cent per task is saturated IMO. Since there are billions of people who would perfer to solve arc 1 tasks at 52 cents per task. Arc 2 a human would make thousands of dollars a day with 99.99% accuracy
z3t4 an hour ago
alisonkisk 2 hours ago
redox99 6 hours ago
I don't think there's much recursive improvement yet.
I'd say it's a combination of
A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many Months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup, and in just a month get a better model using the same base model.
B) There's more compute online
C) Competition is more fierce.
culi an hour ago
I feel like they're actually dropping slower. Chinese models are dropping right before lunar new year as seems to be an emerging tradition.
A couple of western models have dropped around the same time too but I don't think the "strides on benchmarks" are that impressive when you consider how much tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1% buuut their "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements and similar for Claude model improvements.
I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage
m_ke 5 hours ago
this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)
so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long
ainch an hour ago
It's becoming impossible to keep up - in the last week or so we've had: Gemini 3 Deep Think, Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3-Codex Spark, GLM-5, Minimax-2.5, Step 3.5 Flash, Qwen 3.5 and Grok 4.20.
and I'm sure others I've missed...
ankit219 4 hours ago
not much to do with self improvement as such. openai has increased its pace, others are pretty much consistent. Google last year had three versions of gemini-2.5-pro each within a month of each other. Anthropic released claude 3 in march 24, sonnet 3.5 in june 24, 3.5 new in oct 24, and then 3.7 in feb 25, where they went to 4 series in May 25. then followed by opus 4.1 in august, sonnet 4.5 in oct, opus 4.5 in nov, 4.6 in feb, sonnet 4.6 in feb itself. Yes, they released both within weeks of each other, but originally they only released it together. This staggered release is what creates the impression of fast releases. its as much a function of training as a function of available compute, and they have ramped up in that regard.
oliveiracwb 5 hours ago
With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence
nikcub 4 hours ago
and anyone notice that the pace has broken xAI and they were just dropped behind? The frontier improvement release loop is now ant -> openai -> google
gavinray 3 hours ago
xAI just released Grok 4.20 beta yesterday or day before?
dist-epoch 3 hours ago
Musk said Grok 5 is currently being trained, and it has 7 trillion params (Grok 4 had 3)
svara 2 hours ago
gmerc 4 hours ago
That's what scaling compute depth to respond to the competition look like, lighting those dollars on fire.
toephu2 3 hours ago
This is what competition looks like.
PlatoIsADisease 6 hours ago
Only using my historical experience and not Gemini 3.1 Pro, I think we see benchmark chasing then a grand release of a model that gets press attention...
Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.
If we are benchmaxing this works well because its only being tested early on during the life cycle. By middle of the cycle, people are testing other models. By the end, people are not testing them, and if they did it would barely shake the last months of data.
KoolKat23 37 minutes ago
I have a relatively consistent task that it completed with new information on weekdays at the edge of its intelligence. Interestingly 3.0 flash was good when it came out, took a nose dive a month back and is now excellent, I actually can't fault it it's so good.
It's performance in antigravity has also actually improved since launch day where it was giving non-stop typescript errors (not sure if that was antigravity itself).
the_duke 6 hours ago
Gemini 3 is pretty good, even Flash is very smart for certain things, and fast!
BUT it is not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models (Codex 5.2/5.3, the last two versions of Anthropic models), and also fell behind a bit in reasoning.
I hope they manage to improve things on that front, because then Flash would be great for many tasks.
chermi 6 hours ago
You can really notice the tool use problems. They gotta get on that. The agent trend seems real, and powerful. They can't afford to fall behind on it.
HardCodedBias 2 hours ago
"They can't afford to fall behind on it."
They are very, very seriously far behind as of 3.0.
We'll see if 3.1 addresses the issue at all.
verdverm 5 hours ago
I don't really have tool usage issues that I don't put under that doesn't follow system prompt instructions consistently
there are these times where it puts a prefix on all function calls, which is weird and I think hallucination, so maybe that one
3.1 hopefully fixes that
verdverm 5 hours ago
These improvements are one of the things specifically called out on the submitted page
anthonypasq 6 hours ago
yeah, it seems to me like Gemini is a little behind on the current RL patterns and also they dont seem interested in really creating a dedicated coding model. I think they have so much product surface (search, AI mode, gmail, youtube, chrome etc), they are prioritizing making the model very general. but who knows im just talking out of my ass.
spwa4 6 hours ago
In other words: they just need to motivate their employees while giving in to finance's demands to fire a few thousand every month or so ...
And don't forget, it's not just direct motivation. You can make yourself indispensable by sabotaging or at least not contributing to your colleagues' efforts. Not helping anyone, by the way, is exactly what your managers want you to do. They will decide what happens, thank you very much, and doing anything outside of your org ... well there's a name for that, isn't there? Betrayal, or perhaps death penalty.
maxloh 6 hours ago
Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.
Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.
I hope the 3.1 release includes a much larger output limit.
NoahZuniga 5 hours ago
Output limit has consistently been 64k tokens (including 2.5 pro).
esafak 6 hours ago
People did find Gemini very talkative so it might be a response to that.
jayd16 6 hours ago
> Even when the model is explicitly instructed to pause due to insufficient tokens
Is there actually a chance it has the introspection to do anything with this request?
maxloh 5 hours ago
Yeah, it does. It was possible with 2.5 Flash.
Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...
jayd16 5 hours ago
verdverm 4 hours ago
No, the model doesn't have purview into this afaik
I'm not even sure what "pausing" means in this context and why it would help when there are insufficient tokens. They should just stop when you reach the limit, default or manually specified, but it's typically a cutoff.
You can see what happens by setting output token limit much lower
otabdeveloper4 5 hours ago
No.
MallocVoidstar 6 hours ago
> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response
AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.
maxloh 5 hours ago
Yeah, it does. It was possible with 2.5 Flash.
Here's a similar result with Qwen Qwen3.5-397B-A17B: https://chat.qwen.ai/s/530becb7-e16b-41ee-8621-af83994599ce?...
davidguetta 5 hours ago
Implementation and Sustainability Hardware: Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs). TPUs are specically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs. TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can lead to better model quality. TPU Pods (large clusters of TPUs) also provide a scalable solution for handling the growing complexity of large foundation models. Training can be distributed across multiple TPU devices for faster and more efficient processing.
So google doesn't use NVIDIA GPUs at all ?
dekhn 5 hours ago
When I worked there, there was a mix of training on nvidia GPUs (especially for sparse problems when TPUs weren't as capable), CPUs, and TPUs. I've been gone for a few years but I've heard a few anecdotal statements that some of their researchers have to use nvidia GPUs because the TPUs are busy.
sdeiley 28 minutes ago
Googler. We use GPUs, but its a drop in the bucket in the sea of our accelerators. We might sell more GPUs in Cloud than we use internally.
These are not data driven observations just vibes
rjh29 2 hours ago
I assume that's a Gemini LLM response? You can tell Gemini is bullshitting when it starts using "often" or "usually" - like in this case "TPUs often come with large amounts of memory". Either they did or they didn't. "This (particular) mall often has a Starbucks" was one I encountered recently.
w10-1 an hour ago
It's not bullshit (i.e., intended) but probabilities all the way down, as Hume reminded us: from observations, you can only say the sun will likely rise in the east. You'd need to stand behind a theory of the world to say otherwise (but we were told "attention is all you need"...)
PunchTornado 5 hours ago
no. only tpus
paride5745 5 hours ago
Another reason to use Gemini then.
Less impact on gamers…
TiredOfLife 4 hours ago
TPUs still use ram and chip production capacity
lejalv 5 hours ago
Bla bla bla yada sustainability yada often come with large better growing faster...
It's such an uninformative piece of marketing crap
tenpoundhammer 5 hours ago
In an attempt to get outside of benchmark gaming I had it make Platypus on a Tricycle. It's not as good as pelican on bicycle. https://www.svgviewer.dev/s/BiRht5hX
textlapse 5 hours ago
To really confuse it, ask it to take that tricycle with the platypus on it to a car wash.
dinosor 5 hours ago
For a moment I assumed the output would look like Perry the Platipus from the Disney (I think?) show. It's suprising to me (as a layman) that a show with lots of media that would've made it to the training corpus didn't show up.
0_____0 5 hours ago
that's better than i thought it would be
zhyder 6 hours ago
Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.
Apart from that, the usual predictable gains in coding. Still is a great sweet-spot for performance, speed and cost. Need to hack Claude Code to use their agentic logic+prompts but use Gemini models.
I wish Google also updated Flash-lite to 3.0+, would like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.
WarmWash 6 hours ago
>I wish Google also updated Flash-lite to 3.0+
I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s
zhyder 5 hours ago
Agree, can't wait for updates to the diffusion model.
Could be useful for planning too, given its tendency to think big picture first. Even if it's just an additional subagent to double-check with an "off the top off your head" or "don't think, share first thought" type of question. More generally would like to see how sequencing autoregressive thinking with diffusion over multiple steps might help with better overall thinking.
siliconc0w 25 minutes ago
Google has a hugely valuable dataset of changes from decades of changes from top tier software engineers but it's so proprietary they can't use it to train their external models.
qingcharles 6 hours ago
I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).
It's only February...
nubg 5 hours ago
> I've been playing with the 3.1 Deep Think version of this
How?
verdverm 4 hours ago
A select few have had early access through various programs Google offers. I believe there was a sentence or two to this effect on the Gemini 3 Deep Think post from Deepmind.
vnglst 3 hours ago
I asked Gemini 3.1 Pro to generate some of the modern artworks in my "Pelican Art Gallery". I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern
janalsncm 5 hours ago
This model says it accepts video inputs. I asked it to transcribe a 5 second video of a digital water curtain which spelled “Boo Happy Halloween”, and it came back with “Happy” which wasn’t the first frame, but also is incomplete.
This kind of test is good because it requires stitching together info from the whole video.
aabhay 5 hours ago
It reads videos at 1fps by default. You have to set the video resolution to high in ai studio
janalsncm 41 minutes ago
This is inside the Gemini app.
timabdulla 5 hours ago
Google tends to trumpet preview models that aren't actually production-grade. For instance, both 3 Pro and Flash suffer from looping and tool-calling issues.
I would love for them to eliminate these issues because just touting benchmark scores isn't enough.
agentifysh 3 hours ago
My enthusiasm is a bit muted this cycle because I've been burned by Gemini CLI. These models are very capable but Gemini CLI just doesn't seem to be able to work for one it never follows instructions strictly like its competitors do, and it hallucinates even which is a rarity.
More importantly feels like Google is stretched thin across different Gemini products and pricing reflects this, I still have no idea how to pay for Gemini CLI, in codex/claude its very simple $20/month for entry and $200/month for ton of weekly usage.
I hope whoever is reading this from Google they can redeem Gemini CLI by focusing on being competitive instead of making it look pretty (that seems to be the impression I got from the updates on X)
cheema33 36 minutes ago
> I still have no idea how to pay for Gemini CLI, in codex/claude its very simple $20/month for entry and $200/month for ton of weekly usage.
This!
I would like to sign up for a paid plan for Gemini CLI. But I have not been able to figure out how. I already have Codex and Claude plans. Those were super easy to sign up for.
ismailmaj an hour ago
3.1 feels to me like 3.0 but that takes a long time to think, it didn't feel like a leap in raw intelligence like 2.5 pro was.
vnglst 2 hours ago
I asked Gemini 3.1 Pro Preview to generate the modern artworks as SVG for my Pelican Art Gallery. I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern
ArmandoAP 5 hours ago
dxbednarczyk 6 hours ago
Every time I've used Gemini models for anything besides code or agentic work they lean so far into the RLHF induced bold lettering and bullet point list barf that everything they output reads as if the model was talking _at_ me and not _with_ me. In my Openclaw experiment(s) and in the Gemini web UI, I've specifically added instructions to avoid this type of behavior, but it only seemed to obey those rules when I reminded the model of them.
For conversational contexts, I don't think the (in some cases significantly) better benchmark results compared to a model like Sonnet 4.6 can convince me to switch to Gemini 3.1. Has anyone else had a similar experience, or is this just a me issue?
augusto-moura 6 hours ago
Gemini sounds less personal, but I think that is good. From my experience, the quality of response is much higher than ChatGPT or Grok, and it cites real sources. I want to have a mini-wikipedia response for my questions, not a friend's group chat response
gavinray 5 hours ago
I have the opposite viewpoint:
If a model doesn't optimize the formatting of its output display for readability, I don't want to read it.
Tables, embedded images, use of bulleted lists and bold/italicizing etc.
staticman2 5 hours ago
I'm not familiar with Openclaw and but the trick to solve this would be to embed a style reminder at the bottom of each user message and ideally hide that from the user with the UI.
This is how roleplay apps like Sillytavern customize the experience for power users by allowing hidden style reminders as part of the user message that accompany each chat message.
InkCanon 6 hours ago
I think they all output that bold lettering, point by point style output. I strongly suspect it's part of a synthetic data pipeline all these AI companies have, and it improves performance. Claude seems to be the least of them, but it will start writing code at the drop of a hat. What annoys me in Gemini is that it has a really strange tendency to come up with weird analogies, especially in Pro mode. You'll be asking it about something like red black trees and it'll say "Red Black Trees (The F1 of Tree Data Structures)".
hydrolox 4 hours ago
Yes, the analogy habit is the most annoying of all. Overall formatting for me is doable, if it didn't divide up an answer into these silly arbitrary categories with useless analogies. I've tried adding in my user preferences to never use analogies but it inevitably falls back into that habit.
losvedir 4 hours ago
It definitely has the worst "voice" in my opinion. Feels very overachieving McKinsey intern to me.
markab21 6 hours ago
You just articulated why I struggle to personally connect with Gemini. It feels so unrelatable and exhausting to read its output. I prefer to read Opus/Deepseek/GLM over Gemini, Qwen and the open source GPT models. Maybe it is RLHF that is creating my distaste from using it. (I pay for Gemini; I should be using it more... but the outputs just bug me and feel more work to get actionable insight.)
verdverm 5 hours ago
I have no issues adjusting gemini tone & style with system prompt content
mbh159 2 hours ago
77.1% on ARC-AGI-2 and still can't stop adding drive-by refactors. ARC-AGI-2 tests novel pattern induction, it's genuinely hard to fake and the improvement is real. But it doesn't measure task scoping, instruction adherence, or knowing when to stop. Those are the capabilities practitioners actually need from a coding agent. We have excellent benchmarks for reasoning. We have almost nothing that measures reliability in agentic loops. That gap explains this thread.
veselin 4 hours ago
I am actually going to complain about this: that neither of the Gemini models are not preview ones.
Anthropic seems the best in this. Everything is in the API on day one. OpenAI tend to want to ask you for subscription, but the API gets there a week or a few later. Now, Gemini 3 is not for production use and this is already the previous iteration. So, does Google even intent to release this model?
WarmWash 6 hours ago
It seems google is having a disjointed roll out, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in vertex at 2am or something equally odd.
Either way early user tests look promising.
XCSme 4 hours ago
Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...
XCSme 3 hours ago
Now I need to write more tests.
It's a bit hard to trick reasoning models, because they explore a lot of the angles of a problem, and they might accidentally have an "a-ha" moment that leads them on the right path. It's a bit like doing random sampling and stumbling upon the right result after doing gradient descent from those points.
thevinter 2 hours ago
Are you intentionally keeping the benchmarks private?
XCSme 2 hours ago
Yes.
I am trying to think what's the best way to give most information about how the AI models fail, without revealing information that can help them overfit on those specific tests.
I am planning to add some extra LLM calls, to summarize the failure reason, without revealing the test.
XCSme 2 hours ago
Added one more test, which surprisingly gemini flash 3 reasoning passes, but gemini 3.1 pro not
pawelduda 5 hours ago
It's safe to assume they'll be releasing improved Gemini Flash soon? The current one is so good & fast I rarely switch to pro anymore
derac 5 hours ago
When 3 came out they mentioned that flash included many improvements that didn't make it into pro (via an hn comment). I imagine this release includes those.
tucnak 4 hours ago
Gemini 3 Pro (high) is a joke compared to Gemini 3 Flash in Antigravity, except it's not even funny. Flash is insane value, and super capable, too. I've had it implement a decompiler for very obscure bytecode, and it was passing all tests in no time. PITA to refactor later, but not insurmountable. Gemini 3 Pro (high) choked on this problem in the early stages... I'm looking forward to comparing 3.1 Pro vs 3.0 Flash, hopefully they have improved on it enough to finally switch over.
upmind 4 hours ago
In my experience, while Gemini does really well in benchmarks I find it much worse when I actually use the model. It's too verbose / doesn't follow instructions very well. Let's see if that changes with this model.
fdefitte 3 hours ago
The benchmark jumps are impressive but the real question is whether Gemini can stop being so aggressively helpful. Every time I use it for coding it refactors stuff I didn't ask it to touch. Claude has the opposite problem where it sometimes does too little. Feels like nobody has nailed the "do exactly what I asked, nothing more" sweet spot yet.
0xcb0 3 hours ago
I'm trying to find the information, is this available on the Gemini CLI script, or is this just the web front-end where I can use this new model?
getcrunk an hour ago
Gemini is so stubborn, and often doesn’t follow explicit and simple instructions. So annoying
josalhor 6 hours ago
I speculated that 3 pro was 3.1... I guess I was wrong. Super impressive numbers here. Good job Google.
refulgentis 6 hours ago
> I speculated that 3 pro was 3.1
?
josalhor 5 hours ago
Sorry... I speculated that 3 deep think is 3.1 pro.. model names are confusing..
markerbrod 5 hours ago
Murfalo 5 hours ago
I like to think that all these pelican riding a bicycle comments are unwittingly iteratively creating the optimal cyclist pelican as these comment threads are inevitably incorporated in every training set.
alpineman 5 hours ago
More like half of Google's AI team is hanging out on HN, and they can optimise for that outcome to get a good rep among the dev community.
kridsdale3 3 hours ago
Hello.
(I'm not aware of anyone doing this, but GDM is quite info-siloed these days, so my lack of knowledge is not evidence it's not happening)
alpineman 3 hours ago
Barbing 4 hours ago
See: fish in bike front basket
mrcwinn 34 minutes ago
It's fascinating to watch this community react to positively to Google model releases and so negatively toward OpenAI's. You all do understand that an ad revenue model is exactly where Google will go, right?
impulser_ 5 hours ago
Seems like they actually fixed some of the problems with the model. Hallucinations rate seems to be much better. Seems like they also tuned the reasoning maybe that were they got most of the improvements from.
whynotminot 5 hours ago
The hallucination rate with the Gemini family has always been my problem with them. Over the last year they’ve made a lot of progress catching the Gemini models up to/near the frontier in general capability and intelligence, but they still felt very late 2024 in terms of hallucination rate.
Which made the Gemini models untrustworthy for anything remotely serious, at least in my eyes. If they’ve fixed this or at least significantly improved, that would be a big deal.
SubiculumCode 4 hours ago
Maybe I haven't kept up with how ghatgpt and claude are doing , but 6 monthlatelys ago or so, I thought Gemini was leading on that front.
panarchy 4 hours ago
I had it make a simple HTML/JS canvas game (think flappy bird) and while it did some things mildly better (and others noticeably worse) it still fell into the exact same traps as earlier models. It also had a lot of issues generating valid JS at parts and asking it what the code should be just made it endlessly generate the same exact incorrect code.
hn_throw2025 27 minutes ago
Yeah great, now can I have my pinned chats back please?
https://www.google.com/appsstatus/dashboard/incidents/nK23Zs...
vinhnx 6 hours ago
solarisos 4 hours ago
The speed of these 3.1 and Preview releases is starting to feel like the early days of web frameworks. It’s becoming less about the raw benchmarks and more about which model handles long-context 'hallucination' well enough to be actually used in a production pipeline without constant babysitting.
atleastoptimal 2 hours ago
Writing style wise, 3.1 seems very verbose, but somehow less creative compared to 3.
robviren 3 hours ago
I have run into a surprising number of basic syntax errors on this one. At least in the few runs I have tried it's a swing and a miss. Wonder if the pressure of the Claude release is pushing these stop gap releases.
pRusya 4 hours ago
I'm using gemini.google.com/app with AI Pro subscription. "Something went wrong" in FF, works in Chrome.
Below is one of my test prompts that previous Gemini models were failing. 3.1 Pro did a decent job this time.
> use c++, sdl3. use SDL_AppInit, SDL_AppEvent, SDL_AppIterate callback functions. use SDL_main instead of the default main function. make a basic hello world app.
jeffybefffy519 3 hours ago
Someone needs to make an actual good benchmark for LLM's that matches real world expectations, theres more to benchmarks than accuracy against a dataset.
robotpepi 3 hours ago
this reminds me of that joke of someone saying "it's crazy that we have ten different standards for doing this", and then there're 11 standards
knollimar an hour ago
Xkcd 927
casey2 2 hours ago
We don't need real world benchmarks, if they were good for real world tasks people would use them We need scientific benchmarks that tease out the nature of intelligence. There are plenty of unsaturated benchmarks. Solving chess using "mostly" language modeling is still an open problem. And beyond that creating a machine that can explain why that move is likely optimal at some depth. AI that can predict the output of another AI.
onlyrealcuzzo 5 hours ago
We've gone from yearly releases to quarterly releases.
If the pace of releases continues to accelerate - by mid 2027 or 2028 we're headed to weekly releases.
rubicon33 5 hours ago
But actual progress seems to be slower. These modes are releasing more often but aren’t big leaps.
gallerdude 5 hours ago
We used to get one annual release which was 2x as good, now we get quarterly releases which are 25% better. So annually, we’re now at 2.4x better.
minimaxir 5 hours ago
Due to the increasing difficulty of scaling up training, it appears the gains are instead being achieved through better model training which appears to be working well for everyone.
wahnfrieden 5 hours ago
GPT 5.3 (/Codex) was a huge leap over 5.2 for coding
rubicon33 11 minutes ago
azuanrb 5 hours ago
The CLI needs work, or they should officially allow third-party harnesses. Right now, the CLI experience is noticeably behind other SOTA models. It actually works much better when paired with Opencode.
But with accounts reportedly being banned over ToS issues, similar to Claude Code, it feels risky to rely on it in a serious workflow.
zokier 5 hours ago
> Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.
So this is same but not same as Gemini 3 Deep Think? Keeping track of these different releases is getting pretty ridiculous.
WarmWash 5 hours ago
Deep Think is a few 3.1 models working together. It was suspected last week that Deep Think was composed using the new 3.1 model.
verdverm 5 hours ago
3.1 == model
deep think == turning up thinking knob (I think)
deep research == agent w/ search
syspec 4 hours ago
Does anyone know if this is in GA immediately or if it is in preview?
On our end, Gemini 3.0 Preview was very flakey (not model quality, but as in the API responses sometimes errored out), making it unreliable.
Does this mean that 3.0 is now GA at least?
mark_l_watson 6 hours ago
Fine, I guess. The only commercial API I use to any great extent is gemini-3-flash-preview: cheap, fast, great for tool use and with agentic libraries. The 3.1-pro-preview is great, I suppose, for people who need it.
Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.
throwaway2027 6 hours ago
Seconded. Gemini used to be trash and I used Claude and Codex a lot but gemini-3-flash-preview punches above it's weight, it's decent and I rarely if ever run into any token limit either.
verdverm 4 hours ago
Thirded, I've been using gemini-3-flash to great effect. Anytime I have something more complicated, I give it to pro & flash to see what happens. Coin flip if flash is nearly equivalent (too many moving vars to be analytical at this point)
PlatoIsADisease 6 hours ago
What models are you running locally? Just curious.
I am mostly restricted to 7-9B. I still like ancient early llama because its pretty unrestricted without having to use an abliteration.
mark_l_watson 4 hours ago
I experimented with many models on my 16G and 32G Macs. For less memory, qwen3:4b is good, for the 32B Mac, gpt-oss:20b is good. I like the smaller Mistral models like mistral:v0.3 and rnj-1:latest is a pretty good small reasoning model.
nurettin 6 hours ago
I like to ask claude how to prompt smaller models for the given task. With one prompt it was able to make a low quantized model call multiple functions via json.
mixel 6 hours ago
Google seems to really pull ahead in this AI race. For me personally they offer the best deal and although the software is not quiet there compared to openai or anthropic (in regards to 1. web GUI, 2. agent-cli). I hope they can fix that in the future and I think once Gemini 4 or whatever launches we will see a huge leap again
rubslopes 5 hours ago
I don't understand this sentiment. It may hold true for other LLM use cases (image generation, creative writing, summarizing large texts), but when it comes to coding specifically, Google is *always* behind OpenAI and Anthropic, despite having virtually infinite processing power, money, and being the ones who started this race in the first place.
Until now, I've only ever used Gemini for coding tests. As long as I have access to GPT models or Sonnet/Opus, I never want to use Gemini. Hell, I even prefer Kimi 2.5 over it. I tried it again last week (Gemini Pro 3.0) and, right at the start of the conversation, it made the same mistake it's been making for years: it said "let me just run this command," and then did nothing.
My sentiment is actually the opposite of yours: how is Google *not* winning this race?
hobofan 5 hours ago
> despite having virtually infinite processing power, money
Just because they have the money doesn't mean that they spend it excessively. OpenAI and Anthropic are both offering coding plans that are possibly severely subsidized, as they are more concerned with growth at all cost, while Google is more concerned with profitability. Google has the bigger warchest and could just wait until the other two run out of money rather than forcing the growth on that product line in unprofitable means.
Maybe they are also running much closer to their compute limits then the other ones too and their TPUs are already saturated with API usage.
jeanloolz 3 hours ago
mike97 3 hours ago
eknkc 5 hours ago
I hope they fail.
I honestly do not wish Google to have the best model out there and be forced to use their incomprehensible subscription / billing / project management whatever shit ever again.
I don’t know what their stuff cost. I don’t know why would I use vertex or ai studio. What is included in my subscription what is billed per use.
I pray that whatever they build fails and burns.
toraway 4 hours ago
For a personal plan to use premium Gemini AI features or for agentic development with Gemini CLI/Antigravity the billing is no more or less complicated then Claude Code or Codex CLI.
You pay for the $20/mo Google AI Pro plan with a credit card via the normal personal billing flow like you would for a Google One plan without any involvement of Google Cloud billing or AI Studio. Authorize in the client with your account and you're good to go.
(With the bundled drive storage on AI Pro I'm just paying a few bucks more than I was before so for me it's my least expensive AI subscription excluding the Z.ai ultra cheap plan).
Or, just like with Anthropic or OpenAI, it's a separate process for billing/credits for an API key targeted at a developer audience. Which I don't need or use for Gemini CLI or Antigravity at all, it's a one step "click link to authorize with your Google Account" and done.
You could decide to use an API key for usage based billing instead (just like you could with Claude Code) but that's entirely unnecessary with a subscription.
Sure, for the API anything involving a hyperscalar cloud is going to have a higher complexity floor with legacy cruft here and there, but for individual subscriptions that's irrelevant and it's pretty much as straightforward of a click and pay flow you'd find anywhere else.
otherme123 5 hours ago
They all suck. OpenAI ignores scanning limits and disabled routes in robots.txt, after a 429 "Too Many Requests" they retry the same url half a dozen of times from different IPs in the next couple of minutes, and they once DoS'ed my small VPS trying to do a full scan of sitemaps.xml in less than one hour, trying and retrying if any endpoint failed.
Google and others at least respects both robots.txt and 429s. They invested years scanning all the internet, so they can now train on what they have stored in their server. OpenAI seems to assume that MY resources are theirs.
dybber 5 hours ago
Eventually the models will be generally be so good that the competition moves from the best model to the best user experience and here I think we can expect others will win, e.g. Microsoft with GitHub and VS Code
eknkc 5 hours ago
hsaliak 6 hours ago
The eventual nerfing gives me pause. Flash is awesome. What we really want is gemini-3.1-flash :)
clhodapp 6 hours ago
There's a very short blog post up: https://blog.google/innovation-and-ai/models-and-research/ge...
makeavish 6 hours ago
Great model until it gets nerfed. I wish they had a higher paid tier to use non nerfed model.
Mond_ 5 hours ago
Bad news, John Google told me they already quantized it immediately after the benchmarks were done and it sucks now.
I miss when Gemini 3.1 was good. :(
spyckie2 6 hours ago
I think there is a pattern it will always be nerfed the few weeks before launching a new model. Probably because they are throwing a bunch of compute at the new model.
makeavish 5 hours ago
Yeah maybe that but atleast let us know about this Or have dynamic limits? Nerfing breaks trust. Though I am not sure if they actually nerf it intentionally. Haven't heard from any credible source. I did experience in my workflow though.
xnx 6 hours ago
What are you talking about?
quacky_batak 6 hours ago
I’m keen to know how and where are you using Gemini.
Anthropic is clearly targeted to developers and OpenAI is general go to AI model. Who are the target demographic for Gemini models? ik that they are good and Flash is super impressive. but i’m curious
jdc0589 6 hours ago
I use it as my main platform right now both for work/swe stuff, and person stuff. It works pretty well, they have the full suite of tools I want from general LLM chat, to notebookLM, to antigravity.
My main use-cases outside of SWE generally involve the ability to compare detailed product specs and come up with answers/comparisons/etc... Gemini does really well for that, probably because of the deeper google search index integration.
Also I got a year of pro for free with my phone....so thats a big part.
ggregoire 4 hours ago
I use it in Google Search. For example yesterday I typed in Google "postgres generate series 24 hour" and this morning "ffmpeg convert mp4 to wav". Previously I would have clicked on the first StackOverflow result (RIP), now I just take it from the Gemini summary (I'd say 95% of the time it's correct for basic programming language questions. I remember some hallucinations about psycopg3 and date-fns tho. As usual with AI, you need to already know the answer, at least partially, to detect the bs).
Also what's great about Gemini in Google Search is that the answer comes with several links, I use them sometimes to validate the correctness of the solution, or check how old the solution is (I've never used chatGPT so I don't know if chatGPT does it).
hunta2097 6 hours ago
I use the Gemini web interface just as I would ChatGPT. They also have coding environment analogues of Claude-Code in Anti-gravity and Gemini-CLI.
When you sign up for the pro tier you also get 2TB of storage, Gemini for workspace and Nest Camera history.
If you're in the Google sphere it offers good value for money.
dinosor 6 hours ago
I find gemini to be the best at travel planning and for story telling of geographical places. For a road trip, I tried all three mainstream providers and I liked Gemini (also personal preference because Gemini took a verbose approach instead of bullet points from others) for it's responses, ways it discovered stories about places I wanted to explore, places it suggested for me and things it gave me to consider those places in the route.
minimaxir 6 hours ago
Gemini has an obvious edge over its competitors in one specific area: Google Search. The other LLMs do have a Web Search tool but none of them are as effective.
fatherwavelet 5 hours ago
I feel like Gemini 3 was incredible on non-software/coding research. I have learned so much systems biology the last two months it blows my mind.
I had only started using Opus 4.6 this week. Sonnet it seems like is much better at having a long conversation with. Gemini is good for knowledge retrieval but I think Opus 4.6 has caught up. The biggest thing that made Gemini worth it for me the last 3 months is I crushed it with questions. I wouldn't have even got 10% of the Opus use that I got from Gemini before being made to slow down.
I have a deep research going right now on 3.1 for the first time and I honestly have no idea how I am going to tell if it is better than 3.
It seems like agentic coding Gemini wasn't as good but just asking it to write a function, I think it only didn't one shot what I asked it twice. Then fixed the problem on the next prompt.
I haven't logged in to bother with chatGPT in about 3 months now.
mehagar 5 hours ago
I use Gemini for personal stuff such as travel planning and research on how to fix something, which product to buy, etc. My company has as Pro subscription so I use that instead of ChatGPT.
jug 6 hours ago
I personally use it as my general purpose and coding model. It's good enough for my coding tasks most of the time, has very good and rapid web search grounding that makes the Google index almost feel like part of its training set, and Google has a family sharing plan with individual quotas for Google AI Pro at $20/month for 5 users which also includes 2 TB in the cloud. Family sharing is a unique feature for Gemini 3 Flash Thinking (300 prompts per day and user) & Pro (100 prompts per day and user).
dekhn 5 hours ago
I am a professional software developer who has been programming for 40 years (C, C++, Python, assembly, any number of other languages). I work in ML (infrastructure, not research) and spent a decade working at Google.
In short, I consider Gemini to be a highly capable intern (grad student level) who is smarter and more tenacious than me, but also needs significant guidance to reach a useful goal.
I used Gemini to completely replace the software stack I wrote for my self-built microscope. That includes:
writing a brand new ESP32 console application for controlling all the pins of my ESP32 that drives the LED illuminator. It wrote the entire ESP-IDF project and did not make any major errors. I had to guide with updated prompts a few times but otherwise it wrote the entire project from scratch and ran all the build commands, fixing errors along the way. It also easily made a Python shared library so I can just import this object in my Python code. It saved me ~2-3 days of working through all the ESP-IDF details, and did a better job than I would have.
writing a brand new C++-based Qt camera interface (I have a camera with a special SDK that allows controlling strobe and trigger and other details. It can do 500FPS). It handled all the concurrency and message passing details. I just gave it the SDK PDF documentation for the camera (in mixed english/chinese), and asked it to generate an entire project. I had to spend some time guiding it around making shared libraries but otherwise it wrote the entire project from scratch and I was able to use it to make a GUI to control the camera settings with no additional effort. It ran all the build commands and fixed errors along the way. Saved me another 2-3 days and did a better job than I could have.
Finally, I had it rewrite the entire microscope stack (python with qt) using the two drivers I described above- along with complex functionality like compositing multiple images during scanning, video recording during scanning, mesaurement tools, computer vision support, and a number of other features. This involved a lot more testing on my part, and updating prompts to guide it towards my intended destination (fully functional replacement of my original self-written prototype). When I inspect the code, it definitely did a good job on some parts, while it came up with non-ideal solutions for some problems (for example, it does polling when it could use event-driven callbacks). This saved literally weeks worth of work that would have been a very tedious slog.
From my perspective, it's worked extremely well: doing what I wanted in less time than it would take me (I am a bit of a slow programmer, and I'm doing this in hobby time) and doing a better job (With appropriate guidance) than I could have (even if I'd had a lot of time to work on it). This greatly enhances my enjoyment of my hobby by doing tedious work, allowing me to spend more time on the interesting problems (tracking tardigrades across a petri dish for hours at a time). I used gemini pro 3 for this- it seems to do better than 2.5, and flash seemed to get stuck and loop more quickly.
I have only lightly used other tools, such as ChatGPT/Codex and have never used Claude. I tend to stick to the Google ecosystem for several reasons- but mainly, I think they will end up exceeding the capabilities of their competitors, due to their inherent engineering talent and huge computational resources. But they clearly need to catch up in a lot of areas- for example, the VS Code Gemini extension has serious problems (frequent API call errors, messed up formatting of code/text, infinite loops, etc).
aberoham 4 hours ago
Wow, you have to try claude code with Opus-4.6..
dekhn 4 hours ago
thornewolf 4 hours ago
I have swapped to using gemini over chatgpt for casual conversation and question answering. there are some lacking features in the app but i get faster and more intelligent responses.
epolanski 6 hours ago
Various friends of mine work in non-technology companies (banking, industries, legal, Italy) and in pretty much all of them there's Gemini enterprise + NotebookLM.
In all of them the approach is: this is the solution, now find problems you can apply it to.
esafak 6 hours ago
I'd use it for planning, knowledge, and anything visual.
verdverm 5 hours ago
I use gemini for everything because I trust google to keep the data I send them safe, because they know how to run prod at scale, and they are more environmentally friendly than everyone else (tpu,us-central1).
This includes my custom agent / copilot / cowork (which uses vertex ai and all models therein). This is where I do more searching now (with genAi grounding) I'm about to work on several micro projects that will hold Ai a little differently.
All that being said, google Ai products suck hard. I hate using every one of them. This is more a reflection on the continued degradation of PM/Design at Big G, from before Ai, but accellationally worse since. I support removing Logan from the head of this shit show
disclaimer: long time g-stan, not so stan any more
denysvitali 6 hours ago
Where is Simon's pelican?
Mashimo 9 minutes ago
It's also quite impressive with SVG animations.
> Create an SVG animation of a Beaver sitting next to a recordplayer and a create of records, his eyes follows the mouse curser.
codethief 5 hours ago
Not Simon's but here is one: https://news.ycombinator.com/item?id=47075709
denysvitali 5 hours ago
Thank you!
saberience 6 hours ago
Please no, let's not.
jdthedisciple an hour ago
Why should I be excited?
__jl__ 6 hours ago
Another preview release. Does that mean the recommended model by Google for production is 2.5 Flash and Pro? Not talking about what people are actually doing but the google recommendation. Kind of crazy if that is the case
seizethecheese 5 hours ago
I use Gemini flash lite in a side project, and it’s stuck on 2.5. It’s now well behind schedule. Any speculation as to what’s going on?
foruhar 5 hours ago
Gemini-3.0-flash-preview came out right away with the 3.0 release and I was expecting 3.0-flash-lite before a bump on the pro model. I wonder if they have abandoned that part of the Pareto/price-performance.
yuvalmer 4 hours ago
Gemini 3.0 Pro is bad model for its class. I really hope 3.1 is a leap forward.
eric15342335 6 hours ago
My first impression is that the model sounds slightly more human and a little more praising. Still comparing the ability.
kuprel 3 hours ago
Why don't they show Grok benchmarks?
andxor an hour ago
They've fallen way behind.
kuprel an hour ago
GPT 5.2 loses at everything but they included that
matrix2596 6 hours ago
Gemini 3.1 Pro is based on Gemini 3 Pro
skerit 5 hours ago
Lol, and this line:
> Geminin 3.1 Pro can comprehend vast datasets
Someone was in a hurry to get this out the door.
1024core 5 hours ago
It's been hugged to death. I keep getting "Something went wrong".
msavara 6 hours ago
Somehow doesn't work for me :) "An internal error has occurred"
trilogic 4 hours ago
Humanity last exam 44%, Scicode 59, and that 80, and this 78 but not 100% ever.
Would be nice to see that this models, Plus, Pro, Super, God mode can do 1 Bench 100%. I am missing smth here?
PunchTornado 6 hours ago
The biggest increase is LiveCodeBench Pro: 2887. The rest are in line with Opus 4.6 or slightly better or slightly worse.
shmoogy 6 hours ago
but is it still terrible at tool calls in actual agentic flows?
naiv 6 hours ago
ok , so they are scared that 5.3 (pro) will be released today/tomorrow and blow it out of the water and rushed it while they could still reference 5.2 benchmarks.
PunchTornado 5 hours ago
I don't think models blow other models anymore. We have the big 3 which are neck to neck in most benchmarks and the rest. I doubt that 5.3 will blow the others.
scld 5 hours ago
easy now
Topfi 6 hours ago
Appears the only difference to 3.0 Pro Preview is Medium reasoning. Model naming has long gone from even trying to make sense, but considering 3.0 is still in preview itself, increasing the number for such a minor change is not a move in the right direction.
GrayShade 6 hours ago
Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?
xnx 6 hours ago
> increasing the number for such a minor change is not a move in the right direction
A .1 model number increase seems reasonable for more than doubling ARC-AGI 2 score and increasing so many other benchmarks.
What would you have named it?
Topfi 4 hours ago
My issue is that we haven't even gotten the release version of 3.0, that is also still in Preview, so may stick with 3.0 till that has been deemed stable.
Basically, what does the word "Preview" mean, if newer releases happen before a Preview model is stable? In prior Google models, Preview meant that there'd still be updates and improvements to said model prior to full deployment, something we saw with 2.5. Now, there is no meaning or reason for this designation to exist if they forgo a 3.0 still in Preview for model improvements.
xnx 4 hours ago
argsnd 6 hours ago
I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.
jannyfer 6 hours ago
According to the blog post, it should be also great at drawing pelicans riding a bicycle.
LZ_Khan 5 hours ago
biggest problem is that it's slow. also safety seems overtuned at the moment. getting some really silly refusals. everything else is pretty good.
makeavish 6 hours ago
I hope to have great next two weeks before it gets nerfed.
unsupp0rted 6 hours ago
I've found Google (at least in AI Studio) are the only provider NOT to nerf their models after a few weeks
makeavish 5 hours ago
I don't use AI studio for my work. I used Antigravity/Gemini CLI and 3 pro was great for few weeks and now it's worse than 3 flash or any smaller model from competitor which are rated lower on benchmarks
scrlk 5 hours ago
IME, they definitely nerf models. gemini-2.5-pro-exp-03-25 through AI Studio was amazing at release and steadily degraded. The quality started tanking around the time they hid CoT.
mustaphah 6 hours ago
Google is terrible at marketing, but this feels like a big step forward.
As per the announcement, Gemini 3.1 Pro score 68.5% on Terminal-Bench 2.0, which makes it the top performer on the Terminus 2 harness [1]. That harness is a "neutral agent scaffold," built by researchers at Terminal-Bench to compare different LLMs in the same standardized setup (same tools, prompts, etc.).
It's also taken top model place on both the Intelligence Index & Coding Index of Artificial Analysis [2], but on their Agentic Index, it's still lagging behind Opus 4.6, GLM-5, Sonnet 4.6, and GPT-5.2.
---
[1] https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...
saberience 5 hours ago
Benchmarks aren't everything.
Gemini consistently has the best benchmarks but the worst actual real-world results.
Every time they announce the best benchmarks I try again at using their tools and products and each time I immediately go back to Claude and Codex models because Google is just so terrible at building actual products.
They are good at research and benchmaxxing, but the day to day usage of the products and tools is horrible.
Try using Google Antigravity and you will not make it an hour before switching back to Codex or Claude Code, it's so incredibly shitty.
mustaphah 5 hours ago
That's been my experience too; can't disagree. Still, when it comes to tasks that require deep intelligence (esp. mathematical reasoning [1]), Gemini has consistently been the best.
gregorygoc 5 hours ago
What’s so shitty about it?
BMFXX 3 hours ago
Just wish iI could get 2.5 daily limit above 1000 requests easily. Driving me insane...
ChrisArchitect 4 hours ago
More discussion: https://news.ycombinator.com/item?id=47075318
lysecret 4 hours ago
Please I need 3 in ga…
nautilus12 5 hours ago
Ok, why don't you work on getting 3.0 out of preview first? 10 min response time is pretty heinous
mucai82 5 hours ago
I agree, according to Googles terms you are not allowed to use the preview model for production use cases. And 3.0 has been in preview for a loooong time now :(
jeffbee 5 hours ago
Relatedly, Gemini chat seems to be if not down then extremely slow.
ETA: They apparently wiped out everyone's chats (including mine). "Our engineering team has identified a background process that was causing the missing user conversation metadata and has successfully stopped the process to prevent further impact." El Mao.
sergiotapia 5 hours ago
To use in OpenCode, you can update the models it has:
opencode models --refresh
Then /models and choose Gemini 3.1 ProYou can use the model through OpenCode Zen right away and avoid that Google UI craziness.
---
It is quite pricey! Good speed and nailed all my tasks so far. For example:
@app-api/app/controllers/api/availability_controller.rb
@.claude/skills/healthie/SKILL.md
Find Alex's id, and add him to the block list, leave a comment
that he has churned and left the company. we can't disable him
properly on the Healthie EMR for now so
this dumb block will be added as a quick fix.
Result was: 29,392 tokens
$0.27 spent
So relatively small task, hitting an API, using one of my skills, but a quarter. Pricey!gbalduzzi 5 hours ago
I don't see it even after refresh. Are you using the opencode-gemini-auth plugin as well?
sergiotapia 5 hours ago
No I am not just vanilla OpenCode. I do have OpenCode Zen credits, and I did opencode login whatever their command is to auth against opencode itself. Maybe that's the reason I see these premium models.
cmrdporcupine 6 hours ago
Doesn't show as available in gemini CLI for me. I have one of those "AI Pro" packages, but don't see it. Typical for Google, completely unclear how to actually use their stuff.
dude250711 6 hours ago
I hereby allow you to release models not at the same time as your competitors.
sigmar 6 hours ago
It is super interesting that this is the same thing that happened in November (ie all labs shipping around the same week 11/12-11/23).
zozbot234 5 hours ago
They're just throwing a big Chinese New Year celebration.
himata4113 4 hours ago
The visual capabilities of this model are frankly kind of ridicioulus what the hell.
johnwheeler 5 hours ago
I know Google has anti-gravity but do they have anything like Claude code as far as user interface terminal basically TUI?
leecommamichael 4 hours ago
Whoa, I think Gemini 3 Pro was a disappointment, but Gemini 3.1 Pro is definitely the future!
ChrisArchitect 6 hours ago
saberience 6 hours ago
I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...
I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.
I.e. they are benchmaxing
Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.
rocho 2 hours ago
I find Gemini is outstanding at reasoning (all topics) and architecture (software/system design). On the other hand, Gemini CLI sucks and so I end up using Claude Code and Codex CLI for agentic work.
However, I heavily use Gemini in my daily work and I think it has its own place. Ultimately, I don't see the point of choosing the one "best" model for everything, but I'd rather use what's best for any given task.
konart 5 hours ago
> but without assessing whether the models are actually improving in practical use-cases
Which cases? Not trying to sound bad but you didn't even provide of cases you are using Claude\Codex\Gemini for.
skerit 6 hours ago
I'm glad someone else is finally saying this, I've been mentioning this left and right and sometimes I feel like I'm going crazy that not more people are noticing it.
Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.
For the past few weeks, I've also been using XML-like tags in my prompts more often. Sometimes preferring to share previous conversations with `<user>` and `<assistant>` tags. Opus/Sonnet handles this just fine, but Gemini has a mental breakdown. It'll just start talking to itself.
Even in totally out-of-the-ordinary sessions, it goes crazy. After a while, it'll start saying it's going to do something, and then it pretends like it's done that thing, all in the same turn. A turn that never ends. Eventually it just starts spouting repetitive nonsense.
And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200.000 token mark.
reilly3000 4 hours ago
Flash is (was?) was better than Pro on these fronts.
user34283 5 hours ago
I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.
For development I tend to use Antigravity with Sonnet 4.5, or Gemini Flash if it's about a GUI change in React. The layout and design of Gemini has been superior to Claude models in my opinion, at least at the time. Flash also works significantly faster.
And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I did not yet give it a try.
cmrdporcupine 5 hours ago
Honestly doesn't feel like Google is targeting the agentic coding crowd so much as they are the knowledge worker / researcher / search-engine-replacement market?
Agree Gemini as a model is fairly incompetent inside their own CLI tool as well as in opencode. But I find it useful as a research and document analysis tool.
verdverm 3 hours ago
For my custom agentic coding setup, I use Claude Code derived prompts with Gemini models, primarily flash. It's night and day compared to Google's own agentic products, which are all really bad.
The models are all close enough on the benchmarks and I think people are attributing too much difference in the agentic space to the model itself. I strongly believe the difference is in all the other stuff, which is why Antropic is far ahead of the competition. They have done great work with Claude Code, Cowork, and their knowledge share through docs & blog, bar none on this last point imo.
throwaw12 5 hours ago
Can we switch from Claude Code to Google yet?
Benchmarks are saying: just try
But real world could be different
foruhar 4 hours ago
My sense is that the Gemini models are very capable but the Gemini CLI experience is subpar compared to Claude Code and Codex. I'm guess that it's the harness but since it can get confused, fall into doom loops, and generally lose the plot in a way that the model does not in Gemini Studio or the Gemini app.
I think a bunch of these harnesses are open source so it surprises me that there can be such a gulf between them.
dana321 an hour ago
Its not just subpar, its not even sub-sub-par.
It goes into loops and never completes a task 8 times out of 10 that i've used it.
cmrdporcupine 3 hours ago
It's not just the tooling. If you use Gemini in opencode it malfunctions in similar ways.
I haven't tried 3.1 yet, but 3 is just incompetent at tool use. In particular in editing chunks of text in files, it gets very confused and goes into loops.
The model also does this thing where it degrades into loops of nonsense thought patterns over time.
For shorter sessions where it's more analysis than execution, it is a strong model.
We'll see about 3.1. I don't know why it's not showing in my gemini CLI as available yet.
pickle-pixel 3 hours ago
does it still crash out after couple prompts?
taytus 2 hours ago
Another preview model? Why google keep doing this?
Filip_portive 3 hours ago
My new comment
techgnosis 5 hours ago
I'd love a new Gemini agent that isn't written with Node.js. Not sure why they think that's a good distribution model.
CamperBob2 4 hours ago
(Shrug) Ask it to write one!
jcims 5 hours ago
Pelican on a bicycle in drawio - https://imgur.com/a/tNgITTR
(FWIW I'm finding a lot of utility in LLMs doing diagrams in tools like drawio)
pqdbr 5 hours ago
How are you prompting it to draw diagrams in drawio
jcims 4 hours ago
Sometimes it helps to also provide a drawio file that has the elements you wan't (eg. cloud service icons or whatever), but you just feed it the content you want diagrammed and let it eat.
Even if it's not completely correct, it usually creates something that's much closer to complete than a blank page.
ac29 5 hours ago
Drawio drawings are just XML, its possible it can generate that directly