Gemini 3 Pro, Opus 4.5 or GPT-5.1?
    Created: 2025-12-07
    Tags: OpenAI, Google Gemini, Anthropic

    Quick overview

    This article explores the differences I've observed between the latest LLMs. Is there one model to rule them all, or are they all equal? I'll share my thoughts after extensively testing Gemini 3 Pro, Opus 4.5, and GPT-5.1.

    Benchmarks vs reality

    When a new LLM is introduced, companies often create eye-catching diagrams to ensure it gets noticed, highlighting how their brand-new model beats existing state-of-the-art models on benchmarks. This is simply advertising.

    But what about real-world usage? Do those benchmarks really matter when you want to use an LLM for your daily tasks? And is there truly a "best" model for everything?

    To understand why my experience differs from the charts, we first need to look at the purpose of benchmarks themselves.

    Why do benchmarks exist?

    To be clear, benchmarks are useful, especially when it comes to algorithms. It is a really good idea to have benchmarks for systems that behave deterministically. However, LLMs are not deterministic; they are probabilistic, which completely changes the game. That said, benchmarks aren't useless for LLMs—they can give you a general idea of how a model performs on certain tasks—but they should not be treated as the ultimate truth.
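
    To make the non-determinism point concrete, here is a minimal sketch (the askModel function and Task type are made up for illustration, not any real benchmark API): if you score the same task several times, a probabilistic model can pass on one run and fail on the next, so a single headline number hides real variance.

    ```ts
    // Hypothetical sketch: scoring the same benchmark task over multiple runs.
    // `askModel` stands in for whatever client you use; nothing here is a real benchmark API.
    type Task = { prompt: string; isCorrect: (answer: string) => boolean };

    async function passRate(
      askModel: (prompt: string) => Promise<string>,
      task: Task,
      runs = 10,
    ): Promise<number> {
      let passes = 0;
      for (let i = 0; i < runs; i++) {
        // With sampling temperature > 0, each run can produce a different answer.
        const answer = await askModel(task.prompt);
        if (task.isCorrect(answer)) passes++;
      }
      // e.g. 0.7 on one evaluation and 0.9 on the next: same model, same task.
      return passes / runs;
    }
    ```

    A published benchmark score is essentially one sample from that distribution, which is why I treat it as a rough signal rather than the final word.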

    What matters to me in real life usage?

    When I use an LLM for my daily tasks, I prioritize a few key aspects:

    • Accuracy: This is self-explanatory. I don't want to use a model that confidently provides wrong answers.
    • Coherence: How much context a model needs heavily depends on the task. When working on frontend code, does the model actually need to know every best practice for typography, colors, and spacing? Or is it enough for it to simply mimic your existing style? If you expect creativity, coherence is less critical. But if you want the model to follow your specific coding style, coherence is crucial. This also touches on "known unknowns" versus "unknown unknowns." As a human, you might treat React best practices as a known unknown—you know you don't know them all. A model, however, might treat them as unknown unknowns, unaware they even exist. Therefore, coherence is vital for guiding the model in the right direction.
    • Tools usage: Does the model call MCP tools where needed? Or does it execute terminal commands twice or get stuck in a loop? I had significant issues with Gemini 2.5 Pro—whenever it needed to create a new file, it would fail (the log would say "file xxx.txt created," but the file wouldn't actually exist). This didn't matter for tasks like drafting or grammar checking, but it rendered the model useless for coding.
    • Usage of latest knowledge: I simply prefer when models utilize the latest patterns, libraries, and best practices. This is especially important in the software development world, where things change rapidly. A good example is working with Next.js Server Components (see the sketch after this list)—you gain performance benefits by using them, but if the model has no recollection of them, it will produce suboptimal code.
    • Speed: How fast the model responds. For coding, this is generally of lower importance, but waiting a minute for a simple adjustment that would take me 10 seconds is annoying. An honorable mention here goes to GPT-5.1-Codex—it is significantly slower than the others (it shines in other areas, but speed is not one of them).
    • Independence: I appreciate Gemini 3 Pro's way of working—it is capable of making its own assumptions, and often good ones. However, it starts to become annoying when you need to start everything from scratch because the model decided it has something in its "memory" even though you never mentioned it. So, this trait can be both good and bad, depending on the situation.
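
    To illustrate the Server Components point above, here is a minimal sketch, assuming a Next.js App Router project (the file path, ProductsPage component, and fetch URL are made up for illustration). A model that has no recollection of Server Components tends to reach for 'use client' with useState/useEffect fetching instead, which ships more JavaScript to the browser for the same result.

    ```tsx
    // app/products/page.tsx (hypothetical file)
    // Without App Router knowledge, a model often produces the client-side pattern:
    //   'use client'; + useState + useEffect(() => fetch(...), [])
    // The Server Component below fetches on the server instead and can be cached by Next.js.
    export default async function ProductsPage() {
      const res = await fetch('https://example.com/api/products', {
        next: { revalidate: 60 }, // Next.js fetch extension: revalidate the data every 60 seconds
      });
      const products: { id: string; name: string }[] = await res.json();

      return (
        <ul>
          {products.map((p) => (
            <li key={p.id}>{p.name}</li>
          ))}
        </ul>
      );
    }
    ```

    When a model keeps emitting the useEffect version for pages like this, that is usually a sign its training data predates or underweights the App Router.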

    I have intentionally omitted costs and so-called "creativity" here. Costs, because I utilize GitHub Copilot, where prices are hidden behind "premium requests" (so I do not have actual costs per request). Creativity, because I find it subjective, plus the same prompt can yield different results each time. On top of that, I have a feeling that the latest generation of LLMs are trained on similar data, thus yielding similar results.

    So many models, methodologies, platforms

    Navigating this landscape is becoming increasingly difficult. It’s not just about the models anymore; it’s about the methodologies (BMAD, SpecKit, OpenSpec, Agent OS, etc.) and the explosion of platforms (Cursor, Claude Code, Codex, GitHub Copilot, Antigravity, Lovable, v0, and many more). I love testing out new tools, but keeping up with all of them is a full-time job in itself.

    The race for the best model is on, and each company tries to push their own platform as the best one. This creates fragmentation and confusion. And the pace of change is so high that by the time you get used to one platform, a new model or methodology is released. If you wanted to explore all those options, you would need to spend weeks just testing and learning them. On the other hand, if you stay with the one platform you feel comfortable with, you may miss the next big thing.

    Think about what this means for the industry as a whole. Will we be forced to use multiple platforms and models? Or will one model dominate the market? Neither vision is very appealing, in my opinion.

    My "internal" benchmark results

    To make things more "benchmark-like," I will try to summarize my personal experience after using Gemini 3 Pro, Opus 4.5, and GPT-5.1/GPT-5.1-Codex for a few weeks now.

    | Feature / Model | Gemini 3 Pro | Opus 4.5 | GPT-5.1 / GPT-5.1-Codex |
    | --- | --- | --- | --- |
    | Accuracy | High | High | High |
    | Coherence | Medium-High | High | High |
    | Tools usage | Medium-High | High | Medium-High |
    | Usage of latest knowledge | High | High | High |
    | Speed | Fast | Fast | Slow |
    | Independence | Medium | Low | Low |

    This table does not reflect absolute truth, just my personal experience.

    It feels like the gap between top-tier models is narrowing, but the details and quirks still make a big difference. Before these three models were introduced, I was working mostly with Sonnet 4.5 as my main LLM for coding. Today, Opus 4.5 and GPT-5.1 have replaced it, and only a small subset of my tasks is handled by Gemini 3 Pro.

    Let's unpack the reasoning behind these ratings:

    • Gemini 3 Pro is really good when it comes to independence and making its own assumptions. This is great for brainstorming, editorial work, etc. But when it comes to coding, this independence can be a problem. I had multiple situations where Gemini 3 Pro just made something up, even though I never mentioned it. This can be incredibly frustrating. I do see a reason to use this one in Antigravity (it may have some unique capabilities), but for regular coding tasks—I'll pass.
    • Opus 4.5 remains somewhat of an enigma. It feels better than Sonnet 4.5 overall, partly thanks to more recent training data. I found small things here and there that make it a bit annoying, e.g., running commands in the terminal twice. It does not yield wrong results, but it gives the feeling that the model is not fully polished yet. Don't get me wrong—it is my "best pick" for coding tasks right now, but it does not feel like a game-changer compared to Sonnet 4.5 (benchmarks try to paint a different picture, but as I mentioned before, benchmarks are not the ultimate truth).
    • GPT-5.1 (the Codex variant, actually) is accurate and coherent, but its speed is a problem. I really like that this model tries to do exactly what it was asked to do. It does not feel like a coding assistant, but rather a heavy LLM made for a completely different purpose that happened to end up as a developer tool. It is hard to explain, but once you use it, you will know what I mean.

    Recommendations

    I have watched some interesting YouTube videos comparing some of the methodologies. Maybe you will find them useful:

    Conclusion

    So is there "the best" model for everything?

    Unfortunately, I'll quote the dev community here: "it depends." The type of task is the most important factor, I believe.

    I like to think about LLMs as tools that have different characteristics and vibes. Sometimes you just like that yellow-red screwdriver for no reason (that is Opus 4.5 for me), but any screwdriver will do.

    Whether you are making a small fix or doing a full house renovation, what you really need is a good plan. You can have the best power tools in the world, but if you don't have a solid blueprint, your renovation will fail. Similarly, an LLM is just a force multiplier for your own engineering skills. Focus on your architecture and requirements first; the specific model is just the screwdriver. If you obsess over picking the "best tool" for the job, you'll miss what really matters: a good plan and solid execution.


    I’m curious—have you found a specific workflow where Gemini 3 Pro actually beats the others? Or are you sticking with Opus for now? Drop a comment below; I’d love to hear your take.

    And as always - happy coding! 🚀