Claude Sonnet 4.5 is Anthropic's Best Model Yet - My Testing vs. GPT-5

Onur (Honor)
2025-10-06 • 5 min read

I've been using Claude as my main AI tool for over a year now. I've written about previous versions, tested every major release, and it's still my daily driver. When Anthropic dropped Sonnet 4.5 on September 29th, I cleared my afternoon and started testing. Two weeks later, I have opinions.

The short version: hell yeah to pretty much everything. It's faster, smarter, and somehow less annoying than before. But let me give you the full picture.

Why Claude is Still My Daily Driver

Here's the thing about Claude that's hard to explain until you use it daily: it gets what you mean. Not just what you typed - what you actually meant.

I can write a half-formed thought, full of assumptions and shortcuts, and Claude figures it out. GPT-5 is smart, but sometimes it takes my words too literally. Claude reads between the lines.

Is that quantifiable? No. Does it matter when you're cranking through real work? Absolutely.

The Benchmarks That Actually Matter

Let's talk numbers, but only the ones that translate to real work.

Coding: 77.2% on SWE-bench Verified. That's the test where AI has to fix actual bugs from real GitHub repos—not toy problems, real code. GPT-5 Codex hits 74.5%. Close, but Claude edges it out.

More importantly: Replit went from a 9% error rate on Sonnet 4 to 0% on their code editing benchmark. Zero percent. That's the kind of improvement you actually feel when you're debugging at 11pm.

Computer use: 61.4% on OSWorld, up from 42.2% just four months ago. That's Claude clicking around in actual software, filling forms, navigating websites. The improvement here is wild.

Long tasks: Claude can now maintain focus for over 30 hours on complex, multi-step tasks. That's not a typo. Thirty hours of sustained work on one problem without losing the thread. This is what makes agentic AI actually useful—the model doesn't forget what it was doing.

My Actual Testing: Claude vs GPT-5

Benchmarks are one thing. Using these models for real work is another. Here's what I noticed after two weeks of throwing actual tasks at both.

Business writing: Claude wins. The tone is more natural, less "AI voice." GPT-5 writes competent copy, but Claude writes copy that sounds like a human wrote it. I spend less time editing.

Code debugging: Claude wins, especially on complex problems. When I need to understand why something's breaking across multiple files, Claude traces through the logic better. GPT-5 sometimes fixes the symptom instead of the root cause.

Quick code generation: Tie. Both are excellent at cranking out functions, components, scripts. GPT-5 might be slightly faster on simple stuff.

Complex architecture planning: Claude wins with extended thinking enabled. When I need to map out how pieces fit together, Claude's reasoning mode produces better structured analysis.

Web research: GPT-5 wins. Deep Research is still better for pulling information from across the web. Claude's web access feels more limited.

Image generation: GPT-5 by default - Claude doesn't generate images at all.

The Annoying Stuff (It's Better Now)

Look, I love Claude, but previous versions had some habits that drove me nuts.

The sycophancy. "You're absolutely right!" before every response. "That's a great question!" No, it's not a great question, just answer it.

Anthropic specifically called this out: Sonnet 4.5 is their "most aligned frontier model," with "large improvements across several areas of alignment" including reduced sycophancy, deception, and power-seeking behaviors.

After two weeks, I can confirm: it's noticeably better. Claude still has personality - it's not cold or robotic - but it's dropped the constant validation. It feels more like talking to a confident colleague than a people-pleaser.

The overengineering is also improved. Ask Claude to build something simple, and previous versions would sometimes architect a spaceship when you needed a bicycle. Sonnet 4.5 is more calibrated to the actual scope of what you're asking for.

What This Costs

API pricing stayed the same: $3 per million input tokens, $15 per million output. No price increase despite significant improvements. GPT-5 is actually cheaper - about $1.25/$10 per million tokens - but you get what you pay for.
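To make the "cheaper but you get what you pay for" point concrete, here's a quick back-of-the-envelope cost comparison using the per-token prices above. The monthly token counts are made-up illustrative numbers, not measurements from my own usage:

```python
# Rough API cost comparison using the prices quoted above:
# Claude Sonnet 4.5 at $3/$15 and GPT-5 at ~$1.25/$10 per million
# input/output tokens. Token volumes below are invented for illustration.

def cost_usd(input_tokens, output_tokens, in_price, out_price):
    """Dollar cost given per-million-token input/output prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical month: 2M input tokens, 500K output tokens.
claude = cost_usd(2_000_000, 500_000, 3.00, 15.00)
gpt5 = cost_usd(2_000_000, 500_000, 1.25, 10.00)

print(f"Claude Sonnet 4.5: ${claude:.2f}")  # $13.50
print(f"GPT-5:             ${gpt5:.2f}")    # $7.50
```

At that volume the gap is real but small in absolute terms, which is why the $20/month subscriptions matter more for most people than the per-token rates.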

For most people, you're not paying per token anyway. Both Claude Pro and ChatGPT Plus are $20/month. Same price, different tools. At this point, try both and see which one clicks with how you work.

The 200K token context window is plenty for most business use. GPT-5's 400K window is bigger on paper, but I rarely hit limits with either.
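If you want to sanity-check whether a document will fit before pasting it in, a common rule of thumb is roughly 4 characters per token for English text. This is a rough heuristic of my own, not anything from Anthropic or OpenAI's docs; real tokenizer counts vary, so leave headroom:

```python
# Quick-and-dirty check: will this text fit in a model's context window?
# Assumes the common ~4 characters per token heuristic for English text.
# Real tokenizers differ, so reserve generous room for the model's reply.

def fits_in_context(text: str, window_tokens: int,
                    reserve_tokens: int = 8_000) -> bool:
    """True if the rough token estimate fits, leaving reply headroom."""
    estimated_tokens = len(text) // 4
    return estimated_tokens + reserve_tokens <= window_tokens

doc = "word " * 50_000  # ~250K characters of filler text
print(fits_in_context(doc, 200_000))      # ~62,500 est. tokens -> True
print(fits_in_context(doc * 4, 200_000))  # ~250,000 est. tokens -> False
```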

The Real-World Results

It's not just me seeing improvements. The companies actually building on Claude are reporting serious gains:

Cognition's Devin (the AI coding agent) saw an 18% increase in planning performance and 12% in end-to-end scores - "the biggest jump since Claude Sonnet 3.6."

GitHub Copilot, Cursor, Replit - they're all either switching to or keeping Claude for their AI features. When the companies building developer tools choose Claude for their own products, that tells you something.

Bottom Line: My Honest Scorecard

| Category | Winner | Margin |
| --- | --- | --- |
| Business writing | Claude | Clear |
| Code debugging | Claude | Clear |
| Quick code generation | Tie | - |
| Architecture planning | Claude | Clear |
| Web research | GPT-5 | Clear |
| Image generation | GPT-5 | N/A (Claude can't) |
| Long-running tasks | Claude | Clear |
| Understanding intent | Claude | Clear |
| Price (API) | GPT-5 | ~50% cheaper |
| Price (subscription) | Tie | Both $20/mo |

Claude Sonnet 4.5 is a beast. It's noticeably better than Sonnet 4 at everything I use it for. The sycophancy is down, the code quality is up, and it finally stopped over-engineering everything.

GPT-5 is still excellent - especially for web research and anything involving images. And yes, OpenAI tends to overhype their releases (I said as much when GPT-5 launched). But it's a legitimately great model.

For my workflow? Claude is still the main driver. GPT-5 is the specialist I call in when I need its specific strengths.

Should You Care About Any of This?

If you're a small business owner wondering whether to pay attention to AI model releases: probably not at this level of detail. Both Claude and ChatGPT are excellent. Pick one, learn it well, and stop worrying about benchmarks.

If you're already using AI tools for work and want to optimize: try Claude Pro for a month. The $20 is worth it to see if it clicks with your workflow. For code, writing, and analysis work, Sonnet 4.5 is the best I've used.

And if you're still confused about where AI fits into your business at all, that's a different conversation—one that's worth having with someone who can look at your actual situation, not a benchmark table. Drop me a line if you want to talk through it.


Written by Onur

I'm Onur. I build software for Central Coast small businesses. When your website breaks, when you need a custom tool, when tech gets confusing—I'm the guy you call. I answer the phone, I explain things without the jargon, and I build things that actually work. No AI hype, no endless meetings, just practical solutions using technology that's been around long enough to be reliable.