yea but i feel like we are over the hill on benchmaxxing, many times a model has...

falloutx · 2026-02-05T19:33:23 1770320003

When Anthropic beats Benchmarks its somehow earned, when OpenAi games it, its somehow about not feeling good at coding.

apetresc · 2026-02-06T03:36:31 1770348991

I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.

manmal · 2026-02-06T05:34:48 1770356088

Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.

AstroBen · 2026-02-05T18:49:57 1770317397

'feel' is no more accurate

not saying there's a better way but both suck

thethimble · 2026-02-05T19:16:17 1770318977

Speak for yourself. I've been insanely productive with Codex 5.2.

With the right scaffolding these models are able to perform serious work at high quality levels.

helloplanets · 2026-02-05T19:34:04 1770320044

He wasn't saying that both of the models suck, but that the heuristics for measuring model capability suck

AstroBen · 2026-02-05T19:17:22 1770319042

..huh?

crorella · 2026-02-05T19:17:18 1770319038

The variety of tasks they can do and will be asked to do is too wide and dissimilar, it will be very hard to have a transversal measurement, at most we will have area specific consensus that model X or Y is better, it is like saying one person is the best coder at everything, that does not exist.

pixl97 · 2026-02-05T19:38:05 1770320285

Yea, we're going to need benchmarks that incorporate series of steps of development for a particular language and how good each model is at it.

Like can the model take your plan and ask the right questions where there appear to be holes.

How wide of architecture and system design around your language does it understand.

How does it choose to use algorithms available in the language or common libraries.

How often does it hallucinate features/libraries that aren't there.

How does it perform as context get larger.

And that's for one particular language.

tavavex · 2026-02-05T19:34:35 1770320075

The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it feels a lot more informative than a simple benchmark because it can shift over time as people individually discover the strong and weak points of what they're using and get better at it.

forrestthewoods · 2026-02-05T19:39:57 1770320397

At the end of the day “feel” is what people rely on to pick which tool they use.

I’d feel unscientific and broken? Sure maybe why not.

But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.

Benchmarks are a sometimes useful to. But we are in prime Goodharts Law Territory.

AstroBen · 2026-02-05T19:44:32 1770320672

yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities

forrestthewoods · 2026-02-05T20:26:02 1770323162

I don’t think this is even remotely true in practice.

I honestly I have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.

The idea that all models have very close performance across all domains is a moderately insane take.

At any given moment the best model for my actual projects and my actual work varies.

Quite honestly Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released no one was particularly excited. It was better with some slightly large numbers but whatever. It took about a month before everyone realized “holy shit this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE bench didn’t mean a damn thing.

karmasimida · 2026-02-05T19:57:32 1770321452

Your feeling is not my feeling, codex is unambiguously smarter model for me