Models Keep Getting Stronger, but 'Strongest' Has No Single Answer
Top models tie on GPQA at 92–94%, yet their real-world results diverge. Three skills are now pulling apart: solving hard problems, handling rough tasks, and open-ended exploration. The last is the hardest to measure.
7 min read