guanjiawei.ai

Jiawei Guan's Personal Site

Tags:

Evals

1 post

AI, LLM, Evals, Reinforcement Learning, Reflections

Models Keep Getting Stronger, but 'Strongest' Has No Single Answer

Top models tie on GPQA at 92–94%, yet real-world results diverge. Three skills now split: hard problems, rough tasks, open exploration; the last is hardest to measure.

June 3, 2026

7 min read