The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Vernon Y. H. Toh, Yew Ken Chia, Deepanway Ghosal, Soujanya Poria
TL;DR
The paper analyzes the multimodal reasoning capabilities of GPT-[n] and o-[n] models on PuzzleVQA and AlgoPuzzleVQA, tracking performance across model generations and reasoning modes to assess progress toward generalized multimodal intelligence. By evaluating both multiple-choice and open-ended formats and conducting a bottleneck analysis, the study finds that the o-[n] series, particularly o3 and o4-mini, substantially outperforms GPT-[n] models and scales with increased reasoning depth, yet still faces persistent challenges in fine-grained visual perception and complex algorithmic or combinatorial tasks. The work expands prior benchmarks by adding open-ended assessments and a detailed error analysis, highlighting perceptual grounding and inductive reasoning as primary bottlenecks. These findings delineate concrete directions for future AGI research, emphasizing that mere architectural scaling may be insufficient without advances in robust perception and structured reasoning. The authors also provide open-source resources for ongoing model tracking and evaluation.
Abstract
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA, which demand fine-grained visual perception. Our results reveal that the o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperforms the GPT-[n] series and shows strong scalability in multimodal reasoning. Nonetheless, our findings highlight that even these leading models face persistent challenges. Difficulties are particularly evident in tasks requiring precise visual perception, robust compositional reasoning across multiple visual attributes, and solving complex algorithmic or highly combinatorial puzzles, indicating critical areas for future AGI development. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.
