The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou
TL;DR
This work introduces PhysiCo, a dual-branch summative assessment that quantifies LLMs' understanding of physical concepts by contrasting low-level memorization with high-level abstract reasoning framed in grid representations inspired by ARC. Across a broad suite of models, the study shows that LLMs reach near-human accuracy on text-based concept recognition but lag humans by $\sim40\%$ on grid-based high-level tasks, demonstrating a robust stochastic parrot phenomenon. Additional experiments reveal that neither in-context learning nor modest fine-tuning on grid-format data improves high-level performance, suggesting intrinsic limitations in deep understanding rather than problems of formatting or data exposure. The findings have implications for evaluating and advancing AI systems' world-modeling and embodied reasoning capabilities, and they highlight the need for new training regimes or architectures to achieve substantive conceptual understanding. PhysiCo thus provides a quantitative framework for detecting and characterizing the gap between memorization and genuine understanding in LLMs.
Abstract
We systematically investigate a widely asked question, "Do LLMs really understand what they say?", which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue by using grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on data in the same format added little to their performance.
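To make the comparison in finding (2) concrete, the following minimal sketch shows one way to measure the difference between accuracy on text-based questions and accuracy on grid-based questions about the same concepts. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `accuracy` and `understanding_gap`, the choice to report the gap in absolute accuracy, and the toy labels are all hypothetical and do not reproduce any result from the study.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one example is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def understanding_gap(low_level_acc: float, high_level_acc: float) -> float:
    """Hypothetical metric: text-based accuracy minus grid-based accuracy.

    Assumption: the gap is reported as an absolute difference in accuracy.
    """
    return low_level_acc - high_level_acc


# Toy placeholder labels, NOT data from the paper.
answers = ["gravity", "reflection", "inertia", "buoyancy"]
text_predictions = ["gravity", "reflection", "inertia", "buoyancy"]
grid_predictions = ["gravity", "inertia", "reflection", "buoyancy"]

low_acc = accuracy(text_predictions, answers)
high_acc = accuracy(grid_predictions, answers)
gap = understanding_gap(low_acc, high_acc)
print(f"text: {low_acc:.2f}, grid: {high_acc:.2f}, gap: {gap:.2f}")
```

A large positive gap for a model, alongside a small gap for human annotators on the same items, is the pattern the abstract describes as the stochastic parrot phenomenon.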
