
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding

Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou

TL;DR

This work introduces PhysiCo, a dual-branch summative assessment to quantify LLMs' understanding of physical concepts, contrasting low-level memorization with high-level abstract reasoning framed in grid representations inspired by ARC. Across a broad suite of models, the study shows that LLMs reach near-human accuracy on text-based concept recognition but lag humans by about 40% on grid-based high-level tasks, demonstrating a robust stochastic parrot phenomenon. Additional experiments reveal that neither in-context learning nor modest fine-tuning on grid-format data improves high-level performance, suggesting intrinsic limitations in deep understanding beyond formatting or data exposure. The findings have implications for evaluating and advancing AI systems' world-modeling and embodied reasoning capabilities, highlighting the need for new training regimes or architectures to achieve substantive conceptual understanding. PhysiCo thus provides a quantitative framework for detecting and characterizing the gap between memorization and genuine understanding in LLMs.

Abstract

We systematically investigate a widely asked question: do LLMs really understand what they say? This question relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue by using grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies with other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on same-formatted data added little to their performance.

Paper Structure

This paper contains 51 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Illustration of a "Stochastic Parrot" by our PhysiCo task consisting of both low-level and high-level subtasks in parallel. For a concept Gravity, an LLM can generate its accurate description in natural language, but cannot interpret its grid-format illustration.
  • Figure 2: Examples of input-output grids labeled as Gravity, with increasing difficulty levels.
  • Figure 3: Overview of the research questions answered in our study and their relationships.
  • Figure 4: The prompt template used for generating descriptions of physical concepts (denoted as the variable CONCEPT) in \ref{rq:textual_input}.
  • Figure 5: The prompt template used for guessing the referred physical concept from four candidates (denoted as the variable CANDIDATE ANSWERS) from the natural language descriptions (denoted as the variable MASKED DESCRIPTION) in \ref{rq:textual_input}.
  • ...and 3 more figures