Table of Contents
Fetching ...

Beyond Pattern Recognition: Probing Mental Representations of LMs

Moritz Miller, Kumar Shridhar

TL;DR

This work interrogates whether language models truly build evolving mental representations when information is revealed incrementally, rather than relying on static pattern recognition. By contrasting a stepwise Mental Modeling approach with standard full-prompt Chain-of-Thought on the MathWorld dataset across text and multimodal models, it demonstrates that LMs struggle to maintain coherent internal representations as depth increases, though vision-based mental modeling yields notable gains. The study finds modality-specific improvements, with Vision Mental Modeling outperforming Text Mental Modeling and modality switching offering potential gains, while Vision-only Inference underperforms CoT, suggesting fundamental differences from human cognition. These results motivate leveraging visual grounding and refined incremental-context strategies to enhance reasoning in future models, and they highlight the need for more robust cross-modal alignment to approach human-like incremental reasoning.

Abstract

Language Models (LMs) have demonstrated impressive capabilities in solving complex reasoning tasks, particularly when prompted to generate intermediate explanations. However, it remains an open question whether these intermediate reasoning traces represent a dynamic, evolving thought process or merely reflect sophisticated pattern recognition acquired during large scale pre training. Drawing inspiration from human cognition, where reasoning unfolds incrementally as new information is assimilated and internal models are continuously updated, we propose to delve deeper into the mental model of various LMs. We propose a new way to assess the mental modeling of LMs, where they are provided with problem details gradually, allowing each new piece of data to build upon and refine the model's internal representation of the task. We systematically compare this step by step mental modeling strategy with traditional full prompt methods across both text only and vision and text modalities. Experiments on the MathWorld dataset across different model sizes and problem complexities confirm that both text-based LLMs and multimodal LMs struggle to create mental representations, questioning how their internal cognitive processes work.

Beyond Pattern Recognition: Probing Mental Representations of LMs

TL;DR

This work interrogates whether language models truly build evolving mental representations when information is revealed incrementally, rather than relying on static pattern recognition. By contrasting a stepwise Mental Modeling approach with standard full-prompt Chain-of-Thought on the MathWorld dataset across text and multimodal models, it demonstrates that LMs struggle to maintain coherent internal representations as depth increases, though vision-based mental modeling yields notable gains. The study finds modality-specific improvements, with Vision Mental Modeling outperforming Text Mental Modeling and modality switching offering potential gains, while Vision-only Inference underperforms CoT, suggesting fundamental differences from human cognition. These results motivate leveraging visual grounding and refined incremental-context strategies to enhance reasoning in future models, and they highlight the need for more robust cross-modal alignment to approach human-like incremental reasoning.

Abstract

Language Models (LMs) have demonstrated impressive capabilities in solving complex reasoning tasks, particularly when prompted to generate intermediate explanations. However, it remains an open question whether these intermediate reasoning traces represent a dynamic, evolving thought process or merely reflect sophisticated pattern recognition acquired during large scale pre training. Drawing inspiration from human cognition, where reasoning unfolds incrementally as new information is assimilated and internal models are continuously updated, we propose to delve deeper into the mental model of various LMs. We propose a new way to assess the mental modeling of LMs, where they are provided with problem details gradually, allowing each new piece of data to build upon and refine the model's internal representation of the task. We systematically compare this step by step mental modeling strategy with traditional full prompt methods across both text only and vision and text modalities. Experiments on the MathWorld dataset across different model sizes and problem complexities confirm that both text-based LLMs and multimodal LMs struggle to create mental representations, questioning how their internal cognitive processes work.

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 2: Accuracy of CoT, VoI, and $\texttt{Vis}_{\text{MM}}$ over increasing depth across models Llama3.2-11B and Llama3.2-90B. Line styles differentiate prompting techniques, while colors distinguish model sizes.
  • Figure 3: Correct predictions at each intermediate step for depth 6 problems in Llama3.2-11B. Blue represents $\texttt{Text}_{\text{MM}}$, while green indicates $\texttt{Vis}_{\text{MM}}$ applied to all incorrect predictions as identified by an Oracle verifier.
  • Figure 4: Graphical representation of difference relationships for a depth 2 problem in Vision Mental Modeling.
  • Figure 5: Complete graph representation for Vision-only Inference.