Beyond Pattern Recognition: Probing Mental Representations of LMs
Moritz Miller, Kumar Shridhar
TL;DR
This work interrogates whether language models truly build evolving mental representations when information is revealed incrementally, rather than relying on static pattern recognition. By contrasting a stepwise Mental Modeling approach with standard full-prompt Chain-of-Thought on the MathWorld dataset across text and multimodal models, it demonstrates that LMs struggle to maintain coherent internal representations as depth increases, though vision-based mental modeling yields notable gains. The study finds modality-specific improvements, with Vision Mental Modeling outperforming Text Mental Modeling and modality switching offering potential gains, while Vision-only Inference underperforms CoT, suggesting fundamental differences from human cognition. These results motivate leveraging visual grounding and refined incremental-context strategies to enhance reasoning in future models, and they highlight the need for more robust cross-modal alignment to approach human-like incremental reasoning.
Abstract
Language Models (LMs) have demonstrated impressive capabilities in solving complex reasoning tasks, particularly when prompted to generate intermediate explanations. However, it remains an open question whether these intermediate reasoning traces represent a dynamic, evolving thought process or merely reflect sophisticated pattern recognition acquired during large-scale pre-training. Drawing inspiration from human cognition, where reasoning unfolds incrementally as new information is assimilated and internal models are continuously updated, we examine the mental models of various LMs. We propose a new way to assess the mental modeling of LMs: problem details are provided gradually, so that each new piece of information builds upon and refines the model's internal representation of the task. We systematically compare this step-by-step mental modeling strategy with traditional full-prompt methods across both text-only and vision-and-text modalities. Experiments on the MathWorld dataset across different model sizes and problem complexities confirm that both text-based LLMs and multimodal LMs struggle to create mental representations, which raises questions about how their internal cognitive processes work.
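The contrast between full-prompt inference and incremental mental modeling can be illustrated with a minimal sketch. It is not the authors' implementation: the `Model` type, both function names, and all prompt wording are hypothetical, and the model is assumed to be any callable that maps a prompt string to a response string.

```python
from typing import Callable, List

# Hypothetical stand-in for any LM: maps a prompt string to a response string.
Model = Callable[[str], str]


def full_prompt_answer(model: Model, sentences: List[str], question: str) -> str:
    """Baseline: reveal the whole problem at once and ask for step-by-step reasoning."""
    prompt = " ".join(sentences) + "\n" + question + "\nLet's think step by step."
    return model(prompt)


def incremental_answer(model: Model, sentences: List[str], question: str) -> str:
    """Incremental: reveal one sentence at a time and carry a running representation forward."""
    state = "(empty)"  # assumed textual placeholder for the initial representation
    for sentence in sentences:
        prompt = (
            f"Current representation of the problem:\n{state}\n"
            f"New information: {sentence}\n"
            "Update the representation."
        )
        state = model(prompt)  # the model rewrites its own representation
    # Only the final representation, not the original text, is used to answer.
    return model(f"Representation of the problem:\n{state}\nQuestion: {question}")
```

Under these assumptions, the baseline issues a single model call, whereas the incremental variant issues one call per revealed sentence plus one to answer, so any error in an intermediate representation propagates to the final answer.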
