Limits of Imagery Reasoning in Frontier LLM Models

Sergio Y. Hayashi, Nina S. T. Hirata

Abstract

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" -- a tool capable of rendering and rotating 3D models -- can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.

Paper Structure

This paper contains 24 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Schematic of the visual feedback loop. The LLM functions as the Reasoning Module, issuing rotation commands to the stateful Imagery Module. The imagery module processes these commands and returns a 2D snapshot of the object from the updated viewpoint.
  • Figure 2: SpatialViz 3D Rotation sample problem. The question is: The left image shows the original cube stack made of equal-sized small cubes. Which of the options on the right cannot be obtained by rotating the original cube stack? Please answer from options A, B or C.
  • Figure 3: The models were asked to detect the rotation (direction and angle) required to transform the left image into the right image. The correct answer is "left:30". All tested models answered incorrectly: GPT-5.2 predicted "right:90", GPT-5.1 predicted "right:45", and Gemini-3-Flash predicted "rotate:ccw:35,left:45,up:20". Here, "ccw" means counterclockwise rotation.
  • Figure 4: VGGT correctly recognized the rotation direction and angle.
  • Figure 5: The models were asked to generate an image rotated by 30 degrees to the left (camera space). None of the models were able to do so correctly, or even produce a close approximation.
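The feedback loop shown in Figure 1 -- a reasoning module issuing rotation commands to a stateful imagery module, which returns a 2D snapshot of the updated viewpoint -- can be sketched in a few lines. The sketch below is illustrative only: the class and method names (`ImageryModule`, `rotate`, `snapshot`) are assumptions, the "model" is a toy point set rather than a rendered 3D scene, and the reasoning module (an MLLM in the paper) is stubbed out as a fixed command.

```python
import math

class ImageryModule:
    """Hypothetical stateful imagery module: holds a 3D point set,
    applies rotation commands, and returns a 2D snapshot."""

    def __init__(self, points):
        self.points = [list(p) for p in points]  # mutable 3D state

    def rotate(self, axis, degrees):
        """Apply a rotation command such as rotate('z', 90) in place."""
        t = math.radians(degrees)
        c, s = math.cos(t), math.sin(t)
        # Indices of the two coordinates mixed by a rotation about `axis`.
        i, j = {"x": (1, 2), "y": (0, 2), "z": (0, 1)}[axis]
        for p in self.points:
            p[i], p[j] = c * p[i] - s * p[j], s * p[i] + c * p[j]

    def snapshot(self):
        """Return a 2D view by dropping the depth axis -- the feedback
        the reasoning module would receive after each command."""
        return [(round(p[0], 6), round(p[1], 6)) for p in self.points]

# One turn of the loop: the reasoning module (stubbed here) emits a
# command, the imagery module updates its state and replies with a view.
imagery = ImageryModule([(1.0, 0.0, 0.0)])
command = ("z", 90)        # hypothetical output of the reasoning module
imagery.rotate(*command)
view = imagery.snapshot()  # 2D snapshot fed back for the next turn
```

In the actual system the snapshot would be a rendered image and the command would come from the MLLM's text output; the point of the sketch is only the division of labor -- the imagery module maintains the holistic 3D state, so the reasoning module never has to.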