REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories

Jacob Thompson, Emiliano Garcia-Lopez, Yonatan Bisk

TL;DR

REM tackles the challenge of embodied spatial reasoning by benchmarking how well multimodal LLMs maintain persistent, viewpoint-invariant representations of objects and their relations across egocentric motion. The authors introduce three Blender-generated datasets (Baseline, Single Frame, Full Rotation) with explicit movement annotations and automated QA to probe object permanence, spatial relations, and numerical tracking. Across seven state-of-the-art models, performance degrades with scene complexity and viewpoint changes, revealing systematic gaps in grounding and object individuation. The work provides targeted diagnostics and a scalable benchmark to drive future improvements in spatial understanding for embodied AI.

Abstract

Humans build viewpoint-independent cognitive maps through navigation, enabling intuitive reasoning about object permanence and spatial relations. We argue that multimodal large language models (MLLMs), despite extensive video training, lack this fundamental spatial reasoning capability, a critical limitation for embodied applications. To demonstrate these limitations and drive research, we introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for long-horizon embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans. These findings highlight challenges MLLMs face in developing robust spatial representations from sequential visual input. Consequently, REM provides targeted metrics and diagnostics to foster improved spatial understanding in future models.

Paper Structure

This paper contains 16 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: REM at a glance. Left: top-down plot showing object distribution and camera trajectory. Center: egocentric views from selected frames, simulating an agent's perception during navigation. Right: example question-answer pairs that test different aspects of spatial reasoning: counting, comparison, temporal ordering, and left/right relative positioning.
  • Figure 2: Example length-4 trajectory from the baseline dataset. Models receive the sequence of egocentric visual frames and the corresponding discrete actions ('15° Right', '1m Forward', '15° Right') taken between frames. Evaluating performance on such sequences tests the model's ability to integrate visual perception with known movement for spatial reasoning across changing viewpoints.
  • Figure 3: Full Rotation dataset example scene: (left) top-down scene layout showing object positions, with the camera at the origin, (top-right) view at $0^{\circ}$, and (bottom-right) view after $180^{\circ}$ rotation. Note the red sphere is intentionally duplicated between views, while other objects occupy identical spatial positions but are visually distinct entities. Peripheral objects at the top and bottom of the scene layout maintain visual continuity during camera transitions, preventing empty frames with movement ambiguity.
  • Figure 4: o3 QA performance across three scaling factors: (a) observed object count, (b) observed duplicate count, and (c) trajectory length. Curves show average correctness for Numerical Object Comparison, Temporal Ordering, and Left/Right Positioning tasks.
  • Figure 5: o3 accuracy on the Numerical Object Comparison task vs. the difference in target object counts, with 95% confidence intervals. The 33% random-guess baseline is shown in red.
  • ...and 8 more figures
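
The Figure 2 caption describes trajectories as egocentric frames interleaved with discrete actions such as '15° Right' and '1m Forward'. As a rough illustration of what "integrating known movement" involves, the sketch below dead-reckons a camera pose from such an action list. It is not the benchmark's code: the function name `integrate_actions`, the action-string grammar, and the coordinate convention (heading 0° faces +y, right turns increase the heading clockwise) are all assumptions made for this example.

```python
import math
import re

# Hypothetical action grammar inferred from the Figure 2 caption,
# e.g. "15° Right", "15° Left", "1m Forward". Not the benchmark's own format.
_ACTION = re.compile(r"(\d+(?:\.\d+)?)\s*(°|m)\s+(Right|Left|Forward)")


def integrate_actions(actions, start=(0.0, 0.0, 0.0)):
    """Dead-reckon camera poses from a list of discrete actions.

    Returns one (x, y, heading_deg) pose per frame, so a list of
    N actions yields N + 1 poses. Assumed convention: heading 0 faces +y
    and right turns increase the heading (clockwise).
    """
    x, y, heading = start
    poses = [(x, y, heading)]
    for action in actions:
        match = _ACTION.fullmatch(action)
        if match is None:
            raise ValueError(f"unrecognized action: {action!r}")
        value, unit, direction = float(match[1]), match[2], match[3]
        if unit == "°" and direction in ("Right", "Left"):
            signed = value if direction == "Right" else -value
            heading = (heading + signed) % 360.0
        elif unit == "m" and direction == "Forward":
            # Move along the current heading; position changes, heading does not.
            x += value * math.sin(math.radians(heading))
            y += value * math.cos(math.radians(heading))
        else:
            raise ValueError(f"inconsistent unit/direction in: {action!r}")
        poses.append((x, y, heading))
    return poses


# The three actions quoted in the Figure 2 caption give a length-4 trajectory.
example_poses = integrate_actions(["15° Right", "1m Forward", "15° Right"])
```

Under these assumptions, the three actions from the caption produce four poses, one per frame, ending at a heading of 30° after a 1 m step taken at 15°. A model that reasons correctly about left/right positioning across frames has to perform an equivalent update implicitly.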