Vision-Language Memory for Spatial Reasoning

Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang

TL;DR

This work addresses the challenge of video-based spatial reasoning by eliminating reliance on explicit 3D inputs and introducing a view-consistent, 3D-aware representation learned from 2D video. It couples this representation with a dual-memory system comprising a sliding-window Working Memory and a fixed-capacity Episodic Memory to support long-horizon reasoning while keeping computation bounded. The key innovations include Adaptive 3D Position Injection, Viewpoint-Aware Geometry Alignment, and a semantic-geometric fusion strategy that yields stable cross-view representations, plus memory fusion with gated updates. Empirical results on VSI-Bench, VSTI-Bench, ScanQA, and SQA3D show state-of-the-art performance among video-only models and strong performance relative to 3D-input baselines, demonstrating the practical potential for robust, memory-driven spatial understanding in dynamic scenes.

Abstract

Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representations and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning that builds a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient, long-horizon spatial reasoning at a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
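
To make the fixed-cost claim concrete, the following is a minimal sketch of how a sliding-window working memory and a fixed-capacity episodic memory can be combined so that the context handed to the model never grows with video length. It is an illustration under stated assumptions, not the paper's implementation: the class name `DualMemory`, the scalar `gate`, and the rule "blend an evicted frame into its most similar episodic slot" are all hypothetical choices, and the actual model presumably uses learned gates and attention-based fusion over high-dimensional features.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class DualMemory:
    """Hypothetical dual memory: sliding window plus fixed-capacity store."""

    def __init__(self, window=4, capacity=8, gate=0.5):
        self.working = deque(maxlen=window)  # most recent frames only
        self.episodic = []                   # at most `capacity` slots
        self.capacity = capacity
        self.gate = gate                     # assumed fixed; learned in practice

    def _consolidate(self, feat):
        # While there is room, store the evicted frame as a new slot.
        if len(self.episodic) < self.capacity:
            self.episodic.append(list(feat))
            return
        # Otherwise blend it into the most similar slot (gated update),
        # so the number of slots stays constant.
        idx = max(range(len(self.episodic)),
                  key=lambda i: cosine(self.episodic[i], feat))
        g = self.gate
        self.episodic[idx] = [(1 - g) * s + g * f
                              for s, f in zip(self.episodic[idx], feat)]

    def observe(self, feat):
        # Consolidate the oldest frame before the window drops it.
        if len(self.working) == self.working.maxlen:
            self._consolidate(self.working[0])
        self.working.append(list(feat))

    def context(self):
        # Size is bounded by window + capacity, independent of video length.
        return self.episodic + list(self.working)


mem = DualMemory(window=4, capacity=8)
for t in range(100):
    mem.observe([math.cos(t), math.sin(t), 1.0])  # toy 3-D frame features
print(len(mem.context()))
```

However many frames are observed, the context length stays at most `window + capacity`, which is the property the abstract describes as a fixed computational cost.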

Paper Structure

This paper contains 48 sections, 8 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: VLM$^2$ is a Vision-Language Model with Memory for long-horizon spatial reasoning that constructs view-consistent 3D-aware representations from 2D video and maintains persistent memory over time. Such capabilities are critical for questions like "How many chairs are in this room?", which require both consistent cross-view alignment and long-horizon memory.
  • Figure 2: Overview of the VLM$^2$ architecture. Our model constructs a view-consistent 3D-aware representation via adaptive 3D position injection, viewpoint-aware geometry alignment, and semantic-geometric fusion. A dual-memory module with a sliding-window working memory and a fixed-capacity episodic memory maintains these representations over time, supporting long-horizon spatial reasoning.
  • Figure 3: Qualitative examples on VSI-Bench [yang2025thinking].
  • Figure 4: Qualitative examples on VSI-Bench [yang2025thinking].
  • Figure 5: Qualitative examples on VSI-Bench [yang2025thinking].
  • ...and 2 more figures