LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models

Shuai Wang, Daoan Zhang, Tianyi Bai, Shitong Shao, Jiebo Luo, Jiaheng Wei

TL;DR

LAST addresses the challenge that vision-language systems struggle with 3D spatial and long-duration video understanding. It proposes visual chains of thought that inject spatial-temporal reasoning into general vision-language systems by leveraging external tools to generate visual tokens. The approach yields robust gains in zero-shot and fine-tuned settings across spatial, video, and image benchmarks, underscoring its generality and practical impact. Overall, LAST offers a path toward more capable, architecture-agnostic visual-language reasoning by thinking in space and time rather than relying solely on text.

Abstract

Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs)? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance on 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning, where we train general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, LAST yields a 15.8% gain on EgoSchema with GPT-4o in a zero-shot manner and a gain of 8.3 on VSI-Bench compared with Qwen2.5-VL-7B.

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, 15 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of text CoT and LAST. (a) CoT for VLMs suffers from a fixed visual context and generates wrong reasoning traces. CoT fails to capture important frames (i.e., it cannot see the basket because frames are missing between the second and third sampled frames) and hallucinates in the 5th frame (wall light, not ceiling light). In contrast, in (b), LAST can think in time (using frame selection tools to re-sample video frames; newly sampled frames are marked in red in Action 1) and think in space (using grounding tools to identify objects). LAST reaches the correct solution by building intermediate visual trajectories.
  • Figure 2: Illustration of the data curation pipeline. In Stage 1, we prompt VLMs with text CoT and retain only samples with the correct answer. In Stage 2, we prompt VLMs to use external visual tools to solve questions that cannot be solved by text CoT. Finally, we collect data with text thinking trajectories from Stage 1 and visual thinking trajectories from Stage 2.
  • Figure 3: Percentage of tools GPT-4o uses for different benchmarks.
  • Figure 4: Qualitative comparison of text-only CoT and LAST on VSI-Bench with GPT-4o. We highlight errors in red. GPT-4o with text CoT fails to understand the correspondence between the different sofas appearing in the video. In contrast, LAST understands the correspondence between the two sofas in the video and reaches the correct solution. For clarity, we enlarge the marks in the frames.
  • Figure 5: Performance on EgoSchema with different numbers of frames for Qwen2.5-VL-7B (baseline), text CoT, and LAST-7B (ours).
  • ...and 3 more figures
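
The two-stage data curation described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Sample`, `Trajectory`, `curate_trajectories`, `text_cot`, and `tool_cot` are hypothetical, the two solvers stand in for whatever VLM prompting is actually used, and filtering Stage 2 outputs by answer correctness is an assumption (the caption states the correctness filter only for Stage 1).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Sample:
    question: str
    answer: str  # ground-truth answer used for filtering


@dataclass
class Trajectory:
    sample: Sample
    kind: str   # "text" (Stage 1) or "visual" (Stage 2)
    trace: str  # the reasoning trace produced by the solver


# A solver maps a sample to (reasoning trace, predicted answer).
# Both solvers are placeholders for VLM calls (hypothetical interface).
Solver = Callable[[Sample], Tuple[str, str]]


def curate_trajectories(
    samples: Iterable[Sample],
    text_cot: Solver,
    tool_cot: Solver,
) -> List[Trajectory]:
    """Collect text trajectories first, then visual ones for the leftovers."""
    kept: List[Trajectory] = []
    for sample in samples:
        # Stage 1: text-only CoT; keep the trace only if the answer is correct.
        trace, predicted = text_cot(sample)
        if predicted == sample.answer:
            kept.append(Trajectory(sample, "text", trace))
            continue
        # Stage 2: retry unsolved questions with external visual tools.
        # Assumption: these traces are also kept only when correct.
        trace, predicted = tool_cot(sample)
        if predicted == sample.answer:
            kept.append(Trajectory(sample, "visual", trace))
    return kept
```

Running text CoT first means tool-based trajectories are gathered only for questions that text reasoning alone does not solve, which matches the ordering in the caption.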