TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding

Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li, Yousong Zhu, Chaoyang Zhao, Ming Tang, Jinqiao Wang

TL;DR

TraceVision tackles the challenge of human-like spatial understanding in vision-language models by modeling continuous human attention trajectories and fusing them with visual and textual signals. It introduces a Trajectory-aware Visual Perception (TVP) module and a semantic-guided Douglas-Peucker trajectory simplification to enable precise region reasoning, grounded by the 320k-sample Reasoning-based Interactive Localized Narratives (RILN) dataset. A lightweight segmentation decoder and a three-stage training curriculum further empower trajectory-guided captioning, trajectory prediction, and grounded segmentation, achieving state-of-the-art results across image, video, and region-grounding tasks. The approach demonstrates strong efficiency and interpretability, paving the way for intuitive spatial interaction and human-aligned visual understanding in multimodal systems.

Abstract

Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.
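
The geometric simplification step can be illustrated with the classic Douglas-Peucker algorithm, which the TL;DR names as the basis of the paper's method. The sketch below is the standard, purely geometric version only: the semantic guidance that TraceVision adds is not described in this summary and is not reproduced here. The function names, the `epsilon` tolerance, and the sample trace are all assumptions made for illustration.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (to a itself if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / norm


def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping points that deviate more than epsilon.

    Standard Douglas-Peucker; `epsilon` is a hypothetical tolerance in the
    same units as the coordinates, not a value taken from the paper.
    """
    if len(points) < 3:
        return list(points)

    first, last = points[0], points[-1]

    # Find the interior point farthest from the chord first -> last.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i

    # Every interior point is within tolerance: keep only the endpoints.
    if max_dist <= epsilon:
        return [first, last]

    # Otherwise keep the farthest point and simplify both halves.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


# A made-up mouse trace: a slightly noisy horizontal stroke, then a vertical one.
trace = [(0.0, 0.0), (1.0, 0.05), (2.0, -0.05), (3.0, 0.0), (3.0, 1.0), (3.0, 2.0)]
keypoints = douglas_peucker(trace, epsilon=0.1)
print(keypoints)
```

With this tolerance the noisy points along the horizontal stroke are dropped and only the two endpoints and the corner remain. A larger `epsilon` discards more points, which is the kind of trade-off involved when a dense raw trace is reduced to a small set of keypoints.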

Paper Structure

This paper contains 47 sections, 13 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Multi-modal capabilities of TraceVision across image, video, and segmentation tasks. The model processes traditional captioning, trajectory-guided interpretation and grounding, video sequence analysis, and precise segmentation, demonstrating its versatility in handling diverse visual understanding scenarios with trajectory-based spatial reasoning.
  • Figure 2: TraceVision architecture overview. The model processes trajectory, image, and text inputs through a unified framework. The TVP module performs bidirectional fusion between visual and trajectory features via cross-attention, enabling trajectory-conditioned captioning and text-guided trajectory prediction.
  • Figure 3: Trajectory simplification. The geometric simplification algorithm reduces 410 original points to 37 keypoints while preserving spatial structure.
  • Figure 4: RILN dataset construction pipeline showing the generation of diverse trajectory-based tasks from image-trajectory pairs. The pipeline creates four main task types: referential trajectory interpretation, grounding, interactive reasoning Q&A, and multi-turn dialogue synthesis, with hierarchical reasoning trees spanning from global scene understanding to fine-grained object-level spatial reasoning.
  • Figure 5: Performance analysis visualization comparing trajectory-aware methods across different evaluation metrics and datasets.
  • ...and 6 more figures