TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li, Yousong Zhu, Chaoyang Zhao, Ming Tang, Jinqiao Wang
TL;DR
TraceVision targets human-like spatial understanding in vision-language models by modeling continuous human attention trajectories and fusing them with visual and textual signals. It introduces a Trajectory-aware Visual Perception (TVP) module and a semantic-guided Douglas-Peucker trajectory simplification to enable precise region reasoning, supported by the 320k-sample Reasoning-based Interactive Localized Narratives (RILN) dataset. A lightweight segmentation decoder and a three-stage training curriculum add trajectory-guided captioning, trajectory prediction, and grounded segmentation, with state-of-the-art results reported across image, video, and region-grounding tasks. The approach is presented as efficient and interpretable, and as a basis for intuitive spatial interaction and human-aligned visual understanding in multimodal systems.
Abstract
Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding and struggle both to simulate human visual attention trajectories and to explain how descriptions relate to specific image regions. We propose TraceVision, a unified vision-language model that integrates trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design a geometric simplification method that extracts semantic keypoints from raw trajectories, and we propose a three-stage training pipeline in which trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.
