Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
Jingtian Ma, Jingyuan Wang, Wayne Xin Zhao, Guoping Liu, Xiang Wen
TL;DR
This work tackles traffic scene understanding by bridging spatio-temporal data with vision-language models. It introduces ST-CLIP, a CLIP-based framework augmented with SCAMP, a prompt learning method that generates spatio-temporal context-aware, multi-aspect prompts and applies a bi-level attention scheme to jointly reason about low-level visual cues and high-level semantic relations. The method leverages dynamic trajectory-derived ST-context representations, segment-level features, and tracklet-based trajectory context to produce autonomous, multi-aspect scene descriptions under few-shot settings. Experiments on Beijing and Chengdu datasets show ST-CLIP outperforming strong baselines across four aspects (scene, surface, width, and accessibility), with ablations highlighting the contribution of the ST-context and the bi-level prompt attention. The work points to a practical path toward semantically enriched traffic maps by integrating spatio-temporal information with large pre-trained multimodal models, and it outlines future enhancements for generative narratives and richer environmental context.
Abstract
Nowadays, navigation and ride-sharing apps have collected numerous images with spatio-temporal data. A core technology for analyzing such images and their associated spatio-temporal information is Traffic Scene Understanding (TSU), which aims to provide a comprehensive description of the traffic scene. Unlike traditional spatio-temporal data analysis tasks, the dependence on both spatio-temporal and visual-textual data introduces distinct challenges to the TSU task. However, recent research often treats TSU as a common image understanding task, ignoring the spatio-temporal information and overlooking the interrelations between different aspects of the traffic scene. To address these issues, we propose a novel Spatio-Temporal Enhanced Model based on CLIP (ST-CLIP) for TSU. Our model uses the classic vision-language model, CLIP, as the backbone, and designs a Spatio-temporal Context Aware Multi-aspect Prompt (SCAMP) learning method to incorporate spatio-temporal information into TSU. The prompt learning method consists of two components: a dynamic spatio-temporal context representation module that extracts representation vectors of spatio-temporal data for each traffic scene image, and a bi-level ST-aware multi-aspect prompt learning module that integrates the ST-context representation vectors into the word embeddings of prompts for the CLIP model. The second module also extracts low-level visual features and image-wise high-level semantic features to exploit interactive relations among different aspects of traffic scenes. To the best of our knowledge, this is the first attempt to integrate spatio-temporal information into vision-language models to facilitate the TSU task. Experiments on two real-world datasets demonstrate superior performance in complex scene understanding scenarios with a few-shot learning strategy.
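The idea of integrating ST-context representation vectors into prompt word embeddings can be illustrated with a minimal sketch. It is not the SCAMP implementation: it assumes a simple additive conditioning in which a linearly projected ST-context vector is added to every learnable prompt token, and it replaces the CLIP text encoder with mean pooling. All names (`project`, `condition_prompt`, `mean_pool`, `classify`) and all numeric values are hypothetical.

```python
import math

def project(vec, weight):
    """Linear map; `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def condition_prompt(prompt_tokens, st_context, weight):
    """Add the projected ST-context vector to every prompt token embedding.

    Assumption: additive conditioning. The paper's module may combine
    the two differently (for example, through attention).
    """
    shift = project(st_context, weight)
    return [[t + s for t, s in zip(token, shift)] for token in prompt_tokens]

def mean_pool(tokens):
    """Stand-in for the CLIP text encoder: average the token embeddings."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify(image_feature, class_prompts):
    """Pick the label whose pooled prompt is most similar to the image feature."""
    scores = {
        label: cosine(image_feature, mean_pool(tokens))
        for label, tokens in class_prompts.items()
    }
    return max(scores, key=scores.get), scores

# Toy example: 2-d embeddings, 3-d ST-context vector (all values made up).
weight = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0]]   # maps ST-context (3-d) to embedding space (2-d)
st_context = [1.0, 0.0, 0.0]                  # hypothetical ST-context vector for one image
base_prompts = {
    "urban": [[1.0, 0.1], [0.8, 0.0]],
    "rural": [[0.0, 1.0], [0.1, 0.9]],
}
conditioned = {
    label: condition_prompt(tokens, st_context, weight)
    for label, tokens in base_prompts.items()
}
label, scores = classify([1.0, 0.0], conditioned)
```

In this toy setup the same class prompts yield different text features for different images, because each image contributes its own ST-context vector. That is the property the abstract attributes to the ST-aware prompts.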
