
VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Seokha Moon, Hyun Woo, Hongbeen Park, Haeji Jung, Reza Mahjourian, Hyung-gun Chi, Hyerin Lim, Sangpil Kim, Jinkyu Kim

TL;DR

VisionTrap tackles trajectory prediction by augmenting the traditional inputs (agent tracks and HD maps) with surround-view visual semantics and training-time textual guidance. It combines four modules (Per-agent State Encoder, Visual Semantic Encoder, Text-driven Guidance Module, and Trajectory Decoder) with a Transformation Module that standardizes agent orientation, and it runs at a latency of about 53 ms, which makes real-time inference feasible. The approach is validated on nuScenes, aided by the newly released nuScenes-Text dataset, whose descriptions are generated by a VLM and refined by an LLM. Both the visual inputs and the textual supervision are shown to improve standard trajectory prediction metrics, and qualitative analysis illustrates how the model exploits them. The nuScenes-Text dataset further supports future multimodal trajectory research by providing textual annotations for every scene.

Abstract

Predicting future trajectories for other road agents is an essential task for autonomous vehicles. Established trajectory prediction methods primarily use agent tracks generated by a detection and tracking system and HD maps as inputs. In this work, we propose a novel method that also incorporates visual input from surround-view cameras, allowing the model to utilize visual cues such as human gazes and gestures, road conditions, vehicle turn signals, etc., which are typically hidden from the model in prior methods. Furthermore, we use textual descriptions generated by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM) as supervision during training to guide the model on what to learn from the input data. Despite using these extra inputs, our method achieves a latency of 53 ms, making it feasible for real-time processing, which is significantly faster than that of previous single-agent prediction methods with similar performance. Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance, and our qualitative analysis highlights how the model is able to exploit these additional inputs. Lastly, in this work we create and release the nuScenes-Text dataset, which augments the established nuScenes dataset with rich textual annotations for every scene, demonstrating the positive impact of utilizing a VLM on trajectory prediction. Our project page is at https://moonseokha.github.io/VisionTrap/

Paper Structure

This paper contains 14 sections, 9 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: Existing approaches are often conditioned only on agents' past trajectories and HD maps to predict future trajectories. Here, we want to explore leveraging camera images and textual descriptions obtained from images to better learn the agent's behavioral context and agent-environment interactions by incorporating high-level semantic information into the prediction process, such as "a pedestrian is carrying stacked items, and is expected to remain stationary."
  • Figure 2: An overview of VisionTrap, which consists of four main steps: (i) Per-agent State Embedding, which produces per-agent context features given agents' state observations; (ii) Visual Semantic Encoder, which transforms multi-view images with an HD map into a unified BEV feature, updating agents' state embeddings via a deformable attention layer; (iii) Text-driven Guidance Module, which supervises the model to reason about detailed visual semantics; and (iv) Trajectory Decoder, which predicts the agents' future poses over a fixed time horizon.
  • Figure 3: An overview of our Text-driven Guidance Module. We extract word-level embeddings using pretrained BERT as a text encoder, and then we use an attention module to aggregate these per-word embeddings into a composite sentence-level embedding. Based on the cosine similarity between these embeddings, we apply a contrastive learning loss to ground textual descriptions into the agent's state embedding.
  • Figure 4: An overview of the Transformation Module, which standardizes agents' orientation.
  • Figure 5: Creating the nuScenes-Text dataset involves three main steps: (i) a fine-tuning stage using the DRAMA dataset, (ii) an image-to-text generation stage that applies the fine-tuned VLM to the nuScenes dataset, and (iii) a text refinement process that uses ground-truth information (e.g., GT class, maneuvering) along with the generated text and GPT. Red text indicates content that needs to be filtered out, while cyan text indicates additional content related to the agent's intention.
  • ...and 16 more figures
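
The Text-driven Guidance Module described in the Figure 3 caption (attention pooling of per-word embeddings, then a contrastive loss over cosine similarities) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the fixed `query` vector (which would be learned in practice), the `temperature` value, and the InfoNCE-style form of the loss are all assumptions made for illustration, and the toy 2-D vectors stand in for BERT word embeddings and agent state embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(word_embs, query):
    """Aggregate per-word embeddings into one sentence-level embedding.

    Weights are a softmax over scaled dot products with `query`
    (hypothetical: a learned parameter in a real model).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(w, query) / scale for w in word_embs])
    dim = len(word_embs[0])
    return [sum(wt * w[d] for wt, w in zip(weights, word_embs)) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def contrastive_loss(agent_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss (an assumed form): agent i should match text i,
    and the other texts in the batch act as negatives."""
    loss = 0.0
    for i, agent in enumerate(agent_embs):
        logits = [cosine(agent, t) / temperature for t in text_embs]
        loss += -math.log(softmax(logits)[i])
    return loss / len(agent_embs)

# Toy usage: two "words" pooled into a sentence embedding.
words = [[1.0, 0.0], [0.0, 1.0]]
sentence = attention_pool(words, query=[1.0, 0.0])

# Toy usage: matched agent/text pairs give a lower loss than mismatched pairs.
agents = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]
aligned = contrastive_loss(agents, texts)
shuffled = contrastive_loss(agents, texts[::-1])
```

In this sketch the loss is lower when each agent embedding is paired with its own description than when the descriptions are swapped, which is the behavior a contrastive grounding objective is meant to reward. Since the abstract says the textual descriptions are used as supervision during training, a loss of this kind would only be computed at training time.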