VT-Former: An Exploratory Study on Vehicle Trajectory Prediction for Highway Surveillance through Graph Isomorphism and Transformer

Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi

TL;DR

VT-Former is a surveillance vehicle trajectory prediction (SVTP) framework that pairs Graph Attentive Tokenization (GAT) with a decoder-only Transformer to predict highway vehicle trajectories from surveillance data. The GAT module encodes social interactions through a fully connected graph and relative motion, and the Transformer predictor autoregressively generates future coordinates over the prediction horizon $T_{PH}$. On the NGSIM and CHD datasets, VT-Former achieves state-of-the-art (SotA) or competitive results, performing best with higher frame rates and shorter observation horizons, and the experiments reveal trade-offs between graph complexity and predictive accuracy. The work shows that graph-based tokenization can be combined with transformer forecasting for fast trajectory prediction in highway surveillance, and it motivates further exploration of dynamic graphs and higher-frequency data for real-time safety applications.

Abstract

Enhancing roadway safety has become an essential computer vision focus area for Intelligent Transportation Systems (ITS). As a part of ITS, Vehicle Trajectory Prediction (VTP) aims to forecast a vehicle's future positions based on its past and current movements. VTP is a pivotal element for road safety, aiding applications such as traffic management, accident prevention, work-zone safety, and energy optimization. Most work in this field focuses on autonomous driving, but the growing number of surveillance cameras has given rise to a separate sub-field, surveillance VTP, with its own set of challenges. In this paper, we introduce VT-Former, a novel transformer-based VTP approach for highway safety and surveillance. In addition to using transformers to capture long-range temporal patterns, we propose a new Graph Attentive Tokenization (GAT) module to capture intricate social interactions among vehicles. This study explores both the advantages and the limitations of combining the transformer architecture with graphs for VTP. Our investigation, conducted across three benchmark datasets from diverse surveillance viewpoints, shows that VT-Former achieves State-of-the-Art (SotA) or comparable performance in predicting vehicle trajectories. This study underscores the potential of VT-Former and its architecture, opening new avenues for future research and exploration.

Paper Structure (19 sections, 10 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: A conceptual overview of the VT-Former. The input sequence, characterized by an observation horizon denoted as $T_{OH}$, undergoes tokenization via a novel graph-based tokenizer. The resultant tokens are subsequently processed through a transformer-based predictive model to forecast future trajectories.
  • Figure 2: VT-Former at a glance: our approach begins with the application of Graph Attentive Tokenization (GAT) to create enriched token sequences that encapsulate social interactions. Subsequently, these enhanced sequences are fed through the Transformer Prediction module, which consists of 8 layers of transformer decoders, each equipped with 4 heads, generating predicted trajectories. VT-Former is an autoregressive sequence-to-sequence model that generates the output in multiple steps. However, for easier visualization, the output tokens are shown in the last step where they are completely generated. $T_{OH}$, $TS_i$, $T_{PH}$, and $PC_i$ denote the observation horizon time, token sequence of the $i^{th}$ vehicle, prediction horizon time, and the predicted trajectory, respectively.
  • Figure 3: Graph Attentive Tokenization (GAT). GAT takes the past trajectory ($C_i$) of each vehicle ($V_i$) over the observation horizon ($T_{OH}$) and calculates the relative movement ($\Delta C_i$). These two components are concatenated and passed through a fully connected layer that expands the feature maps by a factor of 2. A Graph Network then captures the interactions among vehicles within the scene. The output of this Graph Network is concatenated with the relative movement again, accentuating the temporal evolution within the sequence. Finally, temporal encoding is applied to infuse temporal order into the token sequences.
  • Figure 4: Qualitative performance of VT-Former on CHD, featuring a scene with a merging lane from the left and a mild right curve. The angle between predicted, observed, and actual trajectories in samples A and B demonstrates VT-Former's ability to predict the correct path with respect to road geometry. However, it struggles with the high acceleration of vehicle 2 in sample B. Samples are cropped for clarity.
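
The tokenization pipeline described in the Figure 3 caption can be followed at the level of array shapes. The block below is a minimal NumPy sketch of that data flow, not the authors' implementation. Several parts are assumptions made only for illustration: the weights are random placeholders rather than trained parameters, the graph step is a GIN-style sum over a fully connected graph (suggested by the title and the TL;DR, but the actual layer may differ), the temporal encoding is taken to be sinusoidal, and the function names `relative_movement`, `sinusoidal_encoding`, and `tokenize` are hypothetical.

```python
import numpy as np


def relative_movement(traj):
    """Frame-to-frame displacement (Delta C_i); the first frame gets zeros."""
    delta = np.zeros_like(traj)
    delta[:, 1:] = traj[:, 1:] - traj[:, :-1]
    return delta


def sinusoidal_encoding(length, dim):
    """Standard sinusoidal position table (assumed form of 'temporal encoding')."""
    pos = np.arange(length)[:, None]
    idx = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (idx // 2)) / dim)
    return np.where(idx % 2 == 0, np.sin(angle), np.cos(angle))


def tokenize(traj, rng, eps=0.0):
    """Shape-level sketch of the tokenizer; all weights are random placeholders."""
    traj = np.asarray(traj, dtype=float)
    n, t, c = traj.shape                          # vehicles, T_OH frames, (x, y)
    delta = relative_movement(traj)

    # Concatenate C_i with Delta C_i, then a fully connected layer that
    # doubles the feature width (2c + 2c = 4 features -> 8 features for c = 2).
    x = np.concatenate([traj, delta], axis=-1)    # (n, t, 2c)
    w_fc = rng.standard_normal((2 * c, 4 * c)) * 0.1
    h = x @ w_fc                                  # (n, t, 4c)

    # Assumed GIN-style update on a fully connected graph, per time step:
    # (1 + eps) * own features + sum of all other vehicles' features.
    neighbours = h.sum(axis=0, keepdims=True) - h
    w_g = rng.standard_normal((4 * c, 4 * c)) * 0.1
    g = np.maximum(((1.0 + eps) * h + neighbours) @ w_g, 0.0)   # ReLU

    # Concatenate the relative movement again, then add temporal encoding.
    tokens = np.concatenate([g, delta], axis=-1)  # (n, t, 5c)
    return tokens + sinusoidal_encoding(t, tokens.shape[-1])


# Three vehicles moving at constant velocity over five observed frames.
rng = np.random.default_rng(0)
steps = np.arange(5, dtype=float)[None, :, None]
velocity = np.array([[1.0, 0.5], [2.0, 0.0], [0.5, -1.0]])[:, None, :]
traj = steps * velocity                           # (3, 5, 2)
tokens = tokenize(traj, rng)
print(tokens.shape)                               # (3, 5, 10)
```

Each vehicle ends up with one token per observed frame, so the token sequence $TS_i$ has length $T_{OH}$ and can be handed to a sequence model. Because the graph in this sketch is fully connected, the neighbour sum is computed as the scene total minus the vehicle's own features, which avoids building an explicit adjacency matrix.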