Exploring Transformer-Augmented LSTM for Temporal and Spatial Feature Learning in Trajectory Prediction

Chandra Raskoti, Weizi Li

TL;DR

The paper tackles autonomous vehicle trajectory prediction by proposing a Transformer-augmented LSTM framework that learns temporal and spatial features in a unified pipeline. It processes target and neighbor trajectories with LSTM encoders, applies Transformer attention to both temporal and spatial contexts, and uses a masked-scatter grid to fuse neighbor information, followed by an LSTM-based decoder that predicts $5$ future steps. Benchmarked against STA-LSTM, SA-LSTM, CS-LSTM, and NaiveLSTM on a $3 \times 13$ grid with history $T=15$ and horizon $T'=5$, the Transformer-enhanced model does not outperform STA-LSTM, though the study demonstrates feasibility and identifies directions for architectural improvements. The work points toward more interpretable and robust trajectory prediction systems and proposes future integration with planning/control pipelines and larger-scale traffic simulations to use Transformer-based attention more effectively.

Abstract

Accurate vehicle trajectory prediction is crucial for ensuring safe and efficient autonomous driving. This work explores integrating a Transformer-based model with Long Short-Term Memory (LSTM) based techniques to enhance spatial and temporal feature learning in vehicle trajectory prediction. We propose a hybrid model that combines LSTMs for temporal encoding with a Transformer encoder for capturing complex interactions between vehicles. Spatial trajectory features of the neighboring vehicles are processed and passed through a masked scatter mechanism in a grid-based environment, then combined with the temporal trajectory features of the vehicles. The combined trajectory data is learned by sequential LSTM encoding and Transformer-based attention layers. The proposed model is benchmarked against predecessor LSTM-based methods, including STA-LSTM, SA-LSTM, CS-LSTM, and NaiveLSTM. Although our model does not outperform its predecessor, the results demonstrate the potential of integrating Transformers with LSTM-based techniques to build interpretable trajectory prediction models. Future work will explore alternative Transformer-based architectures to further enhance performance. This study provides a promising direction for improving trajectory prediction models by leveraging Transformer-based architectures, paving the way for more robust and interpretable vehicle trajectory prediction systems.
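
The masked scatter step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `masked_scatter`, the encoding size `ENC_DIM`, and the rule that neighbors are listed in row-major cell order are all assumptions made for illustration. Only the $3 \times 13$ grid size comes from the summary.

```python
import numpy as np

GRID_ROWS, GRID_COLS = 3, 13  # grid size reported in the summary
ENC_DIM = 4                   # hypothetical encoding size, for illustration only


def masked_scatter(neighbor_enc, occupied_cells):
    """Place each neighbor encoding into its grid cell; empty cells stay zero.

    neighbor_enc:   (num_neighbors, ENC_DIM) array of per-neighbor encodings.
    occupied_cells: list of unique (row, col) cells, one per neighbor,
                    sorted in row-major order (an assumption of this sketch).
    """
    if list(occupied_cells) != sorted(set(occupied_cells)):
        raise ValueError("cells must be unique and sorted in row-major order")

    # Presence mask: True for every feature slot of an occupied cell.
    mask = np.zeros((GRID_ROWS, GRID_COLS, ENC_DIM), dtype=bool)
    for row, col in occupied_cells:
        mask[row, col, :] = True

    grid = np.zeros((GRID_ROWS, GRID_COLS, ENC_DIM), dtype=neighbor_enc.dtype)
    # Boolean-mask assignment fills the True slots in row-major order,
    # so the i-th neighbor lands in the i-th occupied cell.
    grid[mask] = neighbor_enc.reshape(-1)
    return grid, mask[..., 0]


# Two hypothetical neighbors, placed at cells (0, 5) and (1, 7).
enc = np.arange(8, dtype=np.float32).reshape(2, ENC_DIM)
grid, presence = masked_scatter(enc, [(0, 5), (1, 7)])
print(grid.shape, int(presence.sum()))
```

Keeping unoccupied cells at zero preserves the spatial layout around the target vehicle, so downstream layers can tell where a neighbor is as well as what its encoding is.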

Paper Structure

This paper contains 13 sections, 7 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Architecture diagram illustrating the end-to-end encoding process of spatial and temporal features in our proposed method. The historical trajectories of the target vehicle and its neighboring vehicles are processed separately through LSTM and Transformer encoders to extract temporal and spatial features. Spatial features are further refined using a masked scatter mechanism based on vehicle presence. The resulting spatial and temporal embeddings are concatenated into a unified feature representation. This combined encoding is then passed through an LSTM-based decoder to predict the future $\hat{x}, \hat{y}$ coordinates of the target vehicle over prediction time steps 1 through 5.
  • Figure 2: RMSE comparison of the two experiments performed: the original STA-LSTM method and our Transformer-enhanced STA-LSTM. As expected, the RMSE increases for both models as prediction steps progress from 1 to 5.
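
The per-step RMSE compared in Figure 2 can be sketched as follows. This summary does not give the paper's exact formula, so the sketch assumes the common convention of taking the root of the mean squared Euclidean displacement at each prediction step. The function name `rmse_per_step` and the sample coordinates are hypothetical.

```python
import math


def rmse_per_step(predicted, actual):
    """Return one RMSE value per prediction step.

    predicted, actual: lists of trajectories; each trajectory is a list of
    (x, y) points, one per future step. Both must have the same shape.
    """
    num_steps = len(actual[0])
    result = []
    for step in range(num_steps):
        # Squared Euclidean displacement of every sample at this step.
        squared_errors = [
            (p[step][0] - a[step][0]) ** 2 + (p[step][1] - a[step][1]) ** 2
            for p, a in zip(predicted, actual)
        ]
        result.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return result


# One hypothetical sample with two steps: exact at step 1, off by (3, 4) at step 2.
print(rmse_per_step([[(0, 0), (3, 4)]], [[(0, 0), (0, 0)]]))  # → [0.0, 5.0]
```

Reporting one value per step rather than a single average makes the growth of error over the horizon visible, which is the trend the figure caption describes.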