Hierarchical Light Transformer Ensembles for Multimodal Trajectory Forecasting

Adrien Lafage, Mathieu Barbier, Gianni Franchi, David Filliat

TL;DR

The paper tackles multimodal trajectory forecasting for safety-critical systems by introducing Hierarchical Light Transformer Ensembles (HLT-Ens), which combine a hierarchical density representation with an efficient, grouped-transformer ensembling framework. A novel Hierarchical Winner-Takes-All (HWTA) loss trains a two-level mixture model consisting of meta-modes and sub-modes, paired with Grouped Fully-Connected (GFC) and Grouped Multi-head Attention (GMHA) to realize lightweight, diverse subnetworks. The approach yields meta-mode–level predictions that are robust and size-efficient, enabling fast compression of the prediction set while preserving coverage of the multimodal distribution. Experiments on Argoverse 1 and Interaction demonstrate state-of-the-art results with substantially lower computational cost than traditional deep ensembles, highlighting practical potential for real-time, uncertainty-aware trajectory forecasting. The work advances the design of scalable, interpretable multimodal forecasts by integrating hierarchical density modeling with transformer-based ensembling.

Abstract

Accurate trajectory forecasting is crucial for the performance of various systems, such as advanced driver-assistance systems and self-driving vehicles. These forecasts allow us to anticipate events that lead to collisions and, therefore, to mitigate them. Deep Neural Networks have excelled in motion forecasting, but overconfidence and weak uncertainty quantification persist. Deep Ensembles address these concerns, yet applying them to multimodal distributions remains challenging. In this paper, we propose a novel approach named Hierarchical Light Transformer Ensembles (HLT-Ens) aimed at efficiently training an ensemble of Transformer architectures using a novel hierarchical loss function. HLT-Ens leverages grouped fully connected layers, inspired by grouped convolution techniques, to capture multimodal distributions effectively. We demonstrate that HLT-Ens achieves state-of-the-art performance levels through extensive experimentation, offering a promising avenue for improving trajectory forecasting techniques.

Paper Structure (33 sections, 18 equations, 9 figures, 9 tables)

This paper contains 33 sections, 18 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Illustration of a hierarchical mixture (right) compared to a classical one (left) on a sample from the Argoverse 1 dataset. Ground truth is in dark blue in both panels, forecasts are colored dashed lines, and grey dashed lines are the centerlines of road lanes. In the right panel, we display the meta-modes (3 solid lines ending with a cross) inferred from the predictions, each with its associated predictions (3 of the same color). The hierarchical structure notably enables efficient compression of the prediction set by keeping only the meta-modes.
  • Figure 2: Block diagonal weight matrix for a Grouped Fully-Connected layer with $G=2$ groups.
  • Figure 3: Grouped Multi-head Attention layer operations for $H=2$ and $G=3$. Step 1 computes the query, key, and value features for each head using grouped fully-connected layers to ensure independence between all three groups. Step 2 depicts the attention mechanism for each group in each head. Step 3 illustrates the projection of the concatenated heads using a grouped fully-connected layer.
  • Figure 4: Examples of forecasts of an AutoBots model trained using the HWTA loss on the Argoverse 1 dataset. It has three meta-modes with two sub-modes each. The observed trajectory of the agent of interest is depicted with a dark blue solid line, while the grey dashed lines are the centerlines of road lanes. We represent the model's forecasts in cyan, magenta, and yellow. Solid lines are meta-modes and dashed lines are sub-modes.
  • Figure 5: Temporal 2D distributions
  • ...and 4 more figures
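
The block-diagonal structure described in the Figure 2 caption can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`dense`, `grouped_dense`, `block_diagonal`), the omission of a bias term, the equal-sized groups, and the weight values are all assumptions made for illustration. The sketch only checks that applying one small weight matrix per group gives the same output as a single fully-connected layer whose weight matrix is zero outside its diagonal blocks, here with $G=2$.

```python
# Illustrative sketch only: hypothetical helper names, no bias term,
# equal-sized groups, arbitrary weight values.

def dense(x, weight):
    """Plain fully-connected layer without bias: y[j] = sum_i x[i] * weight[i][j]."""
    n_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(n_out)]


def grouped_dense(x, group_weights):
    """Split x into equal chunks and apply one weight matrix per chunk."""
    n_groups = len(group_weights)
    chunk = len(x) // n_groups
    out = []
    for g, w in enumerate(group_weights):
        out.extend(dense(x[g * chunk:(g + 1) * chunk], w))
    return out


def block_diagonal(group_weights):
    """Assemble the equivalent full weight matrix, zero outside the diagonal blocks."""
    n_in = sum(len(w) for w in group_weights)
    n_out = sum(len(w[0]) for w in group_weights)
    full = [[0.0] * n_out for _ in range(n_in)]
    row, col = 0, 0
    for w in group_weights:
        for i, w_row in enumerate(w):
            for j, value in enumerate(w_row):
                full[row + i][col + j] = value
        row += len(w)
        col += len(w[0])
    return full


# G = 2 groups, each mapping 2 inputs to 2 outputs.
w_a = [[1.0, 2.0], [3.0, 4.0]]
w_b = [[5.0, 6.0], [7.0, 8.0]]
x = [1.0, 0.0, 0.0, 1.0]

y_grouped = grouped_dense(x, [w_a, w_b])
y_full = dense(x, block_diagonal([w_a, w_b]))
print(y_grouped == y_full)
```

In this toy setting the grouped layer stores 8 weights where an unconstrained 4-by-4 layer would store 16, and each output depends only on the inputs of its own group, which is the independence between groups that the Figure 3 caption relies on.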