Conditional Unscented Autoencoders for Trajectory Prediction

Faris Janjoš; Marcel Hallgarten; Anthony Knittel; Maxim Dolgov; Andreas Zell; J. Marius Zöllner

TL;DR

This work identifies key limitations of the conditional variational autoencoder (CVAE) for probabilistic trajectory prediction, notably the lack of likelihood evaluation, poor modeling of multi-modal futures, and high sampling variance. It introduces three changes: unscented latent-space sampling (CUAE), a Gaussian mixture latent space (GMM-CUAE), and a conditional ex-post (CXP) inference strategy, which together offer deterministic, more expressive alternatives to random sampling. On the INTERACTION trajectory prediction dataset the proposed models outperform the state of the art, and on CelebA image modeling they outperform the vanilla CVAE baseline. The GMM latent space and unscented sampling deliver notable gains in trajectory quality and diversity, and CXP-based inference enables more faithful conditioning on context. The result is a deterministic sampling scheme suited to safety-relevant prediction in autonomous driving, with applicability to conditional generative modeling more broadly.

Abstract

The conditional variational autoencoder (CVAE) is one of the most widely used models in trajectory prediction for autonomous driving (AD). It encodes the interplay between a driving context and its ground-truth future in a probabilistic latent space and uses that latent space to produce predictions. In this paper, we challenge key components of the CVAE. We leverage recent advances in the variational autoencoder (VAE), the foundation of the CVAE, which show that a simple change in the sampling procedure can greatly benefit performance. We find that unscented sampling, which draws samples from any learned distribution in a deterministic manner, can be better suited to trajectory prediction than potentially dangerous random sampling. We go further and offer additional improvements, including a more structured Gaussian mixture latent space and a novel, potentially more expressive way to do inference with CVAEs. We show the wide applicability of our models by evaluating them on the INTERACTION prediction dataset, where they outperform the state of the art, and on the task of image modeling on the CelebA dataset, where they outperform the baseline vanilla CVAE. Code is available at https://github.com/boschresearch/cuae-prediction.
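
To make the idea of deterministic sampling concrete, the sketch below computes the sigma points of a Gaussian with the textbook unscented transform. It is an illustration only, not the code from the repository above: the diagonal covariance, the `kappa` parameter, and the names `sigma_points_diag` and `weighted_moments` are assumptions made for this example.

```python
import math


def sigma_points_diag(mean, var, kappa=0.0):
    """Return the 2n+1 sigma points and weights of a diagonal Gaussian.

    Textbook unscented transform: one point at the mean plus a symmetric
    pair along each axis at distance sqrt((n + kappa) * var_i).
    The diagonal covariance is an assumption of this sketch.
    """
    n = len(mean)
    scale = n + kappa
    if scale <= 0:
        raise ValueError("n + kappa must be positive")
    points = [list(mean)]
    weights = [kappa / scale]
    for i in range(n):
        offset = math.sqrt(scale * var[i])
        for sign in (1.0, -1.0):
            point = list(mean)
            point[i] += sign * offset
            points.append(point)
            weights.append(1.0 / (2.0 * scale))
    return points, weights


def weighted_moments(points, weights):
    """Recover the per-dimension mean and variance from weighted points."""
    n = len(points[0])
    mean = [sum(w * p[i] for p, w in zip(points, weights)) for i in range(n)]
    var = [
        sum(w * (p[i] - mean[i]) ** 2 for p, w in zip(points, weights))
        for i in range(n)
    ]
    return mean, var


# Hypothetical 2-D latent distribution, chosen only for illustration.
mu = [0.5, -1.0]
var = [0.25, 4.0]
pts, wts = sigma_points_diag(mu, var, kappa=1.0)
rec_mean, rec_var = weighted_moments(pts, wts)
```

Unlike random draws, the same distribution always yields the same 2n+1 points, and their weighted mean and variance reproduce the moments of the distribution they were computed from.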

Paper Structure (16 sections, 9 equations, 7 figures, 3 tables)

Introduction
Related Work
Method
Latent Space Sampling and Transformation
CVAE Background
Unscented Transform of the Latent Space
Latent Space Structure and Inference Strategy
GMM Latent Space
Conditional Ex-Post (CXP) Estimation
Output Trajectory Generation
Results
Implementation
Datasets and Training Setup
Image Modeling Performance
Trajectory Prediction Performance
...and 1 more section

Figures (7)

Figure 1: Assume a trajectory predictor learned a multi-modal distribution (yellow), either by propagating its latent space or directly in the output space. Random sampling (black) can bring unsafe, unlikely, or in-between-mode outputs. In contrast, the unscented sampling (red), realized by computing sigma points of the distribution, brings structure to the learned stochasticity.
Figure 2: CVAE: in training, the model captures the joint distribution of the ground-truth trajectory and driving context via the encoder network $\phi$, samples randomly, and reconstructs trajectories via the decoder network $\theta$. In inference, the prior network $\gamma$ replaces $\phi$ and is sampled instead.
Figure 3: CUAE: instead of sampling the latent space randomly (in both training and inference), the model analytically computes sigma points of the $\phi$ and $\gamma$ distributions and transforms them instead.
Figure 4: GMM-CUAE: it structures the latent space into a GMM and separately transforms its components (sigma points shown). Compared to Fig. 3, it has the potential to better model multi-modality.
Figure 5: Illustration of CXP joint mixture construction and conditional sampling. Top: all posterior and prior sigma points in training are concatenated, collected, and used to fit a mixture. Bottom: given a new example's prior encoding, the mixture is conditioned (intuitively, it is "cut"). The resulting lower-dimensional mixture is sampled as input for the decoder.
...and 2 more figures
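
The "cut" described in the Figure 5 caption corresponds to conditioning a Gaussian mixture on part of its variables. The sketch below shows the standard closed-form result for a toy joint mixture over two scalars `(a, b)`, conditioned on an observed `b`. It is a hedged illustration, not the authors' CXP implementation: the scalar blocks, the component tuple layout, the function name `condition_gmm_2d`, and all numbers are assumptions for this example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def condition_gmm_2d(components, b_obs):
    """Condition a 2-D Gaussian mixture over (a, b) on b = b_obs.

    Each component is (weight, mean_a, mean_b, var_a, var_b, cov_ab).
    Returns a list of (weight, mean, var) describing the 1-D mixture p(a | b).
    """
    conditioned = []
    for w, mu_a, mu_b, var_a, var_b, cov_ab in components:
        gain = cov_ab / var_b
        # Standard Gaussian conditioning for one component.
        cond_mean = mu_a + gain * (b_obs - mu_b)
        cond_var = var_a - gain * cov_ab
        # Unnormalized new weight: old weight times the marginal density of b.
        resp = w * normal_pdf(b_obs, mu_b, var_b)
        conditioned.append((resp, cond_mean, cond_var))
    total = sum(r for r, _, _ in conditioned)
    return [(r / total, m, v) for r, m, v in conditioned]


# Hypothetical two-component joint mixture, chosen only for illustration.
joint = [
    (0.5, -2.0, -1.0, 1.0, 1.0, 0.8),
    (0.5, 2.0, 1.0, 1.0, 1.0, 0.8),
]
cond = condition_gmm_2d(joint, b_obs=1.0)
```

In this toy setup, `a` plays the role of the variables passed to the decoder and `b` the role of the new example's prior encoding. Conditioning shifts each component mean toward the observation, shrinks its variance, and moves weight to the components whose `b` marginal best explains the observed value.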
