RedMotion: Motion Prediction via Redundancy Reduction

Royden Wagner; Omer Sahin Tas; Marvin Klemp; Carlos Fernandez; Christoph Stiller

TL;DR

RedMotion addresses motion prediction for self-driving vehicles by learning augmentation-invariant representations of road environments through two redundancy-reduction mechanisms: (i) an internal transformer decoder that uses RED tokens to compress a variable-sized set of local road-environment tokens into a fixed-sized global embedding, and (ii) Road Barlow Twins self-supervision applied to embeddings of augmented views of the same road environment. The architecture combines a trajectory encoder with a road-environment encoder that produces local and global context, and fuses them via efficient cross-attention to generate multiple trajectory proposals per agent. Experiments on Waymo Open Motion and Argoverse 2 show improvements over contrastive, self-distillation, and masked-autoencoding pre-training baselines, and results competitive with HPTR and MTR++ in the Waymo Motion Prediction Challenge. The approach offers a general way to convert variable-length local road context into fixed-size global representations, which supports more data-efficient pre-training for motion prediction. The code is open source.
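The first mechanism, reducing a variable-sized token set to a fixed-sized embedding, can be illustrated with a minimal single-head cross-attention sketch. This is not the authors' implementation: the function name `reduce_to_global`, the dimensions, the random weights, and the use of plain NumPy are all assumptions made for illustration. The only point shown is that a fixed number of query tokens yields an output whose shape does not depend on how many local tokens a scene contains.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reduce_to_global(local_tokens, red_queries, w_k, w_v):
    """Single-head cross-attention: fixed queries attend to a variable-sized set.

    local_tokens: (n_local, d), n_local varies per scene
    red_queries:  (n_red, d), learned in a real model, random here (assumption)
    returns:      (n_red, d), shape independent of n_local
    """
    keys = local_tokens @ w_k
    values = local_tokens @ w_v
    scores = red_queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ values

rng = np.random.default_rng(0)
d, n_red = 16, 4  # hypothetical embedding size and number of query tokens
red_queries = rng.normal(size=(n_red, d))
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

small_scene = rng.normal(size=(37, d))   # scene with few local tokens
large_scene = rng.normal(size=(512, d))  # scene with many local tokens
global_small = reduce_to_global(small_scene, red_queries, w_k, w_v)
global_large = reduce_to_global(large_scene, red_queries, w_k, w_v)
print(global_small.shape, global_large.shape)
```

Because the output is a weighted sum over the token set, it is also invariant to the order of the local tokens, which suits unordered inputs such as lane and agent nodes.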

Abstract

We introduce RedMotion, a transformer model for motion prediction in self-driving vehicles that learns environment representations via redundancy reduction. Our first type of redundancy reduction is induced by an internal transformer decoder and reduces a variable-sized set of local road environment tokens, representing road graphs and agent data, to a fixed-sized global embedding. The second type of redundancy reduction is obtained by self-supervised learning and applies the redundancy reduction principle to embeddings generated from augmented views of road environments. Our experiments reveal that our representation learning approach outperforms PreTraM, Traj-MAE, and GraphDINO in a semi-supervised setting. Moreover, RedMotion achieves competitive results compared to HPTR or MTR++ in the Waymo Motion Prediction Challenge. Our open-source implementation is available at: https://github.com/kit-mrt/future-motion
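The redundancy reduction principle used in the second mechanism can be sketched with the standard Barlow Twins objective: the cross-correlation matrix between the embeddings of two augmented views is pushed towards the identity. The sketch below is a generic NumPy version under stated assumptions, not the paper's Road Barlow Twins code. The function name, the off-diagonal weight `lambda_offdiag=5e-3`, and the toy data are illustrative choices, and how the embeddings are derived from road environments is not covered here.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-8):
    """Generic Barlow Twins loss for two batches of embeddings of shape (n, d).

    z_a and z_b are embeddings of two augmented views of the same n samples.
    lambda_offdiag is an assumed weighting of the redundancy term.
    """
    n = z_a.shape[0]
    # Standardise every embedding dimension over the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n  # (d, d) cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy term
    return on_diag + lambda_offdiag * off_diag

rng = np.random.default_rng(0)
view_a = rng.normal(size=(256, 8))
loss_same = barlow_twins_loss(view_a, view_a)                      # identical views
loss_unrelated = barlow_twins_loss(view_a, rng.normal(size=(256, 8)))  # unrelated views
print(loss_same < loss_unrelated)
```

The diagonal term rewards embeddings that agree across views, while the off-diagonal term penalises embedding dimensions that carry the same information.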

Paper Structure (14 sections, 9 figures, 5 tables)

Introduction
Related work
Method
Redundancy reduction for learning rich representations of road environments
Road environment description and motion prediction model
Experiments
Comparing pre-training methods for motion prediction
Comparing motion prediction models
Contribution and future work
Additional qualitative results
Limitations
Increasing the number of trajectory proposals
Inference time
Challenge results in detail

Figures (9)

Figure 1: RedMotion. Our model consists of two encoders. The trajectory encoder generates an embedding for the past trajectory of the current agent. The road environment encoder generates sets of local and global road environment embeddings as context. We use two redundancy reduction mechanisms, (a) and (b), to learn rich representations of road environments. All embeddings are fused via cross-attention to yield trajectory proposals per agent.
Figure 2: Road environment encoder. The circles in the local road graphs denote the maximum distances for the considered lane network nodes (outer) and agent nodes (inner). $\mathcal{L}_{BT}$ is the Barlow Twins loss, and $L$ is the number of modules.
Figure 3: Road environment description. Local road environments are first represented as lane graphs with agents. Afterwards, we generate token sets as inputs by using embedding tables for semantic types and temporal context. Positions are encoded relative to the current agent, except for RED tokens, which contain learned positional embeddings.
Figure 4: Receptive field of a traffic lane token. It expands in subsequent local attention layers, thereby enabling the token to gather information from related tokens within a larger surrounding area. Consequently, the road environment tokens initially form a disconnected graph and gradually transform into a fully connected graph. Best viewed zoomed in.
Figure 5: Efficient cross-attention for feature fusion. The output sequence contains features from the past trajectory, the local road environment, and global RED tokens.
...and 4 more figures
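The receptive-field growth described in the Figure 4 caption can be illustrated with a toy sketch. It assumes tokens placed on a hypothetical 1D line and a local attention layer in which each token attends only to tokens within a fixed radius. The paper's actual local attention pattern is not specified here, and the function names are illustrative.

```python
def neighbours(positions, radius):
    """Indices each token can attend to in one local attention layer."""
    return [
        {j for j, q in enumerate(positions) if abs(p - q) <= radius}
        for p in positions
    ]

def receptive_field(positions, radius, token, num_layers):
    """Tokens whose information can reach `token` after `num_layers` layers."""
    adj = neighbours(positions, radius)
    field = {token}
    for _ in range(num_layers):
        # One more layer lets information flow in from every neighbour
        # of every token already inside the receptive field.
        field = set().union(*(adj[i] for i in field))
    return field

positions = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical 1D token positions
for layers in range(1, 5):
    field = receptive_field(positions, radius=1, token=4, num_layers=layers)
    print(layers, sorted(field))
# 1 [3, 4, 5]
# 2 [2, 3, 4, 5, 6]
# 3 [1, 2, 3, 4, 5, 6, 7]
# 4 [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

In this toy setting the middle token reaches all nine tokens after four layers, which mirrors the caption's description of a graph that becomes fully connected as layers are stacked.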
