Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving

Ali Keysan; Andreas Look; Eitan Kosman; Gonca Gürsun; Jörg Wagner; Yu Yao; Barbara Rakitsch

TL;DR

Problem: trajectory prediction in autonomous driving requires rich scene representations. Approach: integrate pre-trained language encoders with text-based scene prompts (including Bézier lane encodings) alongside rasterized images. Findings: text encoders produce meaningful scene representations and a joint image-text encoder yields the strongest performance on nuScenes, though not yet state-of-the-art. Significance: demonstrates a viable path toward more interpretable and expressive prediction models and motivates further exploration of LM-based scene encoders in autonomous driving.

Abstract

In autonomous driving tasks, scene understanding is the first step towards predicting the future behavior of the surrounding traffic participants. Yet how to represent a given scene and how to extract its features remain open research questions. In this study, we propose a novel text-based representation of traffic scenes and process it with a pre-trained language encoder. First, we show that text-based representations, combined with classical rasterized image representations, lead to descriptive scene embeddings. Second, we benchmark our predictions on the nuScenes dataset and show significant improvements compared to baselines. Third, we show in an ablation study that a joint encoder of text and rasterized images outperforms the individual encoders, confirming that the two representations have complementary strengths.
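To make the joint-encoder idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the image and text embeddings have already been computed by their respective pre-trained encoders, and it replaces the decoder with a single hand-weighted linear layer that scores a small, fixed trajectory set. All function names, weights, and trajectories are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]  # (x, y) waypoints in metres


def concat_embeddings(image_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Fuse the two modalities by simple concatenation."""
    return list(image_emb) + list(text_emb)


def score_trajectories(joint_emb: Sequence[float],
                       weights: Sequence[Sequence[float]],
                       biases: Sequence[float]) -> List[float]:
    """One linear layer: logit k is the dot product of row k with the joint embedding, plus bias k."""
    return [sum(w * x for w, x in zip(row, joint_emb)) + b
            for row, b in zip(weights, biases)]


def pick_trajectory(logits: Sequence[float],
                    trajectory_set: Sequence[Trajectory]) -> Tuple[int, Trajectory]:
    """Select the candidate with the highest logit (classification over a fixed set)."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return best, trajectory_set[best]


# Toy inputs (hypothetical): 2-d image embedding, 2-d text embedding.
image_emb = [1.0, 0.0]
text_emb = [0.0, 2.0]
joint = concat_embeddings(image_emb, text_emb)

# Hypothetical pre-generated trajectory set with three candidates.
trajectory_set: List[Trajectory] = [
    [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)],   # straight
    [(0.0, 0.0), (4.0, 1.0), (7.0, 4.0)],    # left turn
    [(0.0, 0.0), (4.0, -1.0), (7.0, -4.0)],  # right turn
]

# Hand-picked weights, one row per candidate; a trained decoder would learn these.
weights = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0, 0.5],
]
biases = [0.0, 0.0, 0.0]

logits = score_trajectories(joint, weights, biases)
best_index, best_trajectory = pick_trajectory(logits, trajectory_set)
```

Using only one modality in this sketch amounts to passing a single embedding to the scoring layer, which is how an ablation against the individual encoders could be set up.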

Paper Structure (17 sections, 2 figures, 1 table)

This paper contains 17 sections, 2 figures, 1 table.

INTRODUCTION
Background
CoverNet
Model Architecture
Image Encoder
Text Encoder
Joint Encoder
Scene Representation
Image Representation
Text Representation
Experiments
nuScenes Dataset
Fine-Tuning
Evaluation Metrics
Empirical Analysis
...and 2 more sections

Figures (2)

Figure 1: Flow of our Model. We encode the image that represents the rasterized scene and the text prompt with pre-trained models dedicated to each modality. If both input sources are used, we then concatenate their embeddings. The result is fed into a decoder whose final layer picks the target trajectory from a pre-generated trajectory set.
Figure 2: Example Prompt. Our prompt contains information about the agent state, its history and lane information. We use a compact lane encoding with the help of Bézier curves.
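As an illustration of the compact lane encoding mentioned in the Figure 2 caption, here is a minimal sketch that evaluates a Bézier curve from its control points and writes agent state, history, and lanes into a text prompt. The prompt wording, field order, number formatting, and curve degree are assumptions made for this example; the exact format used in the paper is not reproduced here. The function names `bezier_point` and `build_prompt` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def bezier_point(ctrl: Sequence[Point], t: float) -> Point:
    """Evaluate a Bézier curve of any degree at t in [0, 1] with De Casteljau's algorithm."""
    pts = [(float(x), float(y)) for x, y in ctrl]
    while len(pts) > 1:
        # Linearly interpolate between each pair of neighbouring points.
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]


def format_points(points: Sequence[Point]) -> str:
    """Render points as '(x, y)' pairs with one decimal place."""
    return " ".join(f"({x:.1f}, {y:.1f})" for x, y in points)


def build_prompt(agent_type: str, speed_mps: float,
                 history: Sequence[Point],
                 lanes: Sequence[Sequence[Point]]) -> str:
    """Assemble a hypothetical scene prompt: agent state, past positions, lane control points."""
    lines = [f"Agent: {agent_type}, speed {speed_mps:.1f} m/s."]
    lines.append(f"History: {format_points(history)}.")
    for i, ctrl in enumerate(lanes, start=1):
        lines.append(f"Lane {i} Bezier control points: {format_points(ctrl)}.")
    return "\n".join(lines)


# A straight lane described by four control points instead of many sampled waypoints.
lane: List[Point] = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
midpoint = bezier_point(lane, 0.5)

prompt = build_prompt(
    agent_type="vehicle",
    speed_mps=8.3,
    history=[(-4.0, 0.0), (-2.0, 0.0), (0.0, 0.0)],
    lanes=[lane],
)
```

The appeal of such an encoding is that a handful of control points stands in for a long list of lane waypoints, which keeps the prompt short for the language encoder.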
