MoST: Multi-modality Scene Tokenization for Motion Prediction

Norman Mu, Jingwei Ji, Zhenpei Yang, Nate Harada, Haotian Tang, Kan Chen, Charles R. Qi, Runzhou Ge, Kratarth Goel, Zoey Yang, Scott Ettinger, Rami Al-Rfou, Dragomir Anguelov, Yin Zhou

TL;DR

MoST introduces a hybrid motion-prediction paradigm that tokenizes multi-modal sensor data into scene elements (ground, agents, open-set objects) encoded by large image foundation models and LiDAR networks. This token-based representation integrates open-world semantic knowledge with geometry and scales to multi-frame observations, enabling transformer-based predictors to outperform state-of-the-art baselines on the augmented WOMD with camera embeddings. The main contributions include releasing WOMD camera embeddings, analyzing multi-modal modeling choices, and demonstrating robust, end-to-end compatible performance improvements across standard metrics and challenging scenarios. The approach offers a scalable, interpretable bridge between perception outputs and behavior modeling with practical impact for real-world autonomous driving systems.

Abstract

Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art.

Paper Structure (29 sections, 2 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Overview of the proposed motion prediction paradigm. It fuses symbolic perception output with our multi-modality scene tokens. While the symbolic representation offers a convenient world abstraction, the multi-modality scene tokens link behavior models directly to sensor observations via token embeddings.
  • Figure 2: Overview of the proposed Multi-modality Scene Tokenization. Our method takes as input multi-view camera images and a full-scene point cloud. We leverage a pre-trained image foundation model to obtain descriptive feature maps and decompose the scene into disjoint elements via clustering. Using the sensor calibration between camera and LiDAR, we obtain point-wise image features. From the scene decomposition, we assign each point a token/cluster ID and derive box information for each element. Finally, we extract one feature embedding for each scene element.
  • Figure 3: Visualization of scene decomposition. We decompose a scene into agent elements, open-set elements and ground elements. We also visualize the perception bounding boxes for agents.
  • Figure 4: Scene element feature extraction. The scene-element feature is derived from a spatial-temporal module that fuses the image feature, the geometry feature, and a temporal embedding. The image feature contains pooled features from a large pre-trained image encoder and characterizes the appearance and semantic attributes of the scene element. The geometry feature, on the other hand, characterizes the spatial location as well as the detailed geometry. Temporal information is injected through a learned temporal embedding.
  • Figure 5: Qualitative comparison. The agent boxes are colored by type: gray for vehicles, red for pedestrians, and cyan for cyclists. The predicted trajectories are colored temporally from green to blue. For each modeled agent, the models predict 6 trajectory candidates, whose confidence scores are illustrated by transparency: the more confident, the more visible. The ground-truth trajectory is shown as red dots. In the upper example, MoST rules out the possibility that a vehicle runs into a wall after a U-turn; in the lower example, MoST correctly predicts that a cyclist could suddenly cross the street.
  • ...and 3 more figures
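
The Figure 2 caption describes lifting image features onto LiDAR points through the camera-LiDAR calibration and then extracting one embedding per scene element. The NumPy sketch below illustrates that idea under several assumptions that are not taken from the paper: a pinhole camera model, a single camera, intrinsics already scaled to the feature-map resolution, nearest-pixel lookup, and mean pooling per cluster. The function names (project_points, pointwise_image_features, pool_per_element) are hypothetical and chosen for illustration only; the paper's actual feature extractor (Figure 4) also fuses geometry and a learned temporal embedding, which this sketch omits.

```python
import numpy as np


def project_points(points_xyz, intrinsics, extrinsics):
    """Project LiDAR points (N, 3) to pixel coordinates with a pinhole model.

    Assumptions (illustrative, not from the paper):
      extrinsics: 4x4 transform from the LiDAR frame to a camera frame
                  whose z axis points forward.
      intrinsics: 3x3 camera matrix at the feature-map resolution.
    Returns (uv, in_front): pixel coordinates (N, 2) and a mask (N,) of
    points with positive depth.
    """
    ones = np.ones((len(points_xyz), 1))
    homogeneous = np.concatenate([points_xyz, ones], axis=1)
    cam = (extrinsics @ homogeneous.T).T[:, :3]
    depth = cam[:, 2]
    in_front = depth > 1e-6
    # Avoid dividing by zero or negative depth; such points are masked out later.
    safe_depth = np.where(in_front, depth, 1.0)
    pix = (intrinsics @ cam.T).T
    uv = pix[:, :2] / safe_depth[:, None]
    return uv, in_front


def pointwise_image_features(points_xyz, feature_map, intrinsics, extrinsics):
    """Look up one image feature per point by nearest-pixel sampling.

    feature_map: (H, W, C) array, e.g. the output of an image encoder.
    Returns (features, visible): (N, C) features (zeros where not visible)
    and a mask (N,) of points that land inside the feature map.
    """
    height, width, channels = feature_map.shape
    uv, in_front = project_points(points_xyz, intrinsics, extrinsics)
    cols = np.floor(uv[:, 0]).astype(int)
    rows = np.floor(uv[:, 1]).astype(int)
    visible = in_front & (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    features = np.zeros((len(points_xyz), channels), dtype=feature_map.dtype)
    features[visible] = feature_map[rows[visible], cols[visible]]
    return features, visible


def pool_per_element(point_features, visible, cluster_ids, num_elements):
    """Mean-pool visible point features into one vector per scene element.

    cluster_ids: (N,) integer token/cluster ID of each point.
    Elements with no visible points get an all-zero vector.
    """
    channels = point_features.shape[1]
    sums = np.zeros((num_elements, channels))
    counts = np.zeros(num_elements)
    np.add.at(sums, cluster_ids[visible], point_features[visible])
    np.add.at(counts, cluster_ids[visible], 1)
    return sums / np.maximum(counts, 1)[:, None]


if __name__ == "__main__":
    # Toy data: a 4x4 feature map with 2 channels and four LiDAR points.
    feature_map = np.arange(32, dtype=float).reshape(4, 4, 2)
    intrinsics = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
    extrinsics = np.eye(4)
    points = np.array([[0.0, 0.0, 1.0], [-1.0, -1.0, 1.0],
                       [0.0, 0.0, -1.0], [5.0, 0.0, 1.0]])
    cluster_ids = np.array([0, 0, 1, 1])

    feats, visible = pointwise_image_features(points, feature_map, intrinsics, extrinsics)
    tokens = pool_per_element(feats, visible, cluster_ids, num_elements=2)
    print(tokens.shape)
```

In this toy setup the third point lies behind the camera and the fourth projects outside the feature map, so only element 0 receives image features. With multiple cameras, one plausible extension would be to repeat the lookup per camera and pool across all views in which a point is visible.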