MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving

Junli Wang; Xueyi Liu; Yinan Zheng; Zebing Xing; Pengfei Li; Guang Li; Kun Ma; Guang Chen; Hangjun Ye; Zhongpu Xia; Long Chen; Qichao Zhang

TL;DR

MeanFuser is an end-to-end autonomous driving method that improves both efficiency and robustness through three key designs: Gaussian Mixture Noise in place of discrete anchor vocabularies, one-step trajectory sampling via MeanFlow, and a lightweight Adaptive Reconstruction Module.

Abstract

Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We adapt the "MeanFlow Identity" to end-to-end planning, which models the mean velocity field between the GMN and the trajectory distribution instead of the instantaneous velocity field used in vanilla flow matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that, via attention weights, enables the model to implicitly select from all sampled proposals or reconstruct a new trajectory when none is satisfactory. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance without the supervision of the PDM Score, together with exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving. Our code and model are available at https://github.com/wjl2244/MeanFuser.
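To make designs (1) and (2) concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the common MeanFlow convention of a straight path z_t = (1 - t) * x + t * eps, so that a sample at time r is recovered as z_r = z_t - (t - r) * u(z_t, r, t). The mixture means, the standard deviation, the target trajectories, and the function names (`sample_gmn`, `toy_mean_velocity`, `one_step_sample`) are all hypothetical; in particular, `toy_mean_velocity` is an oracle that is handed the target and merely stands in for a learned network conditioned on scene context.

```python
import random

# A trajectory is a flat list of 2-D waypoints: [x1, y1, x2, y2].
DIM = 4

# Hypothetical Gaussian mixture parameters (illustrative, not from the paper).
MEANS = [
    [0.0, 1.0, 0.0, 2.0],    # component 0: roughly "keep straight"
    [-0.5, 1.0, -1.5, 2.0],  # component 1: roughly "drift left"
]
STD = 0.1

# Hypothetical target trajectories the toy oracle maps each component to.
TARGETS = [
    [0.0, 5.0, 0.0, 10.0],
    [-1.0, 5.0, -3.0, 10.0],
]


def sample_gmn(k, rng):
    """Draw one noise sample from Gaussian component k of the mixture."""
    return [rng.gauss(m, STD) for m in MEANS[k]]


def toy_mean_velocity(z_t, r, t, target):
    """Oracle stand-in for a learned mean velocity u_theta(z_t, r, t).

    On the straight path z_t = (1 - t) * x + t * eps the instantaneous
    velocity eps - x is constant, so the mean velocity over [r, t] equals
    eps - x = (z_t - x) / t for any r < t (r is therefore unused here).
    """
    assert 0.0 <= r < t <= 1.0
    return [(z - x) / t for z, x in zip(z_t, target)]


def one_step_sample(eps, target):
    """One-step sampling: z_0 = z_1 - (1 - 0) * u(z_1, 0, 1), with z_1 = eps."""
    u = toy_mean_velocity(eps, 0.0, 1.0, target)
    return [e - ui for e, ui in zip(eps, u)]


if __name__ == "__main__":
    rng = random.Random(0)
    # One proposal per mixture component, each produced by a single evaluation.
    for k in range(len(MEANS)):
        eps = sample_gmn(k, rng)
        print(k, one_step_sample(eps, TARGETS[k]))
```

Because the mean velocity is evaluated once over the whole interval, no ODE solver steps are involved, which is the source of the speed-up the abstract describes. Drawing one noise sample per mixture component yields one proposal per component, which can be done in parallel.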

Paper Structure (22 sections, 17 equations, 8 figures, 8 tables)

Introduction
Related Work
End-to-End Autonomous Driving
Diffusion and Flow-Based Generative Models for Trajectory Planning
Candidate Trajectory Proposal Evaluation and Selection
Preliminary
Problem Formulation
Flow-Based Model
Method
Scene Context Encoder
Gaussian Mixture Noise
Multi-modal Trajectories Sample
Adaptive Reconstruction Module
Experiments
Dataset and Metrics
...and 7 more sections

Figures (8)

Figure 1: (a) illustrates the differences between our proposed method and existing generative approaches, highlighting the introduction of Gaussian mixture noise to replace anchor vocabularies, one-step sampling, and the adaptive reconstruction module. (b) shows the advantages of MeanFuser over GoalFlow, Hydra-MDP, and DiffusionDrive in terms of closed-loop performance, inference speed, and planning-module inference speed.
Figure 2: Failure scene visualization. Anchor-guided models (GoalFlow, DiffusionDrive) fail due to the inability of discrete vocabularies to cover the entire trajectory space, while our model generates proposals that encompass the optimal trajectories.
Figure 3: Overall architecture of MeanFuser. Training: Both the images and ego-vehicle states are encoded into context features, with auxiliary supervision from mapping and detection tasks. The model is conditioned on these context features to learn the average velocity field $u_{\theta}$ over the time interval between $r$ and $t$. Multi-Modal Sample: Noise samples are drawn from Gaussian Mixture Noise, and the one-step sampling formulation is then applied to generate diverse multi-modal trajectories. Adaptive Reconstruction Module: The sampled multi-modal trajectories are encoded and fused with the context features through cross-attention, after which a Projector outputs the final planning trajectory.
Figure 4: Visualization of sampling from different Gaussian components. Parallel sampling of trajectories from distinct Gaussian components can generate diverse driving styles, ranging from conservative to aggressive.
Figure 5: Visualization of the model's multi-modal trajectories. The left image shows expert demonstration trajectories, while the right image displays the model's inferred strategies for maintaining a straight path and performing a left lane change, representing two different modes.
...and 3 more figures
