UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

Ziyi Song, Chen Xia, Chenbing Wang, Haibao Yu, Sheng Zhou, Zhisheng Niu

TL;DR

UniMM-V2X addresses the challenge of robust, end-to-end multi-agent autonomous driving by introducing MoE-enhanced multi-level fusion that cooperates across perception and prediction to support planning. The framework places mixtures of experts in both the BEV encoder and the motion decoder, enabling task-specific feature representations and diverse motion queries, while performing explicit perception-level and prediction-level fusion via TrackFusion/TrajFusion. Empirical results on DAIR-V2X and V2X-Sim demonstrate state-of-the-art improvements in detection, tracking, motion prediction, and planning, with substantial reductions in collision rate and planning error, alongside favorable communication-efficiency trade-offs. These findings highlight the practical viability of scalable, cooperative end-to-end driving that adapts to downstream tasks and complex multi-agent dynamics, offering a new direction for reliable real-world deployment.

Abstract

Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making of standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus merely on perception-level tasks, overlooking alignment with downstream planning and control, or fall short of leveraging the full capacity of recently emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further enhance the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture to dynamically enhance the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate that our approach achieves state-of-the-art (SOTA) performance, with a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.
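
To make the MoE idea in the abstract concrete, the following is a minimal sketch of a sparse mixture-of-experts layer applied to a single feature vector. It is not the paper's implementation: the top-k routing rule, the linear experts, and every name here (`moe_layer`, `gate_W`, `experts`, `top_k`) are assumptions chosen for illustration, since the abstract does not specify the gating scheme.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_layer(x, gate_W, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs.

    Hypothetical sketch: gate_W has one row per expert, and each
    expert is a plain linear map given as a matrix.
    """
    logits = matvec(gate_W, x)  # one routing logit per expert
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])  # renormalise over chosen experts
    out = [0.0] * len(experts[0])  # output dim = number of rows in an expert
    for w, i in zip(weights, top):
        y = matvec(experts[i], x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, dict(zip(top, weights))


if __name__ == "__main__":
    random.seed(0)
    dim, num_experts = 4, 3
    gate_W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    experts = [
        [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
        for _ in range(num_experts)
    ]
    y, routing = moe_layer([0.5, -1.0, 0.25, 2.0], gate_W, experts, top_k=2)
    print("expert weights:", routing)
    print("output:", y)
```

In this sketch only the selected experts are evaluated, so different inputs can be handled by different experts while the per-input compute stays roughly constant as experts are added.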

Paper Structure

This paper contains 35 sections, 22 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: V2X communication modes in the VICAD (Vehicle-to-Infrastructure Cooperation Autonomous Driving) problem. (a) Cooperative perception methods focus on multi-agent detection and tracking, but may not align with planning objectives. (b) Vanilla solutions fuse features directly to generate planning outputs, with limited interpretability and compromised safety. (c) Module results can be supervised, but only enable perception-level cooperation. (d) Our design employs multi-level, multi-agent cooperation that integrates perception and prediction to enable cooperative decision-making.
  • Figure 2: The overview of the UniMM-V2X framework. The system performs explicit multi-level fusion by integrating perception-level and prediction-level information from multiple agents to enhance downstream planning. Both the BEV encoder and motion decoder are equipped with MoE architectures, where the encoder generates task-adaptive BEV features tailored for various downstream tasks, and the decoder employs specialized experts to model diverse motion patterns, enhancing the effectiveness and adaptability of multi-level fusion for more robust planning performance. This unified MoE-enhanced multi-level fusion framework facilitates effective cooperation among agents throughout the entire autonomous driving pipeline.
  • Figure 3: MoE-enhanced encoder and decoder in UniMM-V2X. The encoder enriches BEV feature extraction for diverse downstream tasks (e.g., detection, tracking, mapping, motion prediction), while the decoder generates motion queries through motion-specific experts (e.g., going forward, turning left, turning right) to improve planning quality.
  • Figure 4: Multi-level fusion in UniMM-V2X. (a) Perception-level fusion introduces positional priors via reference point embeddings and uses attention-based dynamic fusion across agents. (b) Prediction-level fusion employs anchor-based embedding and dynamic fusion to support motion reasoning in complex multi-agent settings.
  • Figure 5: Performance under different communication constraints.
  • ...and 1 more figure
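
The attention-based dynamic fusion mentioned in the Figure 4 caption can be illustrated with a small sketch in which one ego-agent query attends over queries received from another agent. This is a simplified illustration under stated assumptions, not the paper's TrackFusion/TrajFusion modules: positional priors are modelled as plain vectors added to the queries, there are no learned projections, and all names (`fuse_query`, `ego_pos`, `remote_pos`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_query(ego_q, ego_pos, remote_qs, remote_pos):
    """Fuse one ego query with remote-agent queries via dot-product attention.

    Assumption: positional embeddings are simply added to the queries
    before scoring; the fused result is a residual update of ego_q.
    """
    q = [a + b for a, b in zip(ego_q, ego_pos)]
    keys = [[a + b for a, b in zip(rq, rp)] for rq, rp in zip(remote_qs, remote_pos)]
    scale = math.sqrt(len(q))
    attn = softmax([dot(q, k) / scale for k in keys])  # weight per remote query
    mixed = [
        sum(w * rq[d] for w, rq in zip(attn, remote_qs))
        for d in range(len(ego_q))
    ]
    fused = [e + m for e, m in zip(ego_q, mixed)]  # residual connection
    return fused, attn


if __name__ == "__main__":
    ego_q, ego_pos = [1.0, 0.0], [0.0, 0.0]
    remote_qs = [[1.0, 0.0], [0.0, 1.0]]
    remote_pos = [[0.0, 0.0], [0.0, 0.0]]
    fused, attn = fuse_query(ego_q, ego_pos, remote_qs, remote_pos)
    print("attention weights:", attn)
    print("fused query:", fused)
```

Because the weights come from a softmax over similarity scores, remote queries that agree with the ego query (in content and position) contribute more to the fused result than unrelated ones.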