Generalizing Motion Planners with Mixture of Experts for Autonomous Driving

Qiao Sun, Huimin Wang, Jiahao Zhan, Fan Nie, Xin Wen, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

TL;DR

This work targets generalization bottlenecks in data-driven motion planning for autonomous driving by scaling both data and model size. It introduces StateTransformer-2 (STR2), a decoder-only Mixture-of-Experts motion planner with a Vision Transformer encoder, trained in a self-supervised manner and capable of balancing multiple explicit rewards without reinforcement learning. Across NuPlan and LiAuto datasets, STR2 demonstrates superior generalization in open- and closed-loop tests, including challenging out-of-distribution and zero-shot scenarios, with performance improvements scaling with data and model size. The results suggest that large-scale MoE-based sequence models can outperform more complex architectures and training paradigms in planning tasks, offering practical benefits for robust autonomous driving systems.

Abstract

Large real-world driving datasets have sparked significant research into various aspects of data-driven motion planners for autonomous driving, including data augmentation, model architecture, reward design, training strategies, and planner pipelines. These planners promise better generalization to complicated and few-shot cases than previous methods. However, experimental results show that many of these approaches generalize poorly in planning performance because of overly complex designs or training paradigms. In this paper, we review and benchmark previous methods with a focus on generalization. The experimental results indicate that as models are appropriately scaled, many design elements become redundant. We introduce StateTransformer-2 (STR2), a scalable, decoder-only motion planner that uses a Vision Transformer (ViT) encoder and a mixture-of-experts (MoE) causal Transformer architecture. The MoE backbone addresses modality collapse and reward balancing through expert routing during training. Extensive experiments on the NuPlan dataset show that our method generalizes better than previous approaches across different test sets and closed-loop simulations. Furthermore, we assess its scalability on billions of real-world urban driving scenarios, demonstrating consistent accuracy improvements as both data and model size grow.
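
The expert routing mentioned in the abstract can be pictured with a generic top-k gating sketch. This is a minimal illustration of how MoE layers commonly route a token to a few experts and mix their outputs; it is not STR2's implementation. The number of experts, the value of `k`, the router weights, and the toy "experts" (simple scalings) are all hypothetical choices made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(token, router_weights, experts, k=2):
    """Route one token vector through its top-k experts and mix the outputs."""
    # Linear router: one logit per expert (dot product with the token).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = top_k_route(logits, k)
    output = [0.0] * len(token)
    for index, weight in routes:
        expert_output = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_output)]
    return output, routes

# Hypothetical toy setup: four "experts" that just scale the input vector.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

output, routes = moe_forward([1.0, 0.5], router_weights, experts, k=2)
print(routes)   # selected (expert index, mixing weight) pairs
print(output)   # weighted mix of the selected experts' outputs
```

Only the selected experts run for a given token, which is why MoE models can grow in parameter count without a proportional increase in per-token compute.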

Paper Structure

This paper contains 16 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Top: the planning results (in red) from PDM-Hybrid and STR2 at the pickup area. Bottom: an illustration of the MoE model learning and balancing different explicit rewards. In this case, STR2 produces a better nudging trajectory by balancing two conflicting rewards: making progress and avoiding collisions.
  • Figure 2: An overview of the STR2-CPKS model, in which the MoE backbone models a sequence of context, proposal, key points, and future states. For STR2-CKS, proposals are removed from the sequence for better efficiency. The context consists of rasterized environment information encoded by scalable ViT encoders, together with past ego states.
  • Figure 3: Left: scaling results with the size of the training dataset, counted as the number of tokens $D$. Right: scaling results with the number of model parameters $N$. All axes are logarithmically scaled.
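
Scaling plots with logarithmic axes, as in Figure 3, are usually read by checking whether the points fall on a straight line, which corresponds to a power law $y \approx a \cdot x^{b}$. The sketch below shows one common way to estimate $a$ and $b$ with a least-squares fit in log space. The data points are synthetic placeholders generated from a made-up power law, not values from the paper, and the function name `fit_power_law` is hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b, done as a line fit in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx                      # exponent b
    intercept = mean_y - slope * mean_x    # log(a)
    return math.exp(intercept), slope

# Synthetic placeholder data (NOT from the paper): loss = 5.0 * D**-0.1.
tokens = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * d ** -0.1 for d in tokens]

a, b = fit_power_law(tokens, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A negative exponent `b` means the metric on the vertical axis decreases as data or model size grows; the magnitude of `b` is the slope of the line on the log-log plot.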