SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

Zihan You; Hongwei Liu; Chenxu Dang; Zhe Wang; Sining Ang; Aoqi Wang; Yan Wang

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token-level Mixture-of-Experts (MoE) mechanisms, which are inherited from LLM architectures, to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level decision-making. To address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulate traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open-loop planning dataset and the LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer parameters. Our code will be released soon.
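The contrast the abstract draws between token-level and scene-level routing can be illustrated with a minimal sketch. It is not the authors' implementation: the mean pooling of BEV features, the single linear gate, and all names (`token_level_weights`, `scene_level_weights`, `gate_tok`, `gate_scene`) are assumptions chosen for illustration. The point is only that token-level routing produces one expert distribution per token, whereas scene-level routing produces a single distribution shared by every token in the scene.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_pool(vectors):
    """Average a list of equal-length feature vectors (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def token_level_weights(tokens, gate):
    """Token-level routing: one expert distribution per token."""
    return [softmax(matvec(gate, token)) for token in tokens]


def scene_level_weights(bev_features, gate):
    """Scene-level routing: one expert distribution for the whole scene,
    derived from pooled BEV features rather than token embeddings."""
    return softmax(matvec(gate, mean_pool(bev_features)))


# Toy sizes and random values, purely illustrative.
rng = random.Random(0)
num_experts, dim = 4, 8
gate_tok = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
gate_scene = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
bev_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

per_token = token_level_weights(tokens, gate_tok)      # 6 distributions
per_scene = scene_level_weights(bev_features, gate_scene)  # 1 distribution
```

In this toy setting, `per_token` holds six different distributions, so neighbouring action tokens may be sent to different expert mixtures, while `per_scene` applies the same mixture to every token of the scene.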

Paper Structure

This paper contains 107 sections, 3 theorems, 188 equations, 17 figures, 8 tables, 5 algorithms.

Introduction
Related Work
VLA for End-to-End Autonomous Driving
Mixture of Experts
Method
Method Architecture
Conditional Cross-Modal Causal Attention
Scene Adaptive MoE with Deformable Scene Encoder
Training Stages and Objective
Experiments
Datasets and Implementation Details
Main Results
Ablation Study
Conclusion
The implementation of Flow-Matching
...and 92 more sections

Key Result

Proposition 1 (Token-Level Routing Gap)

Under local routing, the minimal deviation from the scene-optimal mixture satisfies [...]. Under both local routing and top-$k$ sparsification, [...].

Figures (17)

Figure 1: Overview of our SAMoE-VLA. SAMoE-VLA employs two functional experts. World-Language Expert: this module performs multimodal processing by integrating tokenized human instructions, Bird's-Eye-View (BEV) tokens, and soft prompts for world embeddings. Planning Expert: this expert is built from scene-adaptive Mixture-of-Experts (SAMoE) layers, which are routed by the scene representation extracted by the Deformable Scene Encoder, and it receives ego-state tokens and noisy action tokens as input. Our model unifies these experts through Conditional Cross-Modal Causal Attention (CMCA).
Figure 2: Overview of our Scene Adaptive MoE guided by the Deformable Scene Encoder. SA-MoE is the layer of our proposed planning expert shown in Figure 1. The BEV hidden representation is computed only once during inference, while the expert weights are computed in every layer. Each layer has its own linear head (Eq. (8)), so the weights vary per layer.
Figure 3: Radar chart of experimental results for different MoE mechanisms. Note that smaller values are placed nearer the edges.
Figure 4: t-SNE visualization of mean-pooled BEV hidden representations, colored by KMeans clusters, showing distinct feature regimes and corresponding MoE routing preferences.
Figure 5: Comparison of average L2 distance and success rate with and without the scene-adaptive soft-weighted MoE layer.
...and 12 more figures
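The Figure 2 caption describes a BEV hidden representation that is computed once and then reused by a separate linear head in every layer, and the abstract mentions expert weighting and merging. A minimal sketch of that pattern follows. It rests on several assumptions that are not confirmed by the text here: experts are plain linear maps, merging is a weighted sum of expert matrices, routing is a dense softmax, and every name (`SceneRoutedLayer`, `merge_experts`, `bev_hidden`) is hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def merge_experts(expert_mats, weights):
    """Weighted sum of expert matrices (assumed form of 'merging')."""
    rows, cols = len(expert_mats[0]), len(expert_mats[0][0])
    return [
        [sum(w * e[i][j] for w, e in zip(weights, expert_mats)) for j in range(cols)]
        for i in range(rows)
    ]


class SceneRoutedLayer:
    """One layer with its own linear routing head over the shared scene vector."""

    def __init__(self, dim, scene_dim, num_experts, rng):
        self.head = rand_matrix(num_experts, scene_dim, rng)  # per-layer head
        self.experts = [rand_matrix(dim, dim, rng) for _ in range(num_experts)]

    def routing_weights(self, bev_hidden):
        return softmax(matvec(self.head, bev_hidden))

    def forward(self, tokens, bev_hidden):
        merged = merge_experts(self.experts, self.routing_weights(bev_hidden))
        # The same merged expert is applied to every token in the scene.
        return [matvec(merged, token) for token in tokens]


rng = random.Random(0)
dim, scene_dim, num_experts, num_layers = 4, 6, 3, 2
layers = [SceneRoutedLayer(dim, scene_dim, num_experts, rng) for _ in range(num_layers)]

bev_hidden = [rng.gauss(0, 1) for _ in range(scene_dim)]  # computed once per scene
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

layer_weights = []
hidden = tokens
for layer in layers:
    layer_weights.append(layer.routing_weights(bev_hidden))  # recomputed per layer
    hidden = layer.forward(hidden, bev_hidden)
```

Because the experts in this sketch are linear, applying the merged matrix gives the same result as mixing the individual expert outputs with the same weights, which is why merging can be done once per layer instead of once per token. With nonlinear experts that equivalence would not hold in general.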

Theorems & Definitions (4)

Proof: Sketch of Proof
Proposition 1: Token-Level Routing Gap
Theorem 1: Trajectory-Level Structural Disruption
Theorem 2: Variance Reduction of Scene-Level Routing
