SAMoE-VLA: A Scene Adaptive Mixture-of-Experts Vision-Language-Action Model for Autonomous Driving

Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, Yan Wang

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown promising capabilities in autonomous driving by leveraging the understanding and reasoning strengths of Large Language Models (LLMs). However, our empirical analysis reveals that directly applying existing token-level MoE mechanisms, which are inherited from LLM architectures, to VLA models results in unstable performance and safety degradation in autonomous driving, highlighting a misalignment between token-based expert specialization and scene-level decision-making. To address this, we propose SAMoE-VLA, a scene-adaptive Vision-Language-Action framework that conditions expert selection on structured scene representations instead of token embeddings. Our key idea is to derive the MoE routing signal from bird's-eye-view (BEV) features that encapsulate traffic scene context, enabling scenario-dependent expert weighting and merging tailored to distinct driving conditions. Furthermore, to support temporally consistent reasoning across world knowledge, perception, language, and action, we introduce a Conditional Cross-Modal Causal Attention mechanism that integrates world state, linguistic intent, and action history into a unified causal reasoning process. Extensive experiments on the nuScenes open-loop planning dataset and the LangAuto closed-loop benchmark demonstrate that SAMoE-VLA achieves state-of-the-art performance, outperforming prior VLA-based and world-model-based approaches with fewer parameters. Our code will be released soon.
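
The scene-level routing idea in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the mean pooling of BEV tokens, the single linear router, the purely linear experts, and every function name here are assumptions made for illustration. One set of routing weights is computed per scene and used to merge the experts, and the merged expert is then applied to every token, in contrast to token-level routing, where each token would get its own weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(bev_tokens):
    """Average BEV tokens into one scene vector (assumed pooling choice)."""
    return [sum(col) / len(col) for col in zip(*bev_tokens)]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def scene_routing_weights(bev_tokens, router):
    """One weight per expert, computed once per scene rather than per token."""
    return softmax(matvec(router, mean_pool(bev_tokens)))


def merge_experts(experts, weights):
    """Weighted sum of expert matrices (soft merging; linear experts assumed)."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(w * e[r][c] for w, e in zip(weights, experts)) for c in range(cols)]
        for r in range(rows)
    ]


def scene_adaptive_layer(tokens, bev_tokens, experts, router):
    """Apply the same scene-conditioned merged expert to every input token."""
    weights = scene_routing_weights(bev_tokens, router)
    merged = merge_experts(experts, weights)
    return [matvec(merged, tok) for tok in tokens], weights
```

For linear experts, merging the weight matrices is equivalent to averaging the expert outputs with the same weights; with nonlinear experts the two differ, and the abstract alone does not say which variant the paper uses.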

Paper Structure (107 sections, 3 theorems, 188 equations, 17 figures, 8 tables, 5 algorithms)

This paper contains 107 sections, 3 theorems, 188 equations, 17 figures, 8 tables, 5 algorithms.

Key Result

proposition 1

Under local routing, the minimal deviation from the scene-optimal mixture satisfies [equation omitted]. Under both local routing and top-$k$ sparsification, [equation omitted].

Figures (17)

  • Figure 1: Overview of our SAMoE-VLA. SAMoE-VLA employs two functional experts. The World-Language Expert performs multimodal processing by integrating tokenized human instructions, Bird's-Eye-View (BEV) tokens, and soft prompts for world embeddings. The Planning Expert is built from scene-adaptive Mixture-of-Experts (SAMoE) layers routed by the scene representation extracted from the Deformable Scene Encoder, and receives ego-state tokens and noisy action tokens as its input. Our model unifies these experts through Conditional Cross-Modal Causal Attention (CMCA).
  • Figure 2: Overview of our Scene Adaptive MoE guided by the Deformable Scene Encoder. SA-MoE is the layer used in our proposed planning expert shown in Figure 1. The BEV hidden representation is computed only once during inference, while the expert weights are computed in every layer. Each layer has its own linear head (Eq. (8)), so the weights vary per layer.
  • Figure 3: Radar chart comparing the experimental results of different MoE mechanisms. Note that smaller values are placed nearer the edges.
  • Figure 4: t-SNE visualization of mean-pooled BEV hidden representations, colored by KMeans clusters, showing distinct feature regimes and corresponding MoE routing preferences.
  • Figure 5: Comparison of average L2 distance and success rate with and without the scene adaptive soft weighted MoE layer.
  • ...and 12 more figures
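
The Figure 2 caption describes a compute pattern that can be sketched in a few lines of standard-library Python: the scene is encoded once, and each layer applies its own linear head to that shared representation to obtain its expert weights. The class name, the `encode_scene` callable (a hypothetical stand-in for the Deformable Scene Encoder), the mean-pooling encoder in the usage example, and the softmax normalization are all assumptions for illustration, not details confirmed by the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PerLayerSceneRouter:
    """Encode the scene once, then derive expert weights separately per layer."""

    def __init__(self, encode_scene, layer_heads):
        # encode_scene: hypothetical stand-in for the scene encoder.
        # layer_heads: one (num_experts x hidden_dim) matrix per layer.
        self.encode_scene = encode_scene
        self.layer_heads = layer_heads
        self.encoder_calls = 0

    def route(self, bev_features):
        # The shared scene representation is computed a single time ...
        self.encoder_calls += 1
        hidden = self.encode_scene(bev_features)
        # ... while each layer's own linear head produces its own weights.
        weights_per_layer = []
        for head in self.layer_heads:
            logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
            weights_per_layer.append(softmax(logits))
        return weights_per_layer


def mean_pool(bev_features):
    """Toy encoder for the example: average the BEV feature vectors."""
    return [sum(col) / len(col) for col in zip(*bev_features)]


# Usage: two layers, two experts, a 2-dimensional scene representation.
heads = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0 head
    [[0.0, 1.0], [1.0, 0.0]],  # layer 1 head
]
router = PerLayerSceneRouter(mean_pool, heads)
layer_weights = router.route([[2.0, 0.0], [0.0, 0.0]])
```

In this toy setup the same scene yields different expert preferences in the two layers, because only the heads differ while the encoded scene is shared.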

Theorems & Definitions (4)

  • proof : Sketch of Proof
  • proposition 1: Token-Level Routing Gap
  • theorem 1: Trajectory-Level Structural Disruption
  • theorem 2: Variance Reduction of Scene-Level Routing