DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

Zhen Fang, Zhuoyang Liu, Jiaming Liu, Hao Chen, Yu Zeng, Shiting Huang, Zehui Chen, Lin Chen, Shanghang Zhang, Feng Zhao

TL;DR

DualVLA tackles action degeneration, the drop in action performance that arises when a specialist Vision-Language-Action model is enriched with multimodal reasoning. It introduces a post-training framework with two components: dual-layer data pruning, which removes redundant embodied reasoning, and dual-teacher adaptive distillation, which provides domain-aligned supervision for action and for reasoning. To enable fine-grained assessment, it also proposes VLA Score, a retrieval-augmented evaluation pipeline that scores reasoning, action, intention, and alignment. Empirical results on simulation and real-world tasks show stronger action execution without sacrificing reasoning, supporting both the approach and its evaluation paradigm for generalizable embodied VLA systems.

Abstract

To build a generalizable Vision-Language-Action (VLA) model with strong reasoning ability, a common strategy is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then incorporate mixed annotated robot data together with multimodal data to restore broader reasoning capabilities. However, we observe that the resulting reasoning VLA often suffers from degraded action performance compared to the specialist model before fine-tuning, a phenomenon we refer to as action degeneration. To address this issue, we propose DualVLA, which enhances action performance through carefully designed post-training while still preserving reasoning capability. We first introduce a dual-layer data pruning method that removes redundant embodied reasoning, preventing it from adversely influencing action learning. To further strengthen action generation, we design a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability. To fill the evaluation gap for generalist VLAs, we also propose VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment dimensions for a more fine-grained assessment. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between precise action execution and multimodal understanding. Project Website: https://costaliya.github.io/DualVLA/.
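
The dual-teacher idea in the abstract (different supervision signals for different data domains) can be illustrated with a small routing sketch. This is only an illustration under stated assumptions, not the paper's implementation: the domain labels `"robot"` and `"multimodal"`, the function names, and the use of a plain KL divergence as the distillation loss are all hypothetical, and the adaptive part of the strategy is omitted because this summary does not specify it.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def dual_teacher_loss(
    sample: Dict[str, str],
    student_logits: List[float],
    action_teacher_logits: List[float],
    reasoning_teacher_logits: List[float],
) -> float:
    """Distill from the teacher that matches the sample's data domain.

    Assumption: robot samples are supervised by the action teacher and all
    other samples by the reasoning teacher. The domain labels are hypothetical.
    """
    if sample["domain"] == "robot":
        teacher_logits = action_teacher_logits
    else:
        teacher_logits = reasoning_teacher_logits
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


# Toy usage: the student already matches the action teacher,
# so only the multimodal sample yields a nonzero loss.
student = [1.0, 2.0, 3.0]
action_teacher = [1.0, 2.0, 3.0]
reasoning_teacher = [3.0, 2.0, 1.0]
robot_loss = dual_teacher_loss({"domain": "robot"}, student, action_teacher, reasoning_teacher)
mm_loss = dual_teacher_loss({"domain": "multimodal"}, student, action_teacher, reasoning_teacher)
```

Routing by domain keeps each teacher's signal confined to the data it is suited for, which is the property the abstract attributes to the strategy.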

Paper Structure

This paper contains 34 sections, 11 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: DualVLA first constructs a sparse, information-dense embodied reasoning dataset by combining video event prediction with kinematic cues, mitigating the negative impact of redundant reasoning on action generation. It then adopts a dual-teacher strategy: an action teacher offering fine-grained supervision for manipulation, and a reasoning teacher maintaining general reasoning capability. Together, these components enable DualVLA to achieve strong performance in both simulation and real-world robotic evaluations.
  • Figure 2: VLMs possess strong reasoning ability but lack action skills. Specialist VLAs achieve strong action capability but lose general reasoning. Reasoning VLAs partially recover reasoning through additional supervision, yet their action performance drops, illustrating the action degeneration problem. Our goal is to build a model that excels at both reasoning and action simultaneously.
  • Figure 3: Overview of VLA Score evaluation pipeline. Given the policy trajectory, task description, and optional reasoning as input, VLA Score first performs dual retrieval to fetch task-relevant textual examples and visually similar trajectories from a curated knowledge base. The retrieved samples serve as few-shot context for the VLM judge, which evaluates the trajectory along four dimensions: Reasoning, Action, Intention, and Alignment. These scores are then combined with the simulation outcome to produce the final VLA Score.
  • Figure 4: Visualization of task progress in the two real-world tasks.
  • Figure 5: Ablation for distillation.
  • ...and 11 more figures
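
The dual retrieval step described in the Figure 3 caption can be sketched as two nearest-neighbour lookups over one knowledge base, one on text embeddings and one on visual embeddings. This is a minimal sketch under assumptions: the entry fields `id`, `text_emb`, and `visual_emb`, the cosine-similarity ranking, and the function names are all hypothetical, since the caption does not say how retrieval is implemented. The VLM judge and the final combination with the simulation outcome are left out for the same reason.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def top_k(query: Sequence[float], entries: List[Dict], key: str, k: int) -> List[str]:
    """Return the ids of the k entries whose `key` embedding is closest to the query."""
    ranked = sorted(entries, key=lambda e: cosine(query, e[key]), reverse=True)
    return [e["id"] for e in ranked[:k]]


def dual_retrieve(
    text_query: Sequence[float],
    visual_query: Sequence[float],
    knowledge_base: List[Dict],
    k: int = 2,
) -> Dict[str, List[str]]:
    """Fetch task-relevant text examples and visually similar trajectories.

    Assumption: both kinds of embedding live on the same knowledge-base
    entries. The two result lists would serve as few-shot context for a judge.
    """
    return {
        "text_examples": top_k(text_query, knowledge_base, "text_emb", k),
        "similar_trajectories": top_k(visual_query, knowledge_base, "visual_emb", k),
    }


# Toy knowledge base with 2-D embeddings (hypothetical values).
kb = [
    {"id": "a", "text_emb": [1.0, 0.0], "visual_emb": [0.0, 1.0]},
    {"id": "b", "text_emb": [0.0, 1.0], "visual_emb": [1.0, 0.0]},
    {"id": "c", "text_emb": [1.0, 1.0], "visual_emb": [1.0, 1.0]},
]
context = dual_retrieve([1.0, 0.0], [1.0, 0.0], kb, k=2)
```

In this toy run the text and visual queries rank the entries differently, which is why the two lookups are kept separate.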