SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse

Xuanran Zhai, Zekai Huang, Longyan Wu, Qianyou Zhao, Qiaojun Yu, Jieji Ren, Ce Hao, Harold Soh

TL;DR

The authors argue that effective bimanual VLAs should support skill reuse - the ability to recombine previously learned single-arm skills across novel left-right pairings - so that not every possible combination has to be learned separately.

Abstract

Recent progress in vision-language-action (VLA) models has demonstrated strong potential for dual-arm manipulation, enabling complex behaviors and generalization to unseen environments. However, mainstream bimanual VLA formulations largely overlook the critical challenge of combinatorial diversity. Different pairings of single-arm behaviors can induce qualitatively distinct task behaviors, yet existing models do not explicitly account for this structure. We argue that effective bimanual VLAs should support skill reuse - the ability to recombine previously learned single-arm skills across novel left-right pairings - thereby avoiding the need to separately learn every possible combination. Current VLA designs entangle skills across arms, preventing such recomposition and limiting scalability. To address this limitation, we propose SkillVLA, a framework explicitly designed to enable skill reuse in dual-arm manipulation. Extensive experiments demonstrate that SkillVLA substantially improves skill composition, increasing overall success rate from 0% to 51%, and achieves strong performance on cooperative and long-horizon tasks.
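To make the combinatorial argument concrete, here is a minimal sketch (not the authors' code) that counts pairings for a setup with three skills per arm, the size described in the Figure 3 caption. The skill names and both helper functions are hypothetical placeholders introduced only for illustration.

```python
from itertools import product

# Hypothetical placeholder names; the paper's actual skills are not listed here.
LEFT_SKILLS = ["left_skill_a", "left_skill_b", "left_skill_c"]
RIGHT_SKILLS = ["right_skill_a", "right_skill_b", "right_skill_c"]


def all_pairings(left, right):
    """Enumerate every (left-arm skill, right-arm skill) combination."""
    return list(product(left, right))


def single_arm_skill_count(left, right):
    """Number of single-arm skills a model must learn if pairings can be recomposed."""
    return len(left) + len(right)


pairings = all_pairings(LEFT_SKILLS, RIGHT_SKILLS)
print(len(pairings))                                      # 9 left-right pairings
print(single_arm_skill_count(LEFT_SKILLS, RIGHT_SKILLS))  # 6 single-arm skills
```

With n skills per arm, the number of pairings grows as n squared while the number of single-arm skills grows as 2n, which is the gap that skill reuse is meant to exploit.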

Paper Structure (28 sections, 21 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: SkillVLA extracts single-arm skills from training data using hierarchical reasoning and skill-adaptive learning, and can recompose them into unseen combinations at test time.
  • Figure 2: SkillVLA framework. SkillVLA adopts a two-level reasoning pipeline: a high-level VLM generates a separate subtask for each arm, and low-level VLMs process these prompts to guide action generation. Inter-arm cross-attention enables cooperative behavior and is controlled by a collaboration estimator that identifies the required operation mode.
  • Figure 3: Skill recomposition tasks. (A): The models are trained on demonstrations of three skills for each arm. (B): After the models have learned the skills, zero-shot tests are conducted for every possible combination of left-arm and right-arm skills.
  • Figure 4: Cooperative tasks. (a) Shake: Shake the capped cup without the cup and cap coming apart. (b) Ball: Lift the ball steadily. (c) Align: Align the blocks on the table.
  • Figure 5: Long-horizon task behaviors and results. Top: Behavior of $\pi_{0.5}$ in Tubes. Middle left: Behavior of SkillVLA in Tubes. Bottom left: Changes in $\alpha$ over the course of task completion, for SkillVLA and for an ablated version without discretization of $\alpha$. Bottom right: Average progress score and completion time of each method on the long-horizon tasks.
  • ...and 4 more figures

Theorems & Definitions (3)

  • Definition 1: Single-Arm Skills
  • Definition 2: Dual-Arm Skills
  • Definition 3: Skill Reuse