TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

Hokyun Im, Euijin Jeong, Jianlong Fu, Andrey Kolobov, Youngwoon Lee

TL;DR

TwinVLA tackles the data scarcity of bimanual robotic datasets by reusing abundant single-arm VLA data. It duplicates a pretrained SingleVLA into two arms and coordinates them with a lightweight joint-attention MoE framework, avoiding large-scale bimanual pretraining. The approach achieves competitive performance with significantly less bimanual data and compute across real and simulated tasks, narrowing the gap to state-of-the-art models that rely on proprietary data. This modular, data-efficient strategy demonstrates a scalable path toward high-performance bimanual manipulation and could generalize to other embodied tasks.

Abstract

Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation, leveraging public single-arm data.

Paper Structure

This paper contains 37 sections, 2 equations, 15 figures, 9 tables, 2 algorithms.

Figures (15)

  • Figure 1: Overview of TwinVLA. Inspired by humans' two-arm coordination for bimanual manipulation, TwinVLA duplicates a VLM backbone pretrained on cross-embodiment single-arm data (Left) to form two arm-specific branches linked via Joint Attention (Right). Shared inputs (ego-centric views, language instructions) are routed via a mixture-of-experts (MoE) to improve computational efficiency. Only the VLM backbone is duplicated, keeping the increase in model size minimal.
  • Figure 2: (a) Data efficiency. While RDT-1B and $\pi_0$ use 1M+ single-arm data together with sizable bimanual data, TwinVLA uses $\sim$0.5M single-arm data and only 50 target bimanual data. (b) Compute efficiency. RDT-1B and $\pi_0$ require high compute (exceeding 1,000 H100 GPU-days), whereas TwinVLA achieves higher or comparable performance at substantially lower compute (25 H100 GPU-days).
  • Figure 3: (a) Causal attention mask for joint attention. It preserves causality while processing shared, left, and right inputs in parallel. (b) TwinVLA joint attention mechanism. The two VLMs share information, and the shared modality $(l, I_\text{ego})_t$ is further processed by MoE to more efficiently leverage both VLMs.
  • Figure 4: Experimental setups. (a) We evaluate TwinVLA on three real-world bimanual tasks using an Anubis robot. (b) We further analyze TwinVLA on a large suite of simulation tasks: $5$ tasks in Tabletop-Sim and $50$ tasks in RoboTwin 2.0.
  • Figure 5: Success rates on real-world tasks. TwinVLA outperforms RDT-1B and DP on average. Moreover, TwinVLA performs comparably to $\pi_0$ while being trained only on target data.
  • ...and 10 more figures
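
The Figure 3(a) caption describes a causal attention mask that processes shared, left, and right inputs in parallel. The exact mask is not given in this summary, so the following is only a minimal sketch under stated assumptions: tokens are laid out as [shared | left | right], the left and right branches start at the same time index right after the shared prefix, and a query may attend to any key that is not in its future. The function name `joint_attention_mask` and its parameters are hypothetical and are not taken from the paper.

```python
def joint_attention_mask(n_shared, n_left, n_right):
    """Build a boolean mask M where M[i][j] is True if query i may attend to key j.

    Assumed token layout (not confirmed by the source): [shared | left | right].
    """
    # Give every token a time index. Left and right tokens both start right
    # after the shared prefix, so the two arm branches advance in parallel.
    positions = (
        list(range(n_shared))
        + [n_shared + k for k in range(n_left)]
        + [n_shared + k for k in range(n_right)]
    )
    n = len(positions)
    # Causality: a query sees a key only if the key's time index is not later.
    return [[positions[j] <= positions[i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Print the mask for 2 shared, 2 left, and 2 right tokens (1 = may attend).
    for row in joint_attention_mask(2, 2, 2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, shared tokens never attend to arm-specific tokens, while each arm token attends to the full shared prefix and to tokens of both arms up to its own time index. Whether the two arms may see each other at the same step is a design choice made here for illustration only.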