Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA

Tutian Tang, Xingyu Ji, Wanli Xing, Ce Hao, Wenqiang Xu, Lin Shao, Cewu Lu, Qiaojun Yu, Jiangmiao Pang, Kaifeng Zhang

TL;DR

This paper introduces IMCopilot, a suite of reinforcement learning-trained atomic skills that plays a dual role: it acts as a shared-autonomy assistant to simplify teleoperation data collection, and it serves as a callable low-level execution primitive for the VLA.

Abstract

While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation, specifically contact-rich in-hand operations, introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of reinforcement learning-trained atomic skills that plays a dual role: it acts as a shared-autonomy assistant to simplify teleoperation data collection, and it serves as a callable low-level execution primitive for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. Using a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating a twofold improvement in success rate over the baseline on dexterous, contact-rich tasks.
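
The residual injection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `inject_residual`, the linear correction, the scalar `gate`, and all dimensions are hypothetical choices made for illustration. It only shows why an additive correction can leave the backbone's output untouched when the correction is switched off.

```python
import numpy as np

def inject_residual(base_action, contact_features, weights, gate):
    """Add a gated, contact-conditioned correction to a base action.

    Hypothetical helper: the linear map and scalar gate are assumptions,
    not details taken from the paper.
    """
    # A linear "expert" maps contact features to an action-sized correction.
    correction = contact_features @ weights
    # The base action passes through unchanged; only the residual term
    # depends on the contact signal.
    return base_action + gate * correction

rng = np.random.default_rng(0)
base_action = rng.normal(size=7)        # assumed 7-D arm action from the backbone
contact_features = rng.normal(size=6)   # assumed 6-D force/torque reading
weights = rng.normal(size=(6, 7))       # assumed correction weights

# With the gate at zero, the output equals the backbone's action exactly.
unchanged = inject_residual(base_action, contact_features, weights, gate=0.0)
# With a non-zero gate, the action is shifted by the contact-conditioned term.
refined = inject_residual(base_action, contact_features, weights, gate=0.1)
```

Because the correction is purely additive, a residual branch that starts near zero would begin training from the pretrained behaviour, which is one plausible reading of "without degrading the model's pretrained knowledge".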

Paper Structure (25 sections, 5 equations, 4 figures, 2 tables)

This paper contains 25 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of our proposed framework. (a) We introduce an RL-augmented teleoperation system equipped with force and tactile feedback, featuring IMCopilot to assist human operators. (b) Using the collected data, we train the MoDE-VLA model, which is capable of executing highly complex, long-horizon tasks such as peeling an apple. Here, IMCopilot works with the VLA as a callable low-level skill for in-hand manipulation. (c) Our learned policy successfully generalizes to a variety of other dexterous, contact-rich tasks, including tube rearranging, charger plugging, and gear assembling.
  • Figure 2: System Overview. (a) The teleoperation system, with exoskeletons, a VR headset, and foot pedals. (b) The VR view, integrating the robot's camera stream with real-time force and tactile feedback overlays. (c) The robot platform used for executing contact-rich tasks.
  • Figure 3: Overview of MoDE-VLA. Left: the OpenPI-0 backbone encodes visual, linguistic, proprioceptive, and noisy action inputs into token sequences. Center: the Mixture-of-Dexterous-Experts (MoDE) VLA ingests force and tactile observations, routes them through sparse experts, and produces modality-specific residual corrections: force-guided adjustments for arm actions and tactile-guided adjustments for hand actions. Right: a hierarchical decision mechanism selects between two options at each timestep: Option 1, where hand actions are generated by the VLA with MoDE tactile refinement via flow matching, and Option 2, where the RL-trained IMCopilot directly governs hand actions based on hand proprioceptive states. In both cases, arm and other actions are produced by the VLA with MoDE force refinement.
  • Figure 4: Illustration of the four evaluation tasks (rows): Apple Peeling, Tube Rearranging, Gear Assembling, and Charger Plugging. Each row shows five key frames of task execution from left to right.
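
The data flow described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's architecture: top-k softmax routing stands in for the "sparse experts", the experts are plain linear maps, flow matching is omitted entirely, and the names `top_k_route`, `select_actions`, and `use_copilot`, as well as all dimensions, are hypothetical.

```python
import numpy as np

def top_k_route(features, router_w, expert_ws, k=2):
    """Sparse mixture: evaluate only the k highest-scoring experts (assumed scheme)."""
    scores = features @ router_w                     # one score per expert
    top = np.argsort(scores)[-k:]                    # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                         # softmax over the selected experts
    outputs = np.stack([features @ expert_ws[i] for i in top])
    return weights @ outputs                         # weighted sum of expert outputs

def select_actions(vla_arm, vla_hand, force_res, tactile_res, copilot_hand, use_copilot):
    """Combine arm and hand actions following the two options in the caption."""
    arm = vla_arm + force_res                        # arm: VLA plus force residual in both options
    if use_copilot:
        hand = copilot_hand                          # Option 2: RL-trained skill governs the hand
    else:
        hand = vla_hand + tactile_res                # Option 1: VLA plus tactile residual
    return arm, hand

rng = np.random.default_rng(0)
n_experts, arm_dim, hand_dim = 4, 7, 12              # assumed sizes
force, tactile = rng.normal(size=6), rng.normal(size=16)

# Separate (hypothetical) routers and experts for the force and tactile modalities.
force_res = top_k_route(force, rng.normal(size=(6, n_experts)),
                        rng.normal(size=(n_experts, 6, arm_dim)))
tactile_res = top_k_route(tactile, rng.normal(size=(16, n_experts)),
                          rng.normal(size=(n_experts, 16, hand_dim)))

vla_arm, vla_hand = rng.normal(size=arm_dim), rng.normal(size=hand_dim)
copilot_hand = rng.normal(size=hand_dim)             # stand-in for an IMCopilot output

arm1, hand1 = select_actions(vla_arm, vla_hand, force_res, tactile_res, copilot_hand, False)
arm2, hand2 = select_actions(vla_arm, vla_hand, force_res, tactile_res, copilot_hand, True)
```

In this sketch the arm action is identical under both options, and only the source of the hand action changes, which mirrors the caption's statement that arm and other actions always come from the VLA with force refinement. How the real system decides between the two options at each timestep is not specified here; the boolean flag is a placeholder.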