Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models

Boshi An, Chenyu Yang, Robert Katzschmann

TL;DR

The paper addresses dexterous human-robot collaboration by adapting a pre-trained Vision-Language-Action model (Open-VLA) for collaborative tasks. It introduces FiLM conditioning, an auxiliary hand-intent head, and action-space post-processing that produces compact delta actions and PCA-based finger representations. The approach is validated on a teleoperated Mimic-on-Franka setup with multi-view data, showing that a small number of principal components captures most hand-joint variance and that delta-action representations improve learning stability. Real-time performance (~0.3 s latency on an RTX 4090) enables long-horizon sequences such as pick-up and handover, but trainer overfitting to a single demonstrator remains a key limitation that affects generalization. The work demonstrates that large VLA models can be repurposed for physical collaboration given appropriate inductive biases, and it identifies latency, generalization across collaborators, and flexible planning as important directions for future improvement.

Abstract

We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We surface "trainer overfitting" to specific demonstrators as the key limitation.
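The action-space post-processing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 16-joint hand dimension, and the SVD-based PCA are assumptions chosen for illustration, and only position deltas are handled (rotation deltas would need proper rotation composition rather than subtraction).

```python
import numpy as np


def to_deltas(poses):
    """Convert absolute positions of shape (T, D) to per-step deltas (T-1, D)."""
    return np.diff(poses, axis=0)


def from_deltas(start, deltas):
    """Integrate deltas back into absolute positions, excluding the start pose."""
    return start + np.cumsum(deltas, axis=0)


def fit_pca(joints, k):
    """Fit a k-component PCA to joint angles of shape (N, J) via SVD.

    Returns the mean, the top-k components (k, J), and the fraction of
    variance those components explain.
    """
    mean = joints.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(joints - mean, full_matrices=False)
    variance = singular_values ** 2
    explained = variance[:k].sum() / variance.sum()
    return mean, vt[:k], explained


def encode(joints, mean, components):
    """Project full joint vectors onto the compact PCA code."""
    return (joints - mean) @ components.T


def decode(codes, mean, components):
    """Map compact PCA codes back to full joint commands."""
    return codes @ components + mean


if __name__ == "__main__":
    # Synthetic stand-in data (hypothetical): 16 joints driven by 4 latent synergies.
    rng = np.random.default_rng(0)
    joints = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 16))
    joints += 0.01 * rng.normal(size=joints.shape)
    mean, components, explained = fit_pca(joints, k=4)
    print(f"variance explained by 4 components: {explained:.4f}")
```

In a setup like this, a policy would predict the low-dimensional code and the position delta, and `decode` plus `from_deltas` would map the prediction back to full commands. The explained-variance figure here comes from synthetic data and says nothing about the paper's reported ~96%.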

Paper Structure

This paper contains 25 sections, 1 equation, 11 figures.

Figures (11)

  • Figure 1: Robotic system
  • Figure 2: The individuals involved in data collection.
  • Figure 3: The final composition of the synchronized dataset.
  • Figure 4: The modified model structure. The red block represents the FiLM layers added to the vision encoders, the orange block represents the modified action chunking projector, and the blue block represents the modified action post-processing module.
  • Figure 5: Overview of the training pipeline. Training is carried out with the data distributed across a 4-GPU compute node.
  • ...and 6 more figures