Vision-Language-Action Model and Diffusion Policy Switching Enables Dexterous Control of an Anthropomorphic Hand

Cheng Pan, Kai Junge, Josie Hughes

TL;DR

This work proposes a hybrid control method that combines the relative advantages of a fine-tuned Vision-Language-Action (VLA) model and diffusion models. It enables event-based transitions between the two models for a pick-and-place task in which the target object and placement location are commanded through language.

Abstract

To advance autonomous dexterous manipulation, we propose a hybrid control method that combines the relative advantages of a fine-tuned Vision-Language-Action (VLA) model and diffusion models. The VLA model provides language-commanded high-level planning, which is highly generalizable, while the diffusion model handles low-level interactions, offering the precision and robustness required for specific objects and environments. By incorporating a switching signal into the training data, we enable event-based transitions between these two models for a pick-and-place task in which the target object and placement location are commanded through language. This approach is deployed on our anthropomorphic ADAPT Hand 2, a 13-DoF robotic hand that incorporates compliance through series elastic actuation, making it resilient to interactions; this demonstrates the first use of a multi-fingered hand controlled with a VLA model. We show that this model-switching approach achieves a success rate of over 80% compared to under 40% when using only a VLA model, enabled by accurate near-object arm motion from the VLA model and a multi-modal grasping motion with error-recovery abilities from the diffusion model.

Paper Structure

This paper contains 14 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Combined VLA and diffusion policy approach for dexterous manipulation, which uses an event signal to transition between the two policies, enabling text input to be translated into hand and wrist commands for an anthropomorphic manipulator.
  • Figure 2: Concept for switching between the VLA and diffusion model using a common event signal $\sigma$ that tracks key moments in the pick-and-place task.
  • Figure 3: ADAPT Hand 2, highlighting the soft continuous skin, compliant series elastic finger joints, and the anatomically driven design.
  • Figure 4: Left) Robot setup for gathering training data through teleoperation, showing the Vision Pro and the locations of the two cameras used to capture training data. Right) The test objects and environment used for data capture and testing.
  • Figure 5: A) Data-collection process for the VLA, which includes the full grasping process and the event-signal recording. B) Data-collection process for the diffusion model, which includes only the grasping portion of the demonstration.
  • ...and 5 more figures
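
The event-based switching described in the abstract and in Figure 2 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each policy returns an action together with an event signal $\sigma$ in [0, 1], and that control passes to the other policy whenever $\sigma$ exceeds a threshold. The names `SwitchingController` and `scripted_policy`, the threshold value, and the policy interface are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a policy maps an observation to (action, sigma),
# where sigma is the event signal. The real models' I/O may differ.
Observation = Dict[str, float]
Policy = Callable[[Observation], Tuple[List[float], float]]


@dataclass
class SwitchingController:
    vla: Policy
    diffusion: Policy
    threshold: float = 0.5          # assumed value, not taken from the paper
    active: str = "vla"             # the VLA model starts in control
    history: List[str] = field(default_factory=list)

    def step(self, obs: Observation) -> List[float]:
        """Query the active policy, then hand over control if sigma fires."""
        policy = self.vla if self.active == "vla" else self.diffusion
        action, sigma = policy(obs)
        self.history.append(self.active)
        if sigma > self.threshold:
            self.active = "diffusion" if self.active == "vla" else "vla"
        return action


def scripted_policy(sigmas: List[float]) -> Policy:
    """Stand-in for a learned model: replays a fixed sequence of sigma values."""
    it = iter(sigmas)

    def policy(obs: Observation) -> Tuple[List[float], float]:
        return [0.0], next(it)      # dummy action, scripted event signal

    return policy


def demo() -> List[str]:
    # The VLA approaches the object and raises sigma on its third step; the
    # diffusion policy grasps and raises sigma when done; the VLA then places.
    vla = scripted_policy([0.0, 0.1, 0.9, 0.0, 0.0])
    diffusion = scripted_policy([0.0, 0.8])
    controller = SwitchingController(vla, diffusion)
    for _ in range(7):
        controller.step({})
    return controller.history
```

Calling `demo()` returns the name of the policy in control at each step, which makes the hand-over points easy to inspect. Using one shared signal for both directions keeps the controller stateless beyond the `active` flag; a real system would likely need debouncing or a timeout, which this sketch omits.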