Scaffolding Dexterous Manipulation with Vision-Language Models

Vincent de Bakker, Joey Hejna, Tyler Ga Wei Lum, Onur Celik, Aleksandar Taranovic, Denis Blessing, Gerhard Neumann, Jeannette Bohg, Dorsa Sadigh

TL;DR

The paper tackles the data and reward-design bottlenecks in dexterous manipulation by using off-the-shelf vision-language models (VLMs) to generate coarse, high-level hand and object trajectories ("scaffolds") from a language command and a scene image. A low-level residual RL policy then learns to track these scaffolds in simulation, which yields closed-loop control without human demonstrations or handcrafted rewards. The approach is evaluated on eight simulated tasks with a humanoid-like hand and is transferred to real-world robotic hands using domain randomization. Key contributions include a three-phase VLM-based trajectory generation procedure, a residual RL training regime with dense keypoint rewards, and an analysis of failure modes and ablations. The results indicate generalization to unseen initial conditions and objects, a step toward scalable, language-conditioned dexterous manipulation.

Abstract

Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. While reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. Thus, contemporary works in dexterous manipulation have often bootstrapped from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories - particularly for dexterous hands - remains a significant challenge. Yet, the precise details in explicit reference trajectories are often unnecessary, as RL ultimately refines the motion. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., "open the cabinet") and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand motion and object motion. Subsequently, we train a low-level residual RL policy in simulation to track these coarse trajectories or "scaffolds" with high fidelity. Across a number of simulated tasks involving articulated objects and semantic understanding, we demonstrate that our method is able to learn robust dexterous manipulation policies. Moreover, we showcase that our method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.
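To make the two ingredients named above more concrete, here is a minimal sketch of a dense keypoint-tracking reward and a residual action. This is an illustration under stated assumptions, not the authors' implementation: the exponential reward shape, the `scale` and `limit` parameters, and both function names are hypothetical and do not come from the paper.

```python
import math

def keypoint_tracking_reward(current, target, scale=10.0):
    """Dense reward in (0, 1] that grows as keypoints approach their targets.

    ASSUMPTION: the exp(-scale * mean distance) shape is illustrative only;
    the paper's actual reward may be defined differently.
    """
    if not current or len(current) != len(target):
        raise ValueError("need equally many, non-empty current and target keypoints")
    # Euclidean distance between each tracked keypoint and its scaffold waypoint.
    dists = [math.dist(c, t) for c, t in zip(current, target)]
    return math.exp(-scale * sum(dists) / len(dists))

def residual_action(base_action, residual, limit=0.05):
    """Add a clipped learned correction to a base action derived from the scaffold.

    ASSUMPTION: per-dimension clipping with `limit` is a hypothetical choice.
    """
    return [b + max(-limit, min(limit, r)) for b, r in zip(base_action, residual)]
```

Because the reward depends only on keypoint distances, the same function can be reused across tasks, which is the sense in which such rewards are task-agnostic. The clipping keeps the learned correction small relative to the coarse scaffold it refines.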

Paper Structure

This paper contains 72 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Overview of our method: a VLM generates hand and object keypoint trajectories from a language command and scene image. A low-level residual RL policy is trained to track these trajectories in simulation.
  • Figure 2: a) Training: a high-level VLM predicts 3D keypoint plans, which a low-level policy learns to track via RL. b) Inference: new plans are generated by the VLM, which are executed by the frozen low-level policy.
  • Figure 3: A depiction of the eight tasks used for evaluation. Each task belongs to one of four overarching categories.
  • Figure 4: Results on the simulation task suite. Success rate (in %) is averaged across 3 seeds and uncertainty is given by the standard error. Our method performs nearly as well as the oracle with perfectly scripted plans.
  • Figure 5: (Left) The performance of our method as we iteratively refine the high-level policy $\pi^h$ by providing successful plans $\tau$ in-context. (Right) The projected 3D plans on the evaluation set for each iteration.
  • ...and 6 more figures
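
The iterative refinement mentioned in the Figure 5 caption, where successful plans are fed back to the high-level policy in-context, could be sketched roughly as follows. This is a hedged illustration only: `propose` stands in for a VLM call and `succeeds` for a rollout of the low-level policy. Both callables, the `Plan` type, and the loop parameters are hypothetical, and how the authors actually select and format in-context plans is not stated in this section.

```python
from typing import Callable, List, Sequence, Tuple

Plan = List[Tuple[float, float, float]]  # hypothetical: one 3D waypoint per step

def refine_in_context(
    propose: Callable[[str, Sequence[Plan]], Plan],
    succeeds: Callable[[Plan], bool],
    command: str,
    iterations: int = 3,
    samples_per_iteration: int = 4,
) -> List[Plan]:
    """Collect successful plans and reuse them as in-context examples."""
    examples: List[Plan] = []
    for _ in range(iterations):
        new_successes: List[Plan] = []
        for _ in range(samples_per_iteration):
            # The planner sees only examples gathered in earlier iterations.
            plan = propose(command, examples)
            if succeeds(plan):
                new_successes.append(plan)
        examples.extend(new_successes)
    return examples

# Stand-ins for the VLM and the policy rollout (assumptions for the demo).
def stub_propose(command: str, examples: Sequence[Plan]) -> Plan:
    return [(0.0, 0.0, 0.1 * (len(examples) + 1))]

def stub_succeeds(plan: Plan) -> bool:
    return plan[-1][2] >= 0.1

plans = refine_in_context(stub_propose, stub_succeeds, "open the cabinet",
                          iterations=2, samples_per_iteration=2)
```

Updating the example set only at the end of each iteration keeps every sample within an iteration conditioned on the same context, which makes per-iteration success rates comparable.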