RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation

Haichao Liu, Sikai Guo, Pengfei Mai, Jiahang Cao, Haoang Li, Jun Ma

TL;DR

RoboDexVLM tackles open-vocabulary, long-horizon dexterous manipulation by combining a VLM-driven task planner with a modular library of eight skill primitives and a perception-action loop. It defines standardized interaction primitives and a memory-augmented workflow to translate natural language commands into sequences of dexterous actions, and it recovers from failed operations through reflection prompts. The system integrates language-guided segmentation, zero-shot grasp perception, and kinematics-based pose generation to enable context-aware manipulation on a real UR5 with an Inspire hand, demonstrating zero-shot capabilities and improved reliability over baselines. Collectively, the approach advances general-purpose, open-world manipulation by enabling semantic reasoning, dynamic replanning, and resilient execution in unstructured environments.

Abstract

This paper introduces RoboDexVLM, an innovative framework for robot task planning and grasp detection tailored for a collaborative manipulator equipped with a dexterous hand. Previous methods focus on simplified and limited manipulation tasks, which often neglect the complexities associated with grasping a diverse array of objects in a long-horizon manner. In contrast, our proposed framework utilizes a dexterous hand capable of grasping objects of varying shapes and sizes while executing tasks based on natural language commands. The proposed approach has the following core components: First, a robust task planner with a task-level recovery mechanism that leverages vision-language models (VLMs) is designed, which enables the system to interpret and execute open-vocabulary commands for long sequence tasks. Second, a language-guided dexterous grasp perception algorithm is presented based on robot kinematics and formal methods, tailored for zero-shot dexterous manipulation with diverse objects and commands. Comprehensive experimental results validate the effectiveness, adaptability, and robustness of RoboDexVLM in handling long-horizon scenarios and performing dexterous grasping. These results highlight the framework's ability to operate in complex environments, showcasing its potential for open-vocabulary dexterous manipulation. Our open-source project page can be found at https://henryhcliu.github.io/robodexvlm.

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our RoboDexVLM. The multimodal prompt, comprising human command, available skill list, RGB-D image, and relevant memory items, is transmitted to the VLM for task planning. Upon receiving the skill invoking sequence from the VLM, the dexterous robot executes the skills until task completion. Dashed lines indicate the recovery process following failed operations.
  • Figure 2: Working pipeline of the proposed RoboDexVLM. The system comprises several complementary modules designed to facilitate a closed-loop manipulation framework. The task manager orchestrates the execution of $\mathcal{F}_i$ based on $\mathcal{O}$ generated by the VLM. Skills are performed through the grounding of the four foundational capabilities established at the core of the system.
  • Figure 3: Robot manipulation system settings for RoboDexVLM. The robot manipulator (UR5) grasps and manipulates objects on the table and interacts with the drawer or other containers using a dexterous hand (Inspire Hand) as the end effector, with perception provided by an RGB-D camera (RealSense D435i) mounted on the hand. The coordinate frames of the base $\{B\}$, hand $\{H\}$, end-effector $\{E\}$, and camera $\{C\}$ are illustrated accordingly.
  • Figure 4: The dexterous grasping pose generation process. The object name for segmentation is provided at the top of the figure. In the RGB image with mask, the semantic segmentation masks of the object described by text are marked accordingly. The blue anchors in the images of the grasping pose area are the grasp perception results.
  • Figure 5: Demonstration of long-horizon dexterous manipulation. The input of the RoboDexVLM framework is one sentence describing the task to be completed. The relevant skills are invoked automatically to interact with the objects for the open-vocabulary task. The corresponding videos are accessible on our project page: https://henryhcliu.github.io/RoboDexVLM.
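
The closed loop described in the Figure 1 caption (multimodal prompt in, skill sequence out, execution until completion, recovery after a failed operation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Prompt` fields mirror the caption, but `run_task`, the `planner` and `capture` callables, the skill names, and the wording of the reflection message are all hypothetical stand-ins chosen for illustration, and the VLM is replaced by a stub function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Prompt:
    """Multimodal prompt fields as listed in the Figure 1 caption."""
    command: str                      # human command
    skills: List[str]                 # available skill list
    image: object                     # RGB-D frame (placeholder type)
    memory: List[str]                 # relevant memory items
    reflection: Optional[str] = None  # failure feedback (assumed format)


Planner = Callable[[Prompt], List[str]]  # stand-in for the VLM call
Skill = Callable[[], bool]               # returns True on success


def run_task(command: str,
             skills: Dict[str, Skill],
             planner: Planner,
             capture: Callable[[], object],
             max_replans: int = 2) -> Tuple[bool, List[str]]:
    """Plan, execute skills in order, and replan after a failed skill."""
    memory: List[str] = []
    reflection: Optional[str] = None
    for _ in range(max_replans + 1):
        prompt = Prompt(command, list(skills), capture(), list(memory), reflection)
        plan = planner(prompt)
        failed: Optional[str] = None
        for name in plan:
            skill = skills.get(name)
            ok = skill() if skill is not None else False  # unknown skill counts as failure
            memory.append(f"{name}: {'ok' if ok else 'failed'}")
            if not ok:
                failed = name
                break
        if failed is None:
            return True, memory
        # Hypothetical reflection message fed back into the next prompt.
        reflection = f"Skill '{failed}' failed; propose a revised skill sequence."
    return False, memory


if __name__ == "__main__":
    attempts = {"grasp": 0}

    def flaky_grasp() -> bool:
        # Fails on the first call to exercise the recovery path.
        attempts["grasp"] += 1
        return attempts["grasp"] > 1

    demo_skills = {"detect": lambda: True, "grasp": flaky_grasp, "place": lambda: True}
    stub_planner = lambda prompt: ["detect", "grasp", "place"]
    success, log = run_task("put the apple in the bowl", demo_skills,
                            stub_planner, capture=lambda: None)
    print(success, log)
```

In this sketch the memory list doubles as an execution log, so each replanning prompt carries the outcome of earlier skills; a real system would presumably store richer items and query an actual VLM in place of `stub_planner`.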