DexDiff: Towards Extrinsic Dexterity Manipulation of Ungraspable Objects in Unrestricted Environments

Chengzhong Ma, Houxue Yang, Hanbo Zhang, Zeyang Liu, Chao Zhao, Jian Tang, Xuguang Lan, Nanning Zheng

TL;DR

DexDiff tackles the manipulation of large, flat, ungraspable objects in unrestricted environments by combining a fine-tuned vision-language model for perception and high-level planning with a goal-conditioned action diffusion (GCAD) policy for long-horizon action sequencing. The approach grounds action planning in environmental structure through plan returns (return-to-go) and learns horizon-aware policies via a denoising diffusion process. DexDiff reaches about 70% average success in simulation and around 65% average success in real-world deployment, and it generalizes to unseen objects and extrinsic-dexterity configurations. The work demonstrates the practical potential of robust extrinsic-dexterity manipulation and provides a benchmark for integrating perception, planning, and diffusion-based control in ungraspable tasks.

Abstract

Grasping large, flat objects (e.g., a book or a pan) is often regarded as an ungraspable task, which poses significant challenges because the required grasp poses are unreachable. Previous works leverage extrinsic dexterity, such as walls or table edges, to grasp such objects. However, they are limited to task-specific policies and lack the task planning needed to find pre-grasp conditions, which makes it difficult to adapt to varied environments and extrinsic dexterity constraints. Therefore, we present DexDiff, a robust robotic manipulation method for long-horizon planning with extrinsic dexterity. Specifically, we use a vision-language model (VLM) to perceive the environmental state and generate high-level task plans, followed by a goal-conditioned action diffusion (GCAD) model that predicts the sequence of low-level actions. This model learns the low-level policy from offline data, using the cumulative reward guided by high-level planning as the goal condition, which improves the prediction of robot actions. Experimental results demonstrate that our method not only performs ungraspable tasks effectively but also generalizes to previously unseen objects. It achieves a 47% higher success rate than the baselines in simulation and facilitates efficient deployment and manipulation in real-world scenarios.
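
The "cumulative reward ... as the goal condition" mentioned in the abstract is commonly realized as a return-to-go value attached to each timestep of an offline trajectory. The following is a minimal sketch of that computation under stated assumptions, not the authors' implementation: the function name `returns_to_go`, the discount parameter `gamma`, and the sparse-reward example episode are all hypothetical and chosen only for illustration.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Compute R_t = sum over k >= t of gamma**(k - t) * r_k for every step t.

    Hypothetical helper: the paper's exact reward design and discounting
    are not given in this summary, so gamma defaults to 1.0 (undiscounted).
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step reuses the suffix sum.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Assumed sparse-reward episode: push, rotate, then a successful grasp.
episode_rewards = [0.0, 0.0, 1.0]
print(returns_to_go(episode_rewards))  # [1.0, 1.0, 1.0]
```

In a goal-conditioned setup of this kind, each value would be paired with the observation at the same timestep and passed to the policy as its conditioning signal, so that at test time a high target return asks the policy for actions consistent with successful plans.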

Paper Structure

This paper contains 22 sections, 5 equations, 14 figures, and 4 tables.

Figures (14)

  • Figure 1: The robot may not grasp large flat objects on a tabletop from the top down. With the help of extrinsic dexterity, high-level task plans can be realized: [Left] Push the object against the wall, then rotate and grasp it from the side. [Right] Push the object to the edge of the table to keep it hanging and grasp it from the side.
  • Figure 2: Our DexDiff method primarily consists of two modules: the high-level perception and task planning module based on the VLM and the low-level action prediction and motion planning module based on our GCAD model.
  • Figure 3: The simulation environments: (a) Basic, (b) Empty, (c) Broad, (d) Surround.
  • Figure 4: Compared to traditional heuristic methods based on image segmentation and fixed rules.
  • Figure 5: We evaluate the generalization ability of GCAD in different experiment settings.
  • ...and 9 more figures