Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang

TL;DR

This paper introduces Dynamics-Aware Policy Learning (DAPL), a framework that conditions policy learning on a learned representation of contact-induced object dynamics in cluttered environments, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping.

Abstract

Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.

Paper Structure (31 sections, 31 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Visualizations of representative scenarios within the Sparse clutter level.
  • Figure 2: Overview of the proposed two-stage learning framework. Stage 1 (World Model Pretraining): The model takes a cropped point cloud augmented with physical attributes (mass and velocity) as input. A Transformer-based architecture encodes these inputs into dynamics features ($f_{dy}$), which are used by an MLP decoder to predict future per-point positions and velocities conditioned on robot actions. Stage 2 (Policy Learning): The pre-trained dynamics representations ($f_{dy}$) are fed into an Actor-Critic policy network alongside proprioceptive data and task goals to facilitate efficient policy learning within a physical simulator.
  • Figure 2: Visualizations of representative scenarios within the Moderate clutter level.
  • Figure 3: Overview of our proposed Clutter6D Benchmark and real-world setup. (a) Representative examples of the Clutter6D Benchmark with three levels of scene density: Sparse (4 objects), Moderate (8 objects), and Dense (12 objects). (b) The real-world experimental setup consisting of a Franka Research 3 robot and three Intel RealSense cameras for point cloud acquisition, along with the complete set of objects used in our real-world experiments.
  • Figure 3: Visualizations of representative scenarios within the Dense clutter level.
  • ...and 11 more figures
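
The data flow described in the caption of the two-stage framework overview can be sketched roughly as follows. This is only an illustrative sketch of tensor shapes, not the authors' implementation: single random linear layers stand in for the Transformer encoder and the MLP decoder, the dynamics feature $f_{dy}$ is assumed to be one pooled vector per scene, and all names and sizes (`FEAT_DIM`, `ACTION_DIM`, the proprioception and goal lengths) are hypothetical.

```python
import random

POINT_DIM = 7    # per-point input: xyz (3) + mass (1) + velocity (3), as in the caption
FEAT_DIM = 8     # hypothetical size of the dynamics feature f_dy
ACTION_DIM = 6   # hypothetical robot action size
PRED_DIM = 6     # per-point output: future xyz (3) + future velocity (3)


def make_layer(rng, n_in, n_out):
    """Random weights standing in for a trained layer (assumption, not the paper's model)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def encode(points, w_enc):
    """Stage 1 encoder stand-in: project each point, then mean-pool into f_dy."""
    per_point = [apply_layer(w_enc, p) for p in points]
    return [sum(col) / len(points) for col in zip(*per_point)]


def decode(points, f_dy, action, w_dec):
    """Stage 1 decoder stand-in: predict future position and velocity per point,
    conditioned on the dynamics feature and the robot action."""
    return [apply_layer(w_dec, list(p[:3]) + f_dy + action) for p in points]


def policy_input(f_dy, proprio, goal):
    """Stage 2: the actor-critic observation concatenates f_dy, proprioception, and goal."""
    return f_dy + proprio + goal


rng = random.Random(0)
points = [[rng.random() for _ in range(POINT_DIM)] for _ in range(16)]
action = [0.0] * ACTION_DIM
w_enc = make_layer(rng, POINT_DIM, FEAT_DIM)
w_dec = make_layer(rng, 3 + FEAT_DIM + ACTION_DIM, PRED_DIM)

f_dy = encode(points, w_enc)                      # dynamics feature
pred = decode(points, f_dy, action, w_dec)        # per-point future state
obs = policy_input(f_dy, proprio=[0.0] * 9, goal=[0.0] * 7)
```

In this sketch the world-model loss would compare `pred` against simulated future point states during pretraining, after which only `encode` is reused to build the policy observation.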