AffordDP: Generalizable Diffusion Policy with Transferable Affordance

Shijie Wu, Yihang Zhu, Yunao Huang, Kaizhen Zhu, Jiayuan Gu, Jingyi Yu, Ye Shi, Jingya Wang

TL;DR

AffordDP introduces a diffusion-based imitation learning framework that generalizes manipulation to unseen objects and categories by transferring 3D static and dynamic affordances through a 6D transform and by guiding diffusion sampling with affordance priors. It builds an affordance memory from foundation-model features and uses ICP-based registration to align dynamic affordances, conditioning a DDIM-based diffusion policy on scene, state, and affordances. An adaptive affordance-guided sampling process steers actions toward the target affordance without leaving the action manifold, yielding robust performance in simulation and real-world tasks with unseen objects. Experiments show superior generalization over two diffusion baselines across object instances and categories, with ablations highlighting the importance of trajectory information and guidance for zero-shot transfer.

Abstract

Diffusion-based policies have shown impressive performance in robotic manipulation tasks while struggling with out-of-domain distributions. Recent efforts attempted to enhance generalization by improving the visual feature encoding for diffusion policy. However, their generalization is typically limited to the same category with similar appearances. Our key insight is that leveraging affordances--manipulation priors that define "where" and "how" an agent interacts with an object--can substantially enhance generalization to entirely unseen object instances and categories. We introduce the Diffusion Policy with transferable Affordance (AffordDP), designed for generalizable manipulation across novel categories. AffordDP models affordances through 3D contact points and post-contact trajectories, capturing the essential static and dynamic information for complex tasks. The transferable affordance from in-domain data to unseen objects is achieved by estimating a 6D transformation matrix using foundational vision models and point cloud registration techniques. More importantly, we incorporate affordance guidance during diffusion sampling that can refine action sequence generation. This guidance directs the generated action to gradually move towards the desired manipulation for unseen objects while keeping the generated action within the manifold of action space. Experimental results from both simulated and real-world environments demonstrate that AffordDP consistently outperforms previous diffusion-based methods, successfully generalizing to unseen instances and categories where others fail.
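As a rough illustration of what applying an estimated 6D transformation to a post-contact trajectory could look like, the sketch below maps trajectory waypoints through a rigid transform $T = [\mathbf{R}|\mathbf{t}]$. The function names, the example rotation, and the waypoint values are all assumptions made for illustration; the paper's actual pipeline (and how it handles gripper orientation) is not specified in this summary.

```python
import numpy as np


def is_rotation(R, tol=1e-6):
    """Check that R is a proper rotation: orthonormal with determinant +1."""
    R = np.asarray(R, dtype=float)
    return (
        R.shape == (3, 3)
        and np.allclose(R @ R.T, np.eye(3), atol=tol)
        and abs(np.linalg.det(R) - 1.0) < tol
    )


def transfer_trajectory(trajectory, R, t):
    """Map (N, 3) waypoints through p' = R p + t.

    Hypothetical helper: positions only, orientations are ignored here.
    """
    if not is_rotation(R):
        raise ValueError("R must be a 3x3 rotation matrix")
    trajectory = np.asarray(trajectory, dtype=float)
    # Row-vector form of R @ p for every waypoint at once.
    return trajectory @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)


# Illustrative values only: a 90-degree rotation about z plus a translation.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.5, 0.0, 0.1])

# A source "pull" trajectory moving along +x from the contact point.
source_traj = np.array([[1.0, 0.0, 0.0],
                        [1.1, 0.0, 0.0],
                        [1.2, 0.0, 0.0]])

new_traj = transfer_trajectory(source_traj, R, t)
print(new_traj)
```

In this toy case the transferred trajectory moves along +y instead of +x, which is the expected effect of rotating the source motion by 90 degrees about the z axis.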

Paper Structure

This paper contains 34 sections, 14 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: We propose AffordDP, a novel diffusion-based imitation learning method for generalizable robotic manipulation. AffordDP leverages rich manipulation priors from transferable affordances to enhance generalization to unseen scenarios and incorporates affordance guidance to enable precise control.
  • Figure 2: Overview of AffordDP. The left part demonstrates static and dynamic affordance transfer. Given the target scene RGB-D image, AffordDP retrieves a similar object in the affordance memory and transfers its static and dynamic affordance to the target object. The right part illustrates the key components of affordance-guided diffusion policy. Conditioned on 3D affordance, point cloud observation, and robot proprioception, AffordDP utilizes the Diffusion Policy and adaptive affordance guidance for precise control.
  • Figure 3: Overview of Static and Dynamic Affordance Transfer. The left part is static affordance transfer, and the right part is dynamic affordance transfer. The source and target images are processed through SD-DINOv2 [zhang2024tale] to generate feature maps $F^S$ and $F^T$. The similarity is computed and used to find the corresponding target image points using an argmax operation. These points are projected back to obtain point clouds. Point-SAM [zhou2024point] is then used to obtain the source and target part point clouds. The Iterative Closest Point (ICP) algorithm is applied to the point clouds to determine the transformation matrix $T = [\mathbf{R}|\mathbf{t}]$ for dynamic affordance transfer.
  • Figure 4: Policy rollouts in the real world. The left part illustrates the three real-world tasks: PullDrawer, OpenDoor, and Pick&Place. The right part shows the policy rollout results. Lacking spatial perception and adequate training data (which can require over 200 demonstrations [Chi-RSS-23, Ze2024DP3]), DP fails to target the object correctly, often resulting in random and potentially unsafe actions. Lacking comprehensive static and dynamic affordance information, DP3 struggles to grasp the target accurately, even when it moves the gripper close to the target. In contrast, our method effectively grasps the target object and completes the task, demonstrating accurate spatial targeting.
  • Figure 5: Real world experiment setup.
  • ...and 4 more figures
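
The correspondence step described in the Figure 3 caption (compute similarity between feature maps $F^S$ and $F^T$, then take an argmax) can be sketched as follows. This is a minimal sketch under stated assumptions: cosine similarity is assumed as the similarity measure, the feature maps are random arrays standing in for SD-DINOv2 outputs, and `match_point` is a hypothetical helper name. Back-projection of the matched pixel to 3D using depth is omitted.

```python
import numpy as np


def match_point(feat_src, feat_tgt, src_pixel, eps=1e-8):
    """Find the target pixel whose feature best matches a source pixel.

    feat_src, feat_tgt: (H, W, C) feature maps (stand-ins for F^S and F^T).
    src_pixel: (row, col) of the source contact point in image space.
    Returns the (row, col) of the best match and the (H, W) similarity map.
    """
    row, col = src_pixel
    query = feat_src[row, col]
    query = query / (np.linalg.norm(query) + eps)

    # Normalize every target feature so the dot product is cosine similarity.
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=-1, keepdims=True) + eps)
    sim = tgt @ query  # shape (H, W)

    # Argmax over the flattened map, then recover 2D pixel coordinates.
    best_row, best_col = np.unravel_index(int(np.argmax(sim)), sim.shape)
    return (int(best_row), int(best_col)), sim


# Toy data: random features, with the source feature planted in the target
# at a known location (scaled, since cosine similarity ignores magnitude).
rng = np.random.default_rng(0)
feat_src = rng.normal(size=(4, 6, 16))
feat_tgt = rng.normal(size=(4, 6, 16))
feat_tgt[2, 5] = 3.0 * feat_src[1, 3]

match, sim = match_point(feat_src, feat_tgt, (1, 3))
print(match)
```

The planted feature at target pixel (2, 5) has cosine similarity close to 1 with the query, so the argmax recovers that location regardless of the scale factor.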