UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation

Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei

TL;DR

The paper introduces Unsupervised Affordance Distillation (UAD), a framework that distills fine-grained, instruction-conditioned affordances from foundation models into a compact, task-conditioned predictor trained without manual annotations. UAD fuses multi-view DINOv2 features, uses vision-language prompts to link object regions with tasks, and applies FiLM conditioning to produce continuous pixel-level affordance maps, which serve as the observation space for imitation-learning policies. Although trained only on rendered objects in simulation, the affordance model generalizes zero-shot to real robot scenes (DROID) and human activities (AGD20K), and policies built on it handle simulated and real-world manipulation tasks after as few as 10 demonstrations. The work highlights the potential of unsupervised affordance grounding from foundation models to improve generalization and sample efficiency in robotic manipulation.
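To make the FiLM conditioning named in the TL;DR concrete, here is a minimal pure-Python sketch of feature-wise linear modulation applied to per-pixel features. It is an illustration under stated assumptions, not the authors' architecture: the linear layers that map a text embedding to `gamma` and `beta`, the single linear head with a sigmoid, and every name and number in `TOY_PARAMS` are hypothetical and hand-picked. In UAD the per-pixel features would come from frozen DINOv2 and the parameters would be learned.

```python
import math

def linear(weights, bias, x):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film(pixel_features, gamma, beta):
    """FiLM: scale and shift each channel of every pixel (gamma * f + beta)."""
    return [[g * f + b for f, g, b in zip(pixel, gamma, beta)]
            for pixel in pixel_features]

def predict_affordance(pixel_features, text_embedding, params):
    """Return one affordance score in (0, 1) per pixel for one instruction."""
    # The instruction embedding decides how each feature channel is modulated.
    gamma = linear(params["w_gamma"], params["b_gamma"], text_embedding)
    beta = linear(params["w_beta"], params["b_beta"], text_embedding)
    modulated = film(pixel_features, gamma, beta)
    # Hypothetical head: a linear projection to a logit, then a sigmoid.
    logits = [sum(w * f for w, f in zip(params["w_head"], pixel)) + params["b_head"]
              for pixel in modulated]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Toy, hand-picked parameters (assumption): 2 feature channels, 2-dim text embedding.
TOY_PARAMS = {
    "w_gamma": [[1.0, 0.0], [0.0, 1.0]], "b_gamma": [0.0, 0.0],
    "w_beta": [[0.0, 0.0], [0.0, 0.0]], "b_beta": [0.0, 0.0],
    "w_head": [1.0, -1.0], "b_head": 0.0,
}

if __name__ == "__main__":
    features = [[2.0, 5.0], [0.0, 0.0]]  # two "pixels", two channels each
    print(predict_affordance(features, [1.0, 0.0], TOY_PARAMS))
    print(predict_affordance(features, [0.0, 1.0], TOY_PARAMS))
```

The two calls use the same visual features with different instruction embeddings. The first pixel scores high under one instruction and low under the other, which is the behavior a task-conditioned affordance map needs.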

Abstract

Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods for visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed $\langle$instruction, visual affordance$\rangle$ pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite being trained only on rendered objects in simulation. Using the affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/

Paper Structure

This paper contains 20 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Unsupervised Affordance Distillation (UAD) extracts affordance annotations from large pre-trained models and distills them into a task-conditioned affordance model, which is capable of predicting fine-grained affordance in open-world scenes with open-ended instructions, enabling diverse generalization properties in downstream policy learning.
  • Figure 2: Overview of Unsupervised Affordance Distillation (UAD). Using renderings of 3D objects, we first perform multi-view fusion of DINOv2 features and clustering to obtain fine-grained semantic regions of objects, which are then fed to a VLM that proposes relevant tasks and corresponding regions (a). The extracted affordance is then distilled by training a language-conditioned FiLM atop frozen DINOv2 features (b). The learned task-conditioned affordance model provides in-the-wild predictions for diverse fine-grained regions, which are used as the observation space for manipulation policies (c).
  • Figure 3: Tasks for evaluating UAD. Left: tasks in simulation along with different generalization requirements. Right: tasks in the real world and the corresponding success rate achieved by UAD-based policies.
  • Figure 4: Task-conditioned affordance prediction results on the DROID dataset. Average AUC scores (evaluated on the entire dataset): 0.500 (CLIP), 0.836 (OpenSeeD), 0.840 (Ours).
  • Figure 5: Zero-shot generalization to affordance predictions in human activities from AGD20K.
  • ...and 3 more figures
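
Figure 4 reports average AUC scores for task-conditioned affordance prediction. The sketch below shows one common way to compute such a score, assuming AUC means area under the ROC curve over the pixels of one image, with a binary ground-truth mask as labels. This listing does not state the exact evaluation protocol, so treat the function and its name `roc_auc` as an illustrative assumption rather than the paper's evaluation code.

```python
def roc_auc(scores, labels):
    """ROC AUC as the chance that a random positive pixel outscores a
    random negative pixel, counting ties as half a win."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pixel")
    # Pairwise comparison is quadratic in the pixel count; fine for a sketch,
    # but a rank-based formulation is preferable for full-resolution images.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

if __name__ == "__main__":
    predicted = [0.9, 0.8, 0.2, 0.1]   # flattened affordance map (toy values)
    mask = [1, 1, 0, 0]                # flattened binary ground-truth region
    print(roc_auc(predicted, mask))
```

Under this definition a score of 1.0 means every pixel inside the target region is ranked above every pixel outside it, and 0.5 corresponds to chance-level ranking.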