RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation

Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, Ted Xiao

TL;DR

RT-Affordance introduces a hierarchical affordance-based policy that conditions manipulation on an affordance plan $q$ derived from language $l$ and perception $o$. An affordance predictor $\phi(q|l,o)$ produces this plan at test time, so no extra demonstrations are needed. By projecting $q$ onto the input image and conditioning the policy $\pi(a|l,o,q)$ on this guidance, the approach leverages web-scale data and cheap in-domain affordance images to achieve up to roughly 70% success on novel tasks, significantly outperforming language- or goal-conditioned baselines and showing robustness to distribution shifts. The combination of an affordance generator and an affordance-conditioned policy enables scalable, data-efficient learning for diverse tasks, including grasping and placement, while maintaining performance in unseen settings. Limitations include incomplete generalization to entirely new motion types, motivating future work to fuse multiple intermediate representations for broader capabilities.
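The two-stage structure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `AffordancePlan`, `overlay`, `run_episode`, and all callback names are hypothetical, the image is a toy nested list, and marking single pixels stands in for whatever rendering the real system uses. It shows the control flow only: the plan $q$ is predicted once from the instruction and the initial observation, and the policy is then queried at every step on observations overlaid with that plan.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Pixel = Tuple[int, int]   # (x, y) image coordinates
Image = List[List[int]]   # toy grayscale image: rows of pixel values
Action = Any              # placeholder; the real action space is not specified here


@dataclass
class AffordancePlan:
    """Hypothetical container for q: key robot poses as pixel locations."""
    keypoints: List[Pixel]


def overlay(image: Image, plan: AffordancePlan, marker: int = 255) -> Image:
    """Project q onto the observation by marking each keypoint (returns a copy)."""
    out = [row[:] for row in image]
    height, width = len(out), len(out[0])
    for x, y in plan.keypoints:
        if 0 <= x < width and 0 <= y < height:  # skip keypoints outside the image
            out[y][x] = marker
    return out


def run_episode(
    instruction: str,
    get_observation: Callable[[], Image],
    predict_affordance: Callable[[str, Image], AffordancePlan],  # stands in for phi(q | l, o)
    policy: Callable[[str, Image], Action],                      # stands in for pi(a | l, o, q)
    apply_action: Callable[[Action], None],
    max_steps: int,
) -> List[Action]:
    """Predict the affordance plan once, then act on plan-overlaid observations."""
    plan = predict_affordance(instruction, get_observation())
    actions: List[Action] = []
    for _ in range(max_steps):
        guided_obs = overlay(get_observation(), plan)  # q enters through the image
        action = policy(instruction, guided_obs)
        apply_action(action)
        actions.append(action)
    return actions
```

Passing the predictor and policy as callbacks keeps the sketch independent of any particular model. A user-specified plan could be substituted for the predictor's output without changing the control loop.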

Abstract

We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos available at https://snasiriany.me/rt-affordance

Paper Structure

This paper contains 13 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Bridging robot and internet data via affordances. Prior work has shown the utility of co-training on robot and web datasets. However, robot actions and web content are still disjoint in their structure. We propose using affordances as a means to bridge this gap. Reasoning about affordances requires semantic and spatial reasoning, which VQA and spatial reasoning tasks such as object detection also demand. By incorporating affordance reasoning explicitly in robot control tasks, we can better transfer knowledge from these web datasets to robot control tasks.
  • Figure 2: Comparison of policy interfaces. Conditioning on language is intuitive, yet language typically does not provide enough guidance on how to perform the task. Goal images and trajectory sketches are typically over-specified and present learning challenges. We propose conditioning policies on intermediate affordance representations, which are expressive yet compact representations of tasks, making them easy to specify and to learn.
  • Figure 3: Model overview. Our hierarchical model first predicts the affordance plan given the task language and initial image of the task. We overlay the affordance (pixel xy values in raw text form) onto the image, and subsequently condition the policy on images overlaid with the affordance plan. We co-train the model on web datasets (largest data source), robot trajectories, and a modest number of cheap-to-collect images labeled with affordances.
  • Figure 4: Evaluation of the affordance prediction model on out-of-distribution scenarios. We perform a comprehensive evaluation of the affordance prediction model on in-distribution and out-of-distribution (OOD) scenarios and observe a graceful degradation of performance in OOD settings.
  • Figure 5: Robustness to out-of-distribution factors. We show examples of successful and incorrect predictions of our affordance prediction model across in-distribution and out-of-distribution settings. Successful predictions are highlighted in green and incorrect predictions are highlighted in red.