Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, Huazhe Xu

TL;DR

Robo-ABC tackles the challenge of generalizing manipulation affordances to unseen objects by building an affordance memory from human videos and transferring interaction knowledge to novel objects through diffusion-model–driven semantic correspondence. The pipeline retrieves visually or semantically similar reference examples, maps their contact points to the new object, and guides grasping with a downstream grasp planner, all without additional annotation or training. Quantitatively, it improves visual affordance retrieval accuracy by 31.6% over state-of-the-art end-to-end affordance models and reaches an 85.7% success rate in real-world cross-category grasping experiments across seven object categories. The approach relies on the correspondence ability of pre-trained foundation models for cross-category generalization, which makes open-world robotic manipulation more flexible and has practical implications for service robots and automation.

Abstract

Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in the understanding of semantic correspondence among objects, which naturally transfers the interaction experience of familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vast availability of human videos on the Internet may serve as a valuable resource, from which we extract an affordance memory including the contact points. Inspired by the natural way humans think, we propose Robo-ABC: when confronted with unfamiliar objects that require generalization, the robot can acquire affordance by retrieving objects that share visual or semantic similarities from the affordance memory. The next step is to map the contact points of the retrieved objects to the new object. While establishing this correspondence may present formidable challenges at first glance, recent research finds it naturally arises from pre-trained diffusion models, enabling affordance mapping even across disparate object categories. Through the Robo-ABC framework, robots may generalize to manipulate out-of-category objects in a zero-shot manner without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively, Robo-ABC significantly enhances the accuracy of visual affordance retrieval by a large margin of 31.6% compared to state-of-the-art (SOTA) end-to-end affordance models. We also conduct real-world experiments of cross-category object-grasping tasks. Robo-ABC achieved a success rate of 85.7%, proving its capacity for real-world tasks.
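The two steps described in the abstract, retrieving a similar object from the affordance memory and mapping its contact point onto the new object, can be illustrated with a minimal sketch. This is not the authors' implementation: the `MemoryEntry` structure, the use of cosine similarity for both steps, and the toy feature grids are assumptions made for illustration. In the paper, the per-pixel features come from a pre-trained diffusion model; here they are plain lists of floats.

```python
import math
from dataclasses import dataclass

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class MemoryEntry:
    """One item of a hypothetical affordance memory (names are illustrative)."""
    name: str
    embedding: Vector                 # global image embedding used for retrieval
    feature_map: list[list[Vector]]   # H x W grid of per-pixel features
    contact: tuple[int, int]          # (row, col) of the contact point


def retrieve(memory: list[MemoryEntry], query_embedding: Vector) -> MemoryEntry:
    """Return the memory entry whose embedding is most similar to the query."""
    return max(memory, key=lambda e: cosine(e.embedding, query_embedding))


def transfer_contact(entry: MemoryEntry,
                     target_features: list[list[Vector]]) -> tuple[int, int]:
    """Map the source contact point to the best-matching target pixel.

    Assumption: the correspondence is the target pixel whose feature has the
    highest cosine similarity to the feature at the source contact point.
    """
    row, col = entry.contact
    source_feature = entry.feature_map[row][col]
    best_pixel, best_score = (0, 0), -math.inf
    for i, feature_row in enumerate(target_features):
        for j, feature in enumerate(feature_row):
            score = cosine(source_feature, feature)
            if score > best_score:
                best_pixel, best_score = (i, j), score
    return best_pixel
```

A caller would first embed the query image, call `retrieve` to pick the reference object, and then call `transfer_contact` with the query image's feature grid to obtain the inferred contact pixel.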

Paper Structure

This paper contains 37 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview. The left panel illustrates the key insight of Robo-ABC (the marker denotes the contact point). Our goal is to endow robots with the human-like ability to establish semantic correspondence, so that object affordances generalize across categories in manipulation tasks. The columns on the right show, in order: source images (markers denote contact points extracted from human videos), the corresponding attention maps on the target images (markers denote inferred contact points on unseen objects), grasp poses (generated according to the contact points), the point cloud during grasping, and the final successful grasp results.
  • Figure 2: Our pipeline. The top part shows how knowledge about object affordance is extracted from human-object videos. We then store this information as interaction memory, which serves as the robot's interaction experience. When facing a new object, we retrieve the most similar object from the interaction memory based on visual and semantic similarity. After obtaining the contact point information, we leverage the semantic correspondence capability of the diffusion model to achieve cross-object and out-of-category affordance generalization. Finally, we select the grasp pose to deploy on real robots from all the candidate poses generated by AnyGrasp (Fang et al., 2023). (Separate markers in the figure denote the positions for interacting with the object, all candidate grasp poses generated by AnyGrasp, and the grasp pose selected according to the contact point.)
  • Figure 3: Visualization of affordance generalization beyond categories. In each group of figures, the gap between object categories gradually increases from left to right. Distinct markers denote the contact points extracted from human videos and the inferred points found by Robo-ABC across object categories.
  • Figure 4: Visualization of the affordance results. The highlighted areas are the ground-truth masks, while the markers indicate the contact points predicted by the different methods.
  • Figure 5: Success rate by category. We demonstrate the performance of Robo-ABC and other baselines across various object categories within the entire evaluation dataset. As can be seen, in the vast majority of cases, Robo-ABC exhibits superior zero-shot generalization capabilities.
  • ...and 3 more figures
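
The final step in the Figure 2 caption, choosing one grasp pose from the candidates produced by a grasp generator, might look like the sketch below. The selection rule is an assumption, since the caption does not state it: here the candidate closest to the 3D contact point wins, with the generator's confidence score as a tie-breaker. `GraspCandidate`, `select_grasp`, and the `max_dist` threshold are hypothetical names and values, not part of AnyGrasp's API.

```python
import math
from dataclasses import dataclass
from typing import Optional

Point3D = tuple[float, float, float]


@dataclass
class GraspCandidate:
    """A simplified grasp pose: gripper position plus a confidence score."""
    position: Point3D   # gripper center in the camera frame, in meters
    score: float        # confidence reported by the grasp generator


def select_grasp(candidates: list[GraspCandidate],
                 contact_xyz: Point3D,
                 max_dist: float = 0.05) -> Optional[GraspCandidate]:
    """Pick the candidate nearest to the contact point.

    Candidates farther than `max_dist` (an assumed threshold) are discarded;
    ties in distance are broken in favor of the higher score. Returns None
    when no candidate is close enough.
    """
    nearby = [g for g in candidates
              if math.dist(g.position, contact_xyz) <= max_dist]
    if not nearby:
        return None
    return min(nearby,
               key=lambda g: (math.dist(g.position, contact_xyz), -g.score))
```

Returning `None` rather than the globally nearest pose is a deliberate choice in this sketch: it lets the caller fall back to another strategy instead of grasping a part of the object far from the inferred contact point.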