RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, Yue Wang

TL;DR

RAM tackles zero-shot robotic manipulation by building a large, multi-source affordance memory and using a hierarchical retrieval pipeline to select relevant demonstrations. It transfers 2D affordances to 3D executable actions through VFM-guided correspondence and a sampling-based lifting process, enabling cross-embodiment generalization without in-domain demonstrations. Across simulation and real-world experiments, RAM outperforms both training-based and prior retrieve-and-transfer baselines, and it shows strong potential for downstream uses such as autonomous data collection, one-shot imitation, and LLM/VLM-integrated planning. The work demonstrates a scalable approach to cross-domain manipulation, while noting limitations in long-horizon tasks and calling for future improvements in task planning and direct 2D-to-3D lifting.

Abstract

This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations, including robotic data, human-object interaction (HOI) data, and custom data, to construct a comprehensive affordance memory. Then, given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers this out-of-domain 2D affordance to in-domain 3D executable affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that RAM consistently outperforms existing works in diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation. For more details, please check our website at https://yxkryptonite.github.io/RAM/.
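
The hierarchical retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three stages (task match, semantic shortlist, geometric pick), the dictionary fields `task`, `semantic`, and `geometric`, and the function name `hierarchical_retrieve` are all assumptions made for illustration, and plain cosine similarity stands in for whatever feature comparison the paper actually uses.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def hierarchical_retrieve(memory, task, query_semantic, query_geometric, top_k=2):
    """Coarse-to-fine retrieval over a hypothetical affordance memory.

    memory: list of dicts with assumed keys "task", "semantic", "geometric".
    Returns the selected demonstration dict, or None if no task matches.
    """
    # Stage 1 (assumed): keep demonstrations whose task label matches the instruction.
    candidates = [d for d in memory if d["task"] == task]
    if not candidates:
        return None
    # Stage 2 (assumed): shortlist the top_k candidates by semantic similarity.
    ranked = sorted(candidates,
                    key=lambda d: cosine(d["semantic"], query_semantic),
                    reverse=True)
    shortlist = ranked[:top_k]
    # Stage 3 (assumed): pick the shortlisted demo with the closest geometric feature.
    return max(shortlist, key=lambda d: cosine(d["geometric"], query_geometric))

# Toy memory with made-up 2D features, for illustration only.
memory = [
    {"id": "open_drawer_a", "task": "open_drawer", "semantic": [1.0, 0.0], "geometric": [1.0, 0.0]},
    {"id": "open_drawer_b", "task": "open_drawer", "semantic": [0.9, 0.1], "geometric": [0.0, 1.0]},
    {"id": "open_drawer_c", "task": "open_drawer", "semantic": [0.0, 1.0], "geometric": [0.0, 1.0]},
    {"id": "pickup_mug", "task": "pickup", "semantic": [1.0, 0.0], "geometric": [0.0, 1.0]},
]
best = hierarchical_retrieve(memory, "open_drawer", [1.0, 0.0], [0.0, 1.0])
```

In this toy example the semantic stage discards `open_drawer_c`, and the geometric stage then prefers `open_drawer_b` over `open_drawer_a`, so each stage only has to rank what the previous one let through.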

Paper Structure (29 sections, 3 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: (a) We extract unified affordance representation from in-the-wild multi-source demonstrations, including robotic data, HOI data, and custom data, to construct a large-scale affordance memory. Given language instructions, RAM hierarchically retrieves and transfers the 2D affordance from memory and lifts it to 3D for robotic manipulation. (b-d) Our framework shows robust generalizability across diverse objects and embodiments in various settings.
  • Figure 2: Affordance annotation of demonstrations from various data sources. We extract unified affordance information automatically (for robotic and HOI data) or with minimal human effort (for custom data).
  • Figure 3: Illustration of our hierarchical retrieval pipeline.
  • Figure 4: Affordance transfer and lifting.
  • Figure 5: Qualitative comparison of generated 2D affordance. Source demonstrations retrieved by RAM are shown in the first row.
  • ...and 2 more figures
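
To make the 2D-to-3D lifting mentioned in Figures 1 and 4 more concrete, here is a hedged sketch under a standard pinhole camera model. It assumes a known depth value at the contact pixel and known intrinsics `(fx, fy, cx, cy)`, and it assumes the candidate 3D directions are supplied by the caller; how the paper samples its candidates is not reproduced here. The function names are hypothetical.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    z = float(depth)
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def project(p, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    return np.array([fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy])

def select_direction(contact_uv, dir_2d, depth, intrinsics, candidates, step=0.01):
    """Pick the candidate 3D direction whose image projection best matches dir_2d.

    candidates: iterable of 3D direction vectors (assumed to be given, for example
    sampled around the contact point by some upstream procedure).
    Returns (3D contact point, selected unit direction).
    """
    fx, fy, cx, cy = intrinsics
    uv = np.asarray(contact_uv, dtype=float)
    p0 = backproject(uv[0], uv[1], depth, fx, fy, cx, cy)
    d2 = np.asarray(dir_2d, dtype=float)
    d2 = d2 / np.linalg.norm(d2)
    best_dir, best_score = None, -np.inf
    for c in candidates:
        c = np.asarray(c, dtype=float)
        c = c / np.linalg.norm(c)
        # Move a small step along the candidate and see where it lands in the image.
        shift = project(p0 + step * c, fx, fy, cx, cy) - uv
        norm = np.linalg.norm(shift)
        if norm < 1e-9:
            continue  # candidate points along the viewing ray; no image motion
        score = float(shift @ d2 / norm)  # cosine alignment in the image plane
        if score > best_score:
            best_dir, best_score = c, score
    return p0, best_dir

# Toy numbers, for illustration only.
intrinsics = (500.0, 500.0, 320.0, 240.0)
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
point_3d, direction_3d = select_direction((420.0, 240.0), (0.0, 1.0), 1.0,
                                          intrinsics, candidates)
```

Image-plane alignment alone cannot resolve depth ambiguity, since many 3D directions project to the same 2D direction, which is why this sketch relies on a small, pre-selected candidate set rather than searching all directions.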