A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation

Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, Yuxuan Kuang, Meng Cao, Feng Zheng, Xiaodan Liang

TL;DR

The paper tackles the problem of robust spatial affordance understanding for robotic manipulation across diverse platforms. It introduces A0, an Affordance-Aware Hierarchical Diffusion Model that learns an Embodiment-Agnostic Affordance Representation and then generates action waypoints conditioned on language and vision. Key contributions include a diffusion-based architecture with Position Offset Attention and a Spatial Information Aggregation Layer, pretraining on 1 million contact points, and strong cross-platform performance on Franka, Kinova, Realman, and Dobot, particularly for trajectory-driven tasks like wiping and stacking. Experiments demonstrate superior accuracy and efficiency compared to 2D affordance methods and Vision-Language-Action baselines, supporting the method's real-world applicability and platform-agnostic generalization.

Abstract

Robotic manipulation faces critical challenges in understanding spatial affordances--the "where" and "how" of object interactions--essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on a dataset of 1 million contact points and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
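
To make the idea of predicting "contact points and post-contact trajectories" concrete, the following is a minimal sketch of how such a representation could be stored and mapped onto an image. It is not the authors' data format: the class name `AffordanceRepresentation`, the field names, and the use of normalized (x, y) image coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: points are (x, y) in normalized image coordinates, each in [0, 1].
Point = Tuple[float, float]


@dataclass
class AffordanceRepresentation:
    """Hypothetical container: one contact point plus post-contact waypoints."""

    contact_point: Point
    post_contact: List[Point]

    def waypoints(self) -> List[Point]:
        # Full 2D path: where to touch first, then how to move afterwards.
        return [self.contact_point] + list(self.post_contact)

    def to_pixels(self, width: int, height: int) -> List[Tuple[int, int]]:
        # Map normalized coordinates to pixel indices, clamped to the image.
        pixels = []
        for x, y in self.waypoints():
            px = min(max(int(round(x * (width - 1))), 0), width - 1)
            py = min(max(int(round(y * (height - 1))), 0), height - 1)
            pixels.append((px, py))
        return pixels


# Illustrative "wipe" path: touch the board, then sweep to the right.
wipe = AffordanceRepresentation(
    contact_point=(0.25, 0.5),
    post_contact=[(0.5, 0.5), (0.75, 0.5)],
)
print(wipe.to_pixels(width=641, height=481))
```

Because nothing in this structure refers to joints or grippers, the same record could in principle be handed to a separate action execution module on any of the listed robots, which is the sense in which the representation is described as embodiment-agnostic.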

Paper Structure

This paper contains 34 sections, 8 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Comparison of different manipulation methods. $A_0$ is an object-centric hierarchical model that learns an Embodiment-Agnostic Affordance Representation.
  • Figure 2: The $A_0$ model decomposes robotic manipulation tasks into two levels: (1) high-level spatial affordance understanding and (2) low-level action execution. $A_0$ leverages an Embodiment-Agnostic Affordance Representation to predict object-centric contact points and post-contact trajectories. The architecture includes purpose-built components for affordance learning. $A_0$ is pre-trained on a large-scale dataset of contact points and fine-tuned on annotated trajectories, enabling generalization across diverse robotic platforms. Zoom in for the best view.
  • Figure 3: Overview of the $A_0$ model. The model is a transformer-based diffusion probabilistic model that predicts waypoints for robotic manipulation. We use the pre-trained Qwen2.5-7B [yang2024qwen2] and SigLIP (400M) [zhai2023sigmoid] to encode the language instruction and the images, respectively. The image from the previous time step is used to provide motion information through the proposed motion token enhancement. The image and text tokens are alternately injected as conditions via cross-attention.
  • Figure 4: Performance of MAE$\downarrow$ with pretraining on three datasets.
  • Figure 5: Evaluation on a range of complex and temporally extended tasks using the Franka Emika robot. The four tasks include opening a drawer, placing an object on a plate, pressing a button, and wiping a whiteboard. We predict 2D affordances and employ the action execution method to deploy them on the robot.
  • ...and 7 more figures
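
The Figure 3 caption says that image and text tokens are injected as conditions in alternation via cross-attention. A minimal pure-Python sketch of that alternation pattern follows. It is an illustration under stated assumptions, not the authors' architecture: the function names `cross_attention` and `denoiser_pass` are hypothetical, learned projections, self-attention, feed-forward layers, and timestep embeddings are omitted, and starting with image tokens on even layers is an arbitrary choice.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cross_attention(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product cross-attention, no learned projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended = [
            sum(w / total * c[i] for w, c in zip(weights, context))
            for i in range(dim)
        ]
        outputs.append(attended)
    return outputs


def denoiser_pass(
    waypoint_tokens: List[Vector],
    image_tokens: List[Vector],
    text_tokens: List[Vector],
    num_layers: int = 4,
) -> Tuple[List[Vector], List[str]]:
    """Alternate the conditioning source layer by layer (image, text, ...)."""
    hidden = waypoint_tokens
    schedule = []
    for layer in range(num_layers):
        use_image = layer % 2 == 0  # assumption: image tokens come first
        context = image_tokens if use_image else text_tokens
        schedule.append("image" if use_image else "text")
        attended = cross_attention(hidden, context)
        # Residual connection keeps the noisy waypoint information flowing.
        hidden = [[h + a for h, a in zip(hv, av)] for hv, av in zip(hidden, attended)]
    return hidden, schedule


# Toy inputs: 2 noisy waypoint tokens, 3 image tokens, 1 text token (dim 2).
tokens = [[0.1, 0.2], [0.3, 0.4]]
image = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [[0.2, 0.8]]
out, schedule = denoiser_pass(tokens, image, text)
print(schedule)
```

Alternating the two sources lets each layer attend to a single modality at a time, so the waypoint tokens see both vision and language over the depth of the network without concatenating the two token sequences.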