Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma

TL;DR

Affordance-R1 introduces a GRPO-based reinforcement learning framework to endow multimodal LLMs with explicit, test-time affordance reasoning. A novel ReasonAff dataset supports reasoning-oriented instruction tuning without supervised reasoning data, while a two-stage architecture enables grounding via bounding boxes/points followed by mask generation. The key novelty lies in a composite reward (format, perception, and recognition) plus a rethinking mechanism that yields robust zero-shot generalization and emergent reasoning capabilities, surpassing strong baselines on ReasonAff and OOD datasets. The work demonstrates the potential for purely RL-driven, reasoning-enabled affordance grounding in embodied perception, and provides code and data to foster further development in real-world scenarios.

Abstract

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordances shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, which limits their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code and dataset are released at https://github.com/hq-King/Affordance-R1.
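
The abstract names GRPO but does not spell out how it scores samples. As a minimal sketch of the group-relative idea only: each sampled response is scored against the other responses drawn for the same query, so no separate value model is needed. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scalar rewards for the responses sampled for ONE query.
    eps: hypothetical stabilizer that avoids division by zero when all
         rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to the same query, with made-up rewards.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

Responses that beat their group's mean receive a positive advantage and the rest a negative one, which is the signal the policy update then amplifies or suppresses.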

Paper Structure

This paper contains 34 sections, 4 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Affordance-R1 demonstrates extraordinary affordance reasoning ability and powerful generalization ability.
  • Figure 2: Affordance reasoning instruction generation and comparison. (a) Comparison between grounding-based and reasoning-based instructions. Instruction A directly asks for the location of the faucet handle (simple grounding), while Instruction B asks how to interact with the faucet in order to open it (requires reasoning). (b) Pipeline for generating affordance reasoning instructions: GPT-4o rewrites the original instructions based on exo images, HOI images, and system prompts with guidelines for diversity, daily tasks, and leakage avoidance. The prompt used and the statistics of ReasonAff are given in the Appendix.
  • Figure 3: Comparison of instructions and reasoning outputs between ReasonAff and Instruct-Part datasets on the same images.
  • Figure 4: Affordance-R1 framework overview. The model processes queries through policy-based reasoning with <think> and <rethink> stages to generate affordance predictions. The policy optimization uses a sophisticated reward system comprising (a) format rewards for reasoning structure, (b) perception rewards for spatial accuracy (Box-Num, IoU, L1), and (c) recognition rewards for semantic similarity, enabling effective GRPO-based training for affordance reasoning.
  • Figure 5: Qualitative Comparison of Affordance Reasoning
  • ...and 9 more figures
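
The reward structure described in the Figure 4 caption (format, perception, and recognition terms) can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' implementation. The `<answer>` tag, the thresholds `iou_thr` and `l1_thr`, the index-wise pairing of boxes, the token-overlap stand-in for semantic similarity, and the unweighted sum are all assumptions.

```python
import re

# Assumed output layout. Only the <think> and <rethink> stages come from the
# caption; the <answer> tag is hypothetical.
FORMAT_RE = re.compile(
    r"^\s*<think>.+?</think>\s*<rethink>.+?</rethink>\s*<answer>.+?</answer>\s*$",
    re.DOTALL,
)


def format_reward(text):
    """1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(text) else 0.0


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def perception_reward(pred_boxes, gt_boxes, iou_thr=0.5, l1_thr=10.0):
    """Box-Num, IoU, and L1 terms; boxes are paired by index (an assumption)."""
    num_ok = 1.0 if len(pred_boxes) == len(gt_boxes) else 0.0
    pairs = list(zip(pred_boxes, gt_boxes))
    if not pairs:
        return num_ok
    # Fraction of pairs whose overlap exceeds the (hypothetical) IoU threshold.
    iou_ok = sum(box_iou(p, g) > iou_thr for p, g in pairs) / len(pairs)
    # Fraction of pairs whose mean per-coordinate L1 error is below l1_thr.
    l1_ok = sum(
        sum(abs(pc - gc) for pc, gc in zip(p, g)) / 4 < l1_thr for p, g in pairs
    ) / len(pairs)
    return num_ok + iou_ok + l1_ok


def recognition_reward(pred_label, gt_label):
    """Token-overlap (Jaccard) score, a simple stand-in for semantic similarity."""
    p, g = set(pred_label.lower().split()), set(gt_label.lower().split())
    return len(p & g) / len(p | g) if p | g else 0.0


def total_reward(text, pred_boxes, gt_boxes, pred_label, gt_label):
    """Unweighted sum of the three terms; the weighting is an assumption."""
    return (
        format_reward(text)
        + perception_reward(pred_boxes, gt_boxes)
        + recognition_reward(pred_label, gt_label)
    )
```

A scalar produced this way for each sampled response is what a GRPO-style trainer would then normalize within its sampling group.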