RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation

Xiaoshuai Hao; Yingbo Tang; Lingfeng Zhang; Yanbiao Ma; Yunfeng Diao; Ziyu Jia; Wenbo Ding; Hangjun Ye; Long Chen

TL;DR

RoboAfford++ is a large-scale, generative AI-enhanced dataset that unifies object affordances (recognition and prediction) with spatial affordances for robotic manipulation and navigation, addressing the lack of fine-grained grounding of actionable regions. The authors pair it with RoboAfford-Eval, a 338-question benchmark for assessing affordance reasoning in real-world scenarios. They also propose RoboAfford-Qwen++, a multimodal model fine-tuned on the dataset that leverages depth for 3D grounding and outperforms existing vision-language models on both manipulation and navigation tasks. The results show gains in affordance reasoning and real-world execution, highlighting the dataset's practical value for grounding high-level planning in actionable robot behavior.
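
To make the phrase "leverages depth for 3D grounding" concrete, the sketch below shows one common way a predicted 2D affordance point can be lifted into 3D: back-projection through a pinhole camera model. This is an illustration under stated assumptions, not the authors' pipeline. The function name `deproject_point`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the sample values are all hypothetical, and the summary does not say how RoboAfford-Qwen++ actually consumes depth.

```python
def deproject_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) into camera-frame coordinates (metres).

    Assumes an ideal pinhole camera without lens distortion:
    fx, fy are focal lengths in pixels, (cx, cy) is the principal point,
    and depth_m is the depth along the optical axis at that pixel.
    """
    x = (u - cx) * depth_m / fx  # horizontal offset from the optical axis
    y = (v - cy) * depth_m / fy  # vertical offset from the optical axis
    return (x, y, depth_m)


# Hypothetical example: a predicted grasp point at pixel (380, 240),
# observed at a depth of 2.0 m by a camera with made-up intrinsics.
grasp_xyz = deproject_point(380, 240, 2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(grasp_xyz)
```

A point obtained this way is expressed in the camera frame; a real system would still need the camera-to-robot transform before the point could serve as a manipulation or navigation target.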

Abstract

Robotic manipulation and navigation are fundamental capabilities of embodied intelligence, enabling robots to interact effectively with the physical world. Achieving these capabilities requires a cohesive understanding of the environment, including object recognition to localize target objects, object affordances to identify potential interaction areas, and spatial affordances to discern optimal areas for both object placement and robot movement. While Vision-Language Models (VLMs) excel at high-level task planning and scene understanding, they often struggle to infer actionable positions for physical interaction, such as functional grasping points and permissible placement regions. This limitation stems from the lack of fine-grained annotations for object and spatial affordances in their training datasets. To tackle this challenge, we introduce RoboAfford++, a generative AI-enhanced dataset for multimodal affordance learning in both robotic manipulation and navigation. Our dataset comprises 869,987 images paired with 2.0 million question-answering (QA) annotations, covering three critical tasks: object affordance recognition, which identifies target objects based on attributes and spatial relationships; object affordance prediction, which pinpoints functional parts for manipulation; and spatial affordance localization, which identifies free space for object placement and robot navigation. Complementing this dataset, we propose RoboAfford-Eval, a comprehensive benchmark for assessing affordance-aware prediction in real-world scenarios, featuring 338 meticulously annotated samples across the same three tasks. Extensive experiments reveal the deficiencies of existing VLMs in affordance learning, while fine-tuning on the RoboAfford++ dataset significantly enhances their ability to reason about object and spatial affordances, validating the dataset's effectiveness.
