YOLOA: Real-Time Affordance Detection via LLM Adapter

Yuqi Ji, Junjie Ke, Lihuo He, Jun Liu, Kaifan Zhang, Yu-Kun Lai, Guiguang Ding, Xinbo Gao

TL;DR

This work tackles the joint perception of 'what-where-how' by unifying object detection and affordance learning in a single real-time framework. It introduces YOLOA, a YOLOv11-based detector augmented with a language-guided LLM Adapter that refines class priors, box offsets, and affordance gates during training, along with a lightweight YOLOA-light variant for fast inference. On the relabeled ADG-Det and IIT-Heat benchmarks, YOLOA reports state-of-the-art accuracy (52.8 and 73.1 mAP, respectively) at real-time speeds (up to 89.77 FPS, and up to 846.24 FPS for the lightweight variant). Ablations support the mutual enhancement between the two branches and the utility of the language-guided refinements. The framework targets practical, semantically aware affordance reasoning in embodied AI, with potential applications in human-robot interaction and real-time robotic manipulation.

Abstract

Affordance detection aims to jointly address the fundamental "what-where-how" challenge in embodied AI by understanding "what" an object is, "where" the object is located, and "how" it can be used. However, most affordance learning methods focus solely on "how" objects can be used while neglecting the "what" and "where" aspects. Other affordance detection methods treat object detection and affordance learning as two independent tasks, lacking effective interaction and real-time capability. To overcome these limitations, we introduce YOLO Affordance (YOLOA), a real-time affordance detection model that jointly handles these two tasks via a large language model (LLM) adapter. Specifically, YOLOA employs a lightweight detector consisting of object detection and affordance learning branches refined through the LLM Adapter. During training, the LLM Adapter interacts with object and affordance preliminary predictions to refine both branches by generating more accurate class priors, box offsets, and affordance gates. Experiments on our relabeled ADG-Det and IIT-Heat benchmarks demonstrate that YOLOA achieves state-of-the-art accuracy (52.8 / 73.1 mAP on ADG-Det / IIT-Heat) while maintaining real-time performance (up to 89.77 FPS, and up to 846.24 FPS for the lightweight variant). This indicates that YOLOA achieves an excellent trade-off between accuracy and efficiency.
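
The abstract names three refinement signals (class priors, box offsets, and affordance gates) but does not state how they are combined with the preliminary predictions. The following is a minimal sketch under explicit assumptions, not the authors' implementation: priors are added to class logits, offsets are added to box coordinates, and gates scale affordance probabilities through a sigmoid. The function `refine_predictions` and all of its parameter names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine_predictions(class_logits, box_xyxy, afford_logits,
                       class_prior, box_offset, afford_gate_logits):
    """Hypothetical fusion of adapter outputs with preliminary predictions.

    Assumptions (not taken from the paper):
      - class_prior is an additive bias on the class logits,
      - box_offset is an additive shift on (x1, y1, x2, y2),
      - afford_gate_logits pass through a sigmoid and scale each
        affordance probability multiplicatively.
    """
    # Class priors: bias the logits, then normalize to probabilities.
    biased = [l + p for l, p in zip(class_logits, class_prior)]
    class_probs = softmax(biased)

    # Box offsets: shift each coordinate of the preliminary box.
    refined_box = [c + d for c, d in zip(box_xyxy, box_offset)]

    # Affordance gates: a gate near 1 keeps a response, near 0 suppresses it.
    refined_afford = [sigmoid(a) * sigmoid(g)
                      for a, g in zip(afford_logits, afford_gate_logits)]

    return class_probs, refined_box, refined_afford


if __name__ == "__main__":
    probs, box, afford = refine_predictions(
        class_logits=[1.0, 1.0],            # two equally likely classes
        box_xyxy=[10.0, 20.0, 50.0, 80.0],  # preliminary box
        afford_logits=[0.0, 0.0],           # two affordance channels
        class_prior=[0.5, 0.0],             # prior favors class 0
        box_offset=[1.0, -1.0, 0.0, 2.0],
        afford_gate_logits=[4.0, -4.0],     # keep channel 0, suppress channel 1
    )
    print(probs, box, afford)
```

In this toy call, the prior breaks the tie between the two classes in favor of class 0, the box is shifted coordinate by coordinate, and the second affordance channel is suppressed by its gate. Since the abstract describes the adapter as interacting with predictions during training, how (or whether) such signals are used at inference is left open here.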

Paper Structure

This paper contains 23 sections, 11 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison between Affordance Learning and Affordance Detection. The user requests a suitable tool to "cut" the paper. Robot A in (a) only identifies functional regions without recognizing object categories, making it unable to decide which tool to hand over. In contrast, Robot B in (b) integrates object categories, spatial locations, and affordances to accurately locate and identify the correct tool, enabling successful task completion.
  • Figure 2: An overview of the proposed YOLOA. The backbone produces object and affordance predictions, which are integrated through a language-guided adapter. The LLM Adapter enhances both object detection and affordance learning branches through three semantic refinements, namely class priors, box offsets, and affordance gates.
  • Figure 3: Qualitative visualization on the ADG-Det dataset. Each column shows the predictions of a different method (including the input image and ground truth) for a given affordance category, visualized under exocentric (top) and egocentric (bottom) views.
  • Figure 4: The t-SNE visualization on the IIT-Heat dataset, comparing (a) the model without the LLM Adapter and (b) the full configuration to illustrate differences in feature distribution.
  • Figure 5: Qualitative visualization on the IIT-Heat dataset. The affordance categories for each image are listed on the left, with label colors matching those in the corresponding affordance masks on the right. (Each image may include multiple objects and affordances.)
  • ...and 4 more figures