AffordanceLLM: Grounding Affordance from Vision Language Models
Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li
TL;DR
Affordance grounding from a single image requires integrating detection, localization, 3D geometry, and human-object interaction reasoning. The authors propose AffordanceLLM, which builds on a Vision-Language Model backbone (LLaVA-7B) with a mask-token decoder and augments the RGB input with pseudo-depth to generate dense affordance maps, enabling open-world generalization. Evaluated on AGD20K, AffordanceLLM substantially outperforms state-of-the-art baselines, especially on the hard/unseen splits, and shows non-trivial generalization to random Internet images and novel actions. The approach highlights the value of importing world knowledge from VLMs and adding geometric cues for robust affordance reasoning, with potential benefits for robotics and embodied AI, while also raising questions about safe deployment and potential misuse.
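To make the input and output described above concrete, the following is a minimal sketch only, not the authors' implementation. It assumes the pseudo-depth map is simply stacked onto the RGB image as a fourth channel and that the decoder emits per-pixel logits that a sigmoid turns into a dense affordance map. The actual model may fuse depth differently (for example at the feature level), and the function names `fuse_rgb_and_depth` and `logits_to_affordance_map` are hypothetical.

```python
import numpy as np


def fuse_rgb_and_depth(rgb, depth):
    """Stack an RGB image (H, W, 3) and a pseudo-depth map (H, W) into (H, W, 4).

    Assumption: channel stacking stands in for whatever fusion the model uses.
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share height and width")
    # Rescale depth to [0, 1] so it is on a similar scale to normalized RGB.
    d = depth.astype(np.float32)
    span = d.max() - d.min()
    d = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    rgb01 = rgb.astype(np.float32) / 255.0
    return np.concatenate([rgb01, d[..., None]], axis=-1)


def logits_to_affordance_map(logits):
    """Convert per-pixel decoder logits (H, W) into a dense map in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-logits))
```

Under these assumptions, a caller would pass the fused array to the backbone and read the affordance map as a per-pixel score of how likely each location supports the queried action.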
Abstract
Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires a comprehensive understanding of a scene in multiple aspects, including the detection, localization, and recognition of objects and their parts; the geo-spatial configuration and layout of the scene; 3D shapes and physics; and the functionality and potential interactions of objects and humans. Much of this knowledge is hidden and lies beyond what the image content and the supervised labels of a limited training set can provide. In this paper, we attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge in pretrained large-scale vision language models. On the AGD20K benchmark, our proposed model demonstrates a significant performance gain over competing methods for in-the-wild object affordance grounding. We further demonstrate that it can ground affordance for objects from random Internet images, even if both the objects and the actions are unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/
