AffordanceLLM: Grounding Affordance from Vision Language Models

Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li

TL;DR

Affordance grounding from a single image requires integrating detection, localization, 3D geometry, and human-object interaction reasoning. The authors propose AffordanceLLM, which builds on a vision-language model backbone (LLaVA-7B), adds a mask-token decoder, and augments the RGB input with pseudo-depth to generate dense affordance maps that generalize to open-world inputs. On AGD20K, AffordanceLLM substantially outperforms state-of-the-art baselines, especially on hard/unseen splits, and shows non-trivial generalization to random Internet images and novel actions. The approach highlights the value of combining world knowledge from VLMs with geometric cues for robust affordance reasoning, with potential benefits for robotics and embodied AI, alongside considerations about safe deployment and potential misuse.

Abstract

Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires a comprehensive understanding of a scene in multiple aspects: detection, localization, and recognition of objects and their parts; the geo-spatial configuration/layout of the scene; 3D shapes and physics; and the functionality and potential interaction of objects and humans. Much of this knowledge is hidden and goes beyond what the image content and the supervised labels of a limited training set can provide. In this paper, we attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge in pretrained large-scale vision language models. On the AGD20K benchmark, our proposed model demonstrates a significant performance gain over competing methods for in-the-wild object affordance grounding. We further demonstrate that it can ground affordance for objects from random Internet images, even if both objects and actions are unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/

Paper Structure

This paper contains 16 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: State-of-the-art vision language models, such as LLaVA [liu2023llava], have rich human-object-interaction knowledge, thanks to large-scale text pretraining. Given a question about how to interact with an object, such a model typically gives a reasonable answer.
  • Figure 2: Overview of AffordanceLLM. The inputs of our model include a single image and a text prompt related to interaction. We use OWL-ViT [minderer2022simple] as the image encoder to generate image features and project them into the same hidden dimension as the large language model. We also use a tokenizer to encode the text prompt. The text features and image features are concatenated and fed into the LLM. The LLM is fine-tuned to predict a special token, which is used as a query to the mask decoder to generate the final affordance map.
  • Figure 3: Qualitative results on the test set of the hard split. LOCATE-Sup fails to learn a reasonable affordance map due to limited training data. LOCATE [li2023locate] typically predicts an affordance map that covers the whole object. 3DOI [qian2023understanding] focuses on a small area of the object. Overall, our approach produces the best-quality affordance predictions.
  • Figure 4: Ablation of different text prompts and depth. "Ours w/o depth" is our approach without pseudo-depth as an additional input. "Ours" is our full approach. We find that constructing the correct text prompt typically helps our model focus on the correct area. We believe this is because the correct text prompt activates the affordance-related world knowledge embedded in the VLM.
  • Figure 5: Generalization results on random Internet images. We show the most similar objects in the training set to demonstrate how different the test objects are from the training ones. (Rows 1, 2): AffordanceLLM generalizes to novel objects from random Internet images, while LOCATE [li2023locate] fails. (Rows 3, 4): AffordanceLLM generalizes to novel actions plus novel objects. LOCATE cannot infer novel actions, so its result is left blank.
  • ...and 1 more figures
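
The data flow described in the Figure 2 caption can be made concrete with a small shape-level sketch. This is not the authors' implementation: the encoder, LLM, and mask decoder are replaced by random NumPy stand-ins, and every name and size (`project_image_features`, `toy_llm`, `mask_decoder`, `HIDDEN_DIM`, the 4x4 patch grid) is a hypothetical chosen for illustration. It also assumes the special token sits at the last sequence position and that the decoder is a simple dot product between the token query and the patch features, which the caption does not specify.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
GRID = 4                      # image split into a 4x4 grid of patches
NUM_PATCHES = GRID * GRID
ENC_DIM = 32                  # width of the (stand-in) image encoder features
HIDDEN_DIM = 64               # hidden width of the (stand-in) LLM
NUM_TEXT_TOKENS = 5           # length of the tokenized text prompt


def project_image_features(image_feats, w_proj):
    """Project image features into the LLM hidden dimension."""
    return image_feats @ w_proj


def toy_llm(tokens, w_llm):
    """Stand-in for the fine-tuned LLM.

    Mixes information across positions and applies one linear layer.
    Returns the hidden state of the last position, which this sketch
    treats as the special mask token (an assumption).
    """
    mixed = tokens + tokens.mean(axis=0, keepdims=True)
    hidden = np.tanh(mixed @ w_llm)
    return hidden[-1]


def mask_decoder(query, image_tokens, grid):
    """Stand-in decoder: score each patch against the query, then sigmoid."""
    logits = image_tokens @ query / np.sqrt(query.shape[0])
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs.reshape(grid, grid)


def run_sketch(seed=0):
    rng = np.random.default_rng(seed)
    # Random placeholders for encoder output, text embeddings, and weights.
    image_feats = rng.normal(size=(NUM_PATCHES, ENC_DIM))
    text_feats = rng.normal(size=(NUM_TEXT_TOKENS, HIDDEN_DIM))
    w_proj = rng.normal(scale=0.1, size=(ENC_DIM, HIDDEN_DIM))
    w_llm = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))

    image_tokens = project_image_features(image_feats, w_proj)
    # Concatenate image and text features into one input sequence.
    sequence = np.concatenate([image_tokens, text_feats], axis=0)
    query = toy_llm(sequence, w_llm)
    affordance_map = mask_decoder(query, image_tokens, GRID)
    return sequence, affordance_map


if __name__ == "__main__":
    seq, amap = run_sketch()
    print("sequence shape:", seq.shape)
    print("affordance map shape:", amap.shape)
```

The point of the sketch is only the interface: a single query vector produced by the language model is enough to turn per-patch image features into a dense map, which is why the text prompt can steer where the map fires.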