Which objects help me to act effectively? Reasoning about physically-grounded affordances

Anne Kemmeren, Gertjan Burghouts, Michael van Bekkum, Wouter Meijer, Jelle van Mil

TL;DR

The paper tackles open-world affordance detection for robots by integrating a Socratic dialogue between an LLM and a vision-language model, grounding reasoning in robot embodiment and physical object properties. This open-vocabulary framework reasons about which objects enable specific actions toward a goal and how the intended effects constrain object choice. Finetuning the VLM on physical properties substantially improves property detection and generalization to unseen objects, leading to more efficient retrieval of suitable objects among distractors. The approach demonstrates practical viability on a real SPOT robot and highlights the potential to enhance task-driven perception and planning in open environments, while outlining avenues to handle mixed-material objects and richer output formats.

Abstract

For effective interactions with the open world, robots should understand how interactions with known and novel objects help them towards their goal. A key aspect of this understanding lies in detecting an object's affordances, which represent the potential effects that can be achieved by manipulating the object in various ways. Our approach leverages a dialogue of large language models (LLMs) and vision-language models (VLMs) to achieve open-world affordance detection. Given open-vocabulary descriptions of intended actions and effects, the useful objects in the environment are found. By grounding our system in the physical world, we account for the robot's embodiment and the intrinsic properties of the objects it encounters. In our experiments, we have shown that our method produces tailored outputs based on different embodiments or intended effects. The method was able to select a useful object from a set of distractors. Finetuning the VLM for physical properties improved overall performance. These results underline the importance of grounding the affordance search in the physical world, by taking into account robot embodiment and the physical properties of objects.
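As a rough illustration of the idea in the abstract, the sketch below ranks candidate objects by how well their detected physical properties match the properties that an action, intended effect, and embodiment call for. It is a toy under stated assumptions, not the authors' implementation: `required_properties` stands in for the LLM side of the dialogue with a hard-coded lookup, the property sets on each `DetectedObject` stand in for VLM output, and all names, property labels, and the overlap score are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedObject:
    """An object with physical properties, as a VLM might report them (assumed format)."""
    name: str
    properties: frozenset


def required_properties(action, effect, embodiment):
    """Stand-in for the LLM step: map (action, effect, embodiment) to needed properties.

    The table is a hypothetical placeholder; a real system would query an LLM.
    """
    table = {
        ("climb", "reach a better viewpoint", "quadruped"): {"rigid", "stable", "low"},
    }
    return frozenset(table.get((action, effect, embodiment), set()))


def rank_objects(objects, required):
    """Score each object by the fraction of required properties it has, best first."""
    scored = []
    for obj in objects:
        score = len(obj.properties & required) / len(required) if required else 0.0
        scored.append((score, obj.name))
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


scene = [
    DetectedObject("wooden bench", frozenset({"rigid", "stable", "low", "wooden"})),
    DetectedObject("cardboard box", frozenset({"low", "lightweight"})),
    DetectedObject("metal stool", frozenset({"rigid", "metal", "tall"})),
]
needed = required_properties("climb", "reach a better viewpoint", "quadruped")
ranking = rank_objects(scene, needed)
```

Changing the embodiment or the intended effect changes `needed`, and with it the ranking, which mirrors the claim that outputs are tailored to embodiment and effect.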

Paper Structure

This paper contains 17 sections, 6 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: A Socratic dialogue enriched with a physically grounded VLM can reason about relevant objects for the given task and action.
  • Figure 2: The dialogue between an LLM (left) and VLM (right) reasons about what object in the given scene would give the quadruped robot the ability to climb to a better viewpoint.
  • Figure 3: Improvements: After adaptation, the VLM's prediction of objects and properties is improved (top row).
  • Figure 4: Object properties: Correcting the wrong properties for several object classes.
  • Figure 5: Errors: Remaining errors, such as a Metal Bench that is confused with a Wooden Bench (third image) and a Metal Stool prediction (right) for what is a Wooden Stool with Metal legs.
  • ...and 3 more figures