Which objects help me to act effectively? Reasoning about physically-grounded affordances
Anne Kemmeren, Gertjan Burghouts, Michael van Bekkum, Wouter Meijer, Jelle van Mil
TL;DR
The paper tackles open-world affordance detection for robots by integrating a Socratic dialogue between a large language model (LLM) and a vision-language model (VLM), grounding reasoning in robot embodiment and physical object properties. This open-vocabulary framework reasons about which objects enable specific actions toward a goal and how the intended effects constrain object choice. Finetuning the VLM on physical properties substantially improves property detection and generalization to unseen objects, leading to more efficient retrieval of suitable objects among distractors. The approach is shown to be practically viable on a real SPOT robot and has the potential to enhance task-driven perception and planning in open environments. The paper also outlines avenues for handling mixed-material objects and richer output formats.
Abstract
For effective interactions with the open world, robots should understand how interactions with known and novel objects help them towards their goal. A key aspect of this understanding lies in detecting an object's affordances, which represent the potential effects that can be achieved by manipulating the object in various ways. Our approach leverages a dialogue between large language models (LLMs) and vision-language models (VLMs) to achieve open-world affordance detection. Given open-vocabulary descriptions of intended actions and effects, the method finds the useful objects in the environment. By grounding our system in the physical world, we account for the robot's embodiment and the intrinsic properties of the objects it encounters. In our experiments, the method produced outputs tailored to different embodiments or intended effects, and it was able to select a useful object from a set of distractors. Finetuning the VLM for physical properties improved overall performance. These results underline the importance of grounding the affordance search in the physical world, by taking into account robot embodiment and the physical properties of objects.
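The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the LLM is replaced by a hypothetical lookup table that maps an intended effect to required physical properties, the VLM output is replaced by hand-written property dictionaries, and the embodiment is reduced to a single assumed payload limit. All names (`DetectedObject`, `Embodiment`, `required_properties`, `useful_objects`) and all property keys are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    name: str
    # Hypothetical stand-in for VLM property predictions,
    # e.g. {"rigid": True, "weight_kg": 0.4}.
    properties: dict


@dataclass
class Embodiment:
    # Assumed simplification: the embodiment is one payload limit.
    max_payload_kg: float


def required_properties(effect: str) -> dict:
    """Stand-in for the LLM step: intended effect -> required properties."""
    table = {
        "hammer a nail": {"rigid": True, "hard": True},
        "wipe a spill": {"absorbent": True},
    }
    return table[effect]


def useful_objects(objects, effect, robot):
    """Keep objects the robot can lift and whose properties match the effect."""
    needed = required_properties(effect)
    selected = []
    for obj in objects:
        # Objects with unknown weight are treated as too heavy to lift.
        weight = obj.properties.get("weight_kg", float("inf"))
        liftable = weight <= robot.max_payload_kg
        matches = all(obj.properties.get(k) == v for k, v in needed.items())
        if liftable and matches:
            selected.append(obj.name)
    return selected


scene = [
    DetectedObject("stone", {"rigid": True, "hard": True, "weight_kg": 1.0}),
    DetectedObject("sponge", {"rigid": False, "absorbent": True, "weight_kg": 0.05}),
    DetectedObject("anvil", {"rigid": True, "hard": True, "weight_kg": 40.0}),
]

print(useful_objects(scene, "hammer a nail", Embodiment(max_payload_kg=5.0)))
print(useful_objects(scene, "wipe a spill", Embodiment(max_payload_kg=5.0)))
```

Changing the assumed payload limit changes which objects are returned for the same effect, which mirrors, in a very simplified way, the idea that the output depends on both the embodiment and the intended effect.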
