Affordance Perception by a Knowledge-Guided Vision-Language Model with Efficient Error Correction

Gertjan Burghouts, Marianne Schaaphok, Michael van Bekkum, Wouter Meijer, Fieke Hillerström, Jelle van Mil

TL;DR

This work tackles open-world robotic affordance understanding, where fine-grained object-action distinctions (e.g., doorknob vs. handle) are needed for actionable manipulation. It presents a modular pipeline that couples a TypeDB-based knowledge graph of affordances with a vision-language detector (GLIP/GLIPv2) and a neuro-symbolic spatial-reasoning module, augmented by sparse human-in-the-loop feedback for rapid label corrections. The contribution is threefold: a knowledge-base affordance representation, VLM prompting to generalize to unseen objects, and an efficient human-in-the-loop feedback mechanism to restore fine-grained discrimination. Experiments in office-building scenarios show substantial improvement in localization and labeling performance (mAP) when using relabeling and spatial reasoning, enabling more reliable action selection (e.g., pushing vs. turning) for door-related tasks. This approach reduces annotation burden while improving open-world robot autonomy in manipulation tasks.

Abstract

Mobile robot platforms will increasingly be tasked with activities that involve grasping and manipulating objects in open-world environments. Affordance understanding provides a robot with the means to realise its goals and execute its tasks, e.g. to achieve autonomous navigation in unknown buildings where it has to find doors and ways to open them. To get actionable suggestions, robots need to be able to distinguish subtle differences between objects, as these may result in different action sequences: doorknobs require grasp and twist, while handlebars require grasp and push. In this paper, we improve affordance perception for a robot in an open-world setting. Our contribution is threefold: (1) We provide an affordance representation with precise, actionable affordances; (2) We connect this knowledge base to a foundational vision-language model (VLM) and prompt the VLM for a wider variety of new and unseen objects; (3) We apply a human-in-the-loop for corrections on the output of the VLM. The mix of affordance representation, image detection and a human-in-the-loop is effective for a robot to search for objects to achieve its goals. We have demonstrated this in a scenario of finding various doors and the many different ways to open them.
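The interplay of the three contributions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary stands in for the knowledge base, the detections are hand-written stand-ins for VLM output, and all names (`AFFORDANCES`, `Detection`, `build_prompt`, `apply_corrections`, `suggest_actions`) as well as the " . " prompt separator are assumptions made for illustration. Only the two object-action pairs are taken from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical stand-in for the affordance knowledge base: object label -> action sequence.
# The two entries follow the abstract's examples; the real system queries a knowledge graph.
AFFORDANCES: Dict[str, Tuple[str, ...]] = {
    "doorknob": ("grasp", "twist"),
    "handlebar": ("grasp", "push"),
}


@dataclass
class Detection:
    label: str                               # label proposed by the detector
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
    score: float                             # detector confidence


def build_prompt(kb: Dict[str, Tuple[str, ...]]) -> str:
    """Join the known object labels into one text prompt (separator is an assumption)."""
    return " . ".join(sorted(kb))


def apply_corrections(detections: List[Detection],
                      corrections: Dict[int, str]) -> List[Detection]:
    """Relabel detections from sparse human feedback: detection index -> corrected label."""
    return [Detection(corrections.get(i, d.label), d.box, d.score)
            for i, d in enumerate(detections)]


def suggest_actions(detections: List[Detection],
                    kb: Dict[str, Tuple[str, ...]]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Look up the action sequence for every detection whose label is in the knowledge base."""
    return [(d.label, kb[d.label]) for d in detections if d.label in kb]


# Hand-written detections standing in for VLM output; the second one is mislabeled.
detections = [
    Detection("doorknob", (10.0, 20.0, 40.0, 60.0), 0.71),
    Detection("doorknob", (200.0, 80.0, 320.0, 110.0), 0.64),
]
corrected = apply_corrections(detections, {1: "handlebar"})  # one human correction
print(build_prompt(AFFORDANCES))
print(suggest_actions(corrected, AFFORDANCES))
```

In this sketch a single corrected label changes the suggested action sequence for the second detection from grasp-and-twist to grasp-and-push, which is the kind of fine-grained distinction the abstract argues a robot needs before it can act.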

Paper Structure (16 sections, 6 equations, 7 figures)

Figures (7)

  • Figure 1: Affordance detection architecture
  • Figure 2: Basic affordance representation using three relations: effect relation, affordance relation and action relation
  • Figure 3: Affordance representation, where the robot SPOT gives a cup to another person, with the effect that this person holds the cup.
  • Figure 4: Standard GLIP is incapable of fine-grained discrimination of various door openers.
  • Figure 5: In the overview of all detected objects, the main classes can be identified quickly, as shown by the new labels that were assigned to them by a user.
  • ...and 2 more figures