Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs

Angelos Mavrogiannis, Dehao Yuan, Yiannis Aloimonos

TL;DR

This paper tackles grounding natural language instructions to the physical world, with emphasis on non-visual object attributes such as weight. It introduces a perception-action API that combines VLMs, LLMs, and robot control functions to generate executable programs for active attribute detection, enabling embodied reasoning beyond static vision. The authors demonstrate improved grounding on spatial and weight-related tasks through offline, simulated, and real-robot experiments, including an end-to-end demonstration on a DJI RoboMaster EP. The work advances embodied attribute detection by integrating visual reasoning, language-based planning, and active perception into a cohesive framework with practical robotic applications.

Abstract

There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2-THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.
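
To make the idea of a generated attribute-detection program concrete, the following is a minimal sketch in Python. It is not the authors' API: the class `MockPerceptionActionAPI`, its methods (`detect_objects`, `pick_up`, `put_down`, `read_force_sensor`), and the function `find_lightest_object` are all hypothetical names chosen for illustration, and the scene and force readings are hard-coded so the example runs without a robot, simulator, or model. It only illustrates the pattern the abstract describes: perception calls locate candidate objects, and robot actions then gather the non-visual measurement (here, weight) that a static image cannot provide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Detection:
    """A detected object: a label plus a bounding box (x, y, w, h)."""
    label: str
    box: Tuple[int, int, int, int]


class MockPerceptionActionAPI:
    """Hypothetical stand-in for a perception-action API.

    None of these method names are taken from the paper. A real system
    would back the perception call with a VLM or open-vocabulary detector
    and the action calls with robot control functions.
    """

    def __init__(self, scene_weights_newtons: Dict[str, float]):
        # Assumed ground truth for the mock scene: label -> weight in newtons.
        self._weights = scene_weights_newtons
        self._held: Optional[str] = None

    # Perception: pretend every object in the scene is detected.
    def detect_objects(self, image) -> List[Detection]:
        return [Detection(label, (0, 0, 1, 1)) for label in self._weights]

    # Action: grasp an object so that its weight loads the force sensor.
    def pick_up(self, obj: Detection) -> None:
        self._held = obj.label

    def put_down(self) -> None:
        self._held = None

    def read_force_sensor(self) -> float:
        """Return the vertical force in newtons; 0.0 when nothing is held."""
        if self._held is None:
            return 0.0
        return self._weights[self._held]


def find_lightest_object(api: MockPerceptionActionAPI, image) -> str:
    """A program of the kind an LLM might generate for the query
    'Which object is the lightest?' when prompted with the API above."""
    readings: Dict[str, float] = {}
    for obj in api.detect_objects(image):
        api.pick_up(obj)                           # active perception step
        readings[obj.label] = api.read_force_sensor()
        api.put_down()
    # The lightest object is the one with the smallest measured force.
    return min(readings, key=readings.get)


if __name__ == "__main__":
    api = MockPerceptionActionAPI({"mug": 3.1, "sponge": 0.2, "book": 5.0})
    print(find_lightest_object(api, image=None))   # prints: sponge
```

In this sketch the weight comparison relies only on measured force, not on visual appearance, which is the distinction the abstract draws between active perception and vanilla VLM grounding.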

Paper Structure

This paper contains 16 sections, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Demonstration of our perception-action API solving a minimum distance query on a real robot (left) and a minimum weight query in simulation (right). The LLM receives a perception-action API and a natural language query as input (top). It then generates code that invokes API functions leveraging on-board sensors (camera, distance sensor, force/torque sensor) to actively identify these attributes.
  • Figure 2: We describe our end-to-end framework for embodied attribute detection. The LLM receives as input a perception API with LLMs and VLMs as backbones, an action API based on a Robot Control API, a natural language (NL) instruction from a user, and a visual scene observation. It then produces a Python program that combines LLM and VLM function calls with robot actions to actively reason about attribute detection.
  • Figure 3: The accuracy of OVD (GLIP), VQA (BLIP-2), and VQA$+$GPT in determining the heaviest object in an image.
  • Figure 4: We compare the accuracy of OVD-only (GLIP) with that of OVD$+$GPT on our location (left) and size (right) datasets.
  • Figure 5: An example where OVD and VQA fail to identify the heaviest object in the image (✗), but the API prompt-generated code (VQA$+$GPT) returns the correct answer (✓).
  • ...and 1 more figure