LLMs for Robotic Object Disambiguation

Connie Jiang, Yiqing Xu, David Hsu

TL;DR

This paper investigates whether pre-trained large language models (LLMs) can aid robotic object disambiguation in complex tabletop scenes. Using few-shot prompt engineering, the authors enable the LLM to generate discriminative features and a decision-tree plan that resolves ambiguity, even with occlusions and large object sets. Zero-shot prompting suffices in simple scenes, but complex scenes require few-shot prompts; the method achieves 95.79% accuracy and outperforms the enumeration, human, and POMDP-ATTR baselines while asking fewer questions. The results indicate that LLM-driven planning can draw on common sense to infer features not explicitly stated and navigate decision trees efficiently, offering a scalable alternative to task-specific POMDPs in robotics, with potential to extend to more general user requests and multimodal perception. The approach also suggests a move from an $O(n)$ to a logarithmic query-efficiency regime in disambiguation tasks.
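
The query-efficiency claim can be made concrete with a small worked comparison. The sketch below is illustrative and not taken from the paper: it assumes an enumeration strategy that asks one yes/no question per object, and an idealized balanced decision tree in which every question splits the candidates into `branching` equal groups. The function names and the balanced-tree assumption are hypothetical.

```python
def enumeration_worst_case(n: int) -> int:
    """Worst-case questions when asking 'Is it this one?' for each object.

    Assumption: after n - 1 'no' answers the last object is known
    without asking, so the count grows linearly, O(n).
    """
    return max(n - 1, 0)


def balanced_tree_depth(n: int, branching: int) -> int:
    """Smallest depth d with branching**d >= n, i.e. ceil(log_branching(n)).

    Assumption: an idealized tree where each question splits the
    remaining candidates evenly. Integer arithmetic avoids float
    rounding errors in the logarithm.
    """
    if branching < 2:
        raise ValueError("branching must be at least 2")
    depth, reach = 0, 1
    while reach < n:
        reach *= branching
        depth += 1
    return depth


if __name__ == "__main__":
    for n in (4, 14, 100):
        print(n, enumeration_worst_case(n), balanced_tree_depth(n, 3))
```

Under these assumptions, a 14-object scene needs up to 13 enumeration questions but at most 3 questions with three-way splits. Real scenes rarely split evenly, so this is a bound on the idealized case rather than a prediction of the paper's measured question counts.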

Abstract

The advantages of pre-trained large language models (LLMs) are apparent in a variety of language processing tasks. But can a language model's knowledge be further harnessed to effectively disambiguate objects and navigate decision-making challenges within the realm of robotics? Our study reveals the LLM's aptitude for solving complex decision-making challenges that have traditionally been modeled as Partially Observable Markov Decision Processes (POMDPs). A pivotal focus of our research is the object disambiguation capability of LLMs. We detail the integration of an LLM into a tabletop environment disambiguation task, a decision-making problem where the robot's task is to discern and retrieve a user's desired object from an arbitrarily large and complex cluster of objects. Despite multiple query attempts with zero-shot prompt engineering (details can be found in the Appendix), the LLM struggled to inquire about features not explicitly provided in the scene description. In response, we have developed a few-shot prompt engineering system to improve the LLM's ability to pose disambiguating queries. The result is a model capable of both using given features when they are available and inferring new relevant features when necessary, so that it can generate and navigate a precise decision tree to the correct object, even when faced with identical options.

Paper Structure

This paper contains 12 sections, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Generalization (red): the user's request was "Give me something to eat", a generalized task specification that either the apple or the contents of the blue cup would satisfy. Occluding objects (blue): the toothbrush is in the way of the apple and must be removed before the apple can be retrieved.
  • Figure 2: Simple Scene Example with Zero-Shot Prompting. Scene description: "There are four cups in a line. Two are blue and two are green. They are of different sizes." In response, the model with zero-shot prompting proposes an action plan with two sequential questions, successfully completing the disambiguation task. The first asks about the user's color preference (<ask> <"Do you prefer a certain color for the cup?">), and the second determines the size of the target object (<ask> <"Do you want a large cup or a small one?">).
  • Figure 3: Complex Scene Example (fails with zero-shot planning, but succeeds with few-shot prompting). Scene description: "There are 14 plums stacked in a pyramid on the table. On the bottom of the pyramid is a three by three square arrangement of 9 plums. The second layer rests on top of the bottom layer of 9 plums and consists of a two by two square arrangement of 4 plums. Finally on the top of the pyramid, there is one plum that rests on top of the 4 plums of the second layer."
  • Figure 4: A visualization of a generated decision tree (originally in JSON format) from the trained model. The model’s initial disambiguation question seeks to identify the layer of the target object, reducing the options from any of the 14 apples to three distinct categories: apples in the bottom layer, apples in the middle layer, and the single apple in the top layer. If the user indicates the bottom layer, the model refines its inquiry by asking about the position within the layer (front row, middle row, or back row?). This narrowing continues until the target object is unambiguously identified.
  • Figure 5: Image samples from the 12 scenes used for experiments.
  • ...and 1 more figure
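
The decision tree described in Figure 4 can be sketched as a nested, JSON-serializable structure that is walked one answer at a time. This is a minimal sketch under stated assumptions, not the authors' format: the keys `question`, `branches`, and `object`, the column question, the object labels, and the helper names are all hypothetical. Only the layer question and the front/middle/back row question come from the caption.

```python
import json


def leaf(name):
    """A leaf node: the target object is unambiguously identified."""
    return {"object": name}


def grid(prefix, rows, cols):
    """Ask for the row, then (assumed follow-up) the column within that row."""
    return {
        "question": "Which row is it in?",
        "branches": {
            r: {
                "question": "Which column is it in?",
                "branches": {c: leaf(f"{prefix}-{r}-{c}") for c in cols},
            }
            for r in rows
        },
    }


# Pyramid of 14 objects: 3x3 bottom layer, 2x2 middle layer, 1 on top.
tree = {
    "question": "Which layer is it in?",
    "branches": {
        "bottom": grid("bottom", ["front", "middle", "back"],
                       ["left", "middle", "right"]),
        "middle": grid("middle", ["front", "back"], ["left", "right"]),
        "top": leaf("top"),
    },
}


def resolve(node, answers):
    """Follow the user's answers down the tree.

    Returns (object label, number of questions asked).
    """
    asked = 0
    answers = iter(answers)
    while "object" not in node:
        node = node["branches"][next(answers)]
        asked += 1
    return node["object"], asked


def count_leaves(node):
    """Number of distinct objects the tree can single out."""
    if "object" in node:
        return 1
    return sum(count_leaves(child) for child in node["branches"].values())


if __name__ == "__main__":
    print(json.dumps(tree, indent=2)[:200])
    print(resolve(tree, ["bottom", "front", "left"]))
```

In this sketch the tree covers all 14 objects, the top object is reached after one question, and any bottom-layer object is reached after three. Keeping the tree as plain dictionaries means it round-trips through `json.dumps` and `json.loads` unchanged, which matches the caption's note that the tree was originally in JSON format.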