LLMs for Robotic Object Disambiguation
Connie Jiang, Yiqing Xu, David Hsu
TL;DR
This paper investigates whether pre-trained large language models (LLMs) can aid robotic object disambiguation in complex tabletop scenes. Using few-shot prompt engineering, the authors enable the LLM to generate discriminative features and a decision-tree plan that resolves ambiguity, even with occlusions and large object sets. Zero-shot prompting suffices in simple scenes, but complex scenes require few-shot prompts. The method achieves 95.79% accuracy and outperforms the enumeration, human, and POMDP-ATTR baselines while reducing the number of questions required. The results demonstrate that LLM-driven planning can use common sense to infer features not explicitly stated and can navigate decision trees efficiently, offering a scalable alternative to task-specific POMDPs in robotics, with the potential to generalize to broader user requests and multimodal perception. The approach also suggests a shift from $O(n)$ queries toward a logarithmic number of queries in disambiguation tasks.
Abstract
The advantages of pre-trained large language models (LLMs) are apparent in a variety of language processing tasks. But can a language model's knowledge be further harnessed to effectively disambiguate objects and navigate decision-making challenges in robotics? Our study reveals the LLM's aptitude for solving complex decision-making challenges that have often been modeled as Partially Observable Markov Decision Processes (POMDPs). A pivotal focus of our research is the object disambiguation capability of LLMs. We detail the integration of an LLM into a tabletop environment disambiguation task, a decision-making problem in which the robot must discern and retrieve a user's desired object from an arbitrarily large and complex cluster of objects. Despite multiple query attempts with zero-shot prompt engineering (details can be found in the Appendix), the LLM struggled to inquire about features not explicitly provided in the scene description. In response, we developed a few-shot prompt engineering system to improve the LLM's ability to pose disambiguating queries. The result is a model capable of both using given features when they are available and inferring new relevant features when necessary, so that it can generate and navigate a precise decision tree to the correct object, even when faced with identical options.
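To make the decision-tree idea concrete, the following is a minimal sketch and not the authors' system. It replaces the LLM with a hand-written feature table and a greedy rule that asks the yes/no question splitting the remaining candidates most evenly. The names `best_question` and `disambiguate`, the feature names, and the toy scene are all hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Obj = Dict[str, str]  # hypothetical representation: feature name -> value


def best_question(candidates: List[Obj]) -> Optional[Tuple[str, str]]:
    """Return the (feature, value) whose 'yes' set is closest to half of the
    candidates, or None if no question separates them."""
    n = len(candidates)
    best, best_gap = None, None
    for feature in sorted({k for obj in candidates for k in obj}):
        values = sorted({obj[feature] for obj in candidates if feature in obj})
        for value in values:
            yes = sum(1 for obj in candidates if obj.get(feature) == value)
            if yes == 0 or yes == n:
                continue  # this question would not narrow anything down
            gap = abs(2 * yes - n)  # 0 means a perfect half/half split
            if best_gap is None or gap < best_gap:
                best, best_gap = (feature, value), gap
    return best


def disambiguate(scene: List[Obj], target: Obj) -> Tuple[List[Obj], List[str]]:
    """Walk down the implied decision tree, simulating the user's answers
    from `target`. Returns the remaining candidates and the questions asked."""
    candidates, asked = list(scene), []
    while len(candidates) > 1:
        question = best_question(candidates)
        if question is None:
            break  # identical objects: the given features cannot separate them
        feature, value = question
        answer = target.get(feature) == value  # simulated user reply
        asked.append(f"Is the {feature} {value}?")
        candidates = [o for o in candidates
                      if (o.get(feature) == value) == answer]
    return candidates, asked


# Toy scene (assumed): 8 objects covering every combination of 3 binary features.
scene = [dict(zip(("color", "position", "size"), combo))
         for combo in product(("red", "blue"), ("left", "right"), ("small", "large"))]
target = {"color": "red", "position": "left", "size": "small"}
remaining, questions = disambiguate(scene, target)
print(len(questions), "questions asked;", len(remaining), "candidate left")
```

In this toy scene every question halves the candidate set, so the number of questions grows with the logarithm of the number of objects, whereas asking about each object in turn grows linearly. When the listed features cannot separate two objects, the sketch simply stops with more than one candidate; that is the situation in which the abstract describes the LLM inferring new relevant features.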
