LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments

Ivy Xiao He, Stefanie Tellex, Jason Xinyu Liu

TL;DR

This work introduces LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search and explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location.

Abstract

To assist humans in open-world environments, robots must interpret ambiguous instructions to locate desired objects. Foundation model-based approaches excel at multimodal grounding, but they lack a principled mechanism for modeling uncertainty in long-horizon tasks. In contrast, Partially Observable Markov Decision Processes (POMDPs) provide a systematic framework for planning under uncertainty but are often limited in supported modalities and rely on restrictive environment assumptions. We introduce LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search. Unlike prior work, LEGS-POMDP explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location. In simulation, multimodal fusion significantly outperforms unimodal baselines, achieving an average success rate of 89% across challenging environments and object categories. Finally, we demonstrate the full system on a quadruped mobile manipulator, where real-world experiments qualitatively validate robust multimodal perception and uncertainty reduction under ambiguous instructions.
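
To make the idea of fusing modalities under uncertainty concrete, here is a minimal sketch of a Bayesian belief update over joint hypotheses of target identity and location. It is not the paper's implementation: the state representation, the conditional-independence assumption between modalities, the `cone_likelihood` function, and every numeric value are hypothetical choices made purely for illustration.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("all hypotheses have zero weight")
    return {s: w / total for s, w in weights.items()}


def cone_likelihood(origin, direction, point, half_angle_deg=15.0, floor=0.05):
    """Hypothetical cone-based pointing model (assumption, not from the paper):
    likelihood 1.0 inside the cone around the pointing ray, a small floor outside."""
    vx, vy = point[0] - origin[0], point[1] - origin[1]
    norm_v = math.hypot(vx, vy)
    norm_d = math.hypot(direction[0], direction[1])
    if norm_v == 0 or norm_d == 0:
        return floor
    cos_angle = (vx * direction[0] + vy * direction[1]) / (norm_v * norm_d)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return 1.0 if angle <= half_angle_deg else floor


def fuse(belief, *likelihoods):
    """Bayes update assuming modalities are conditionally independent given the state:
    b'(s) is proportional to b(s) times the product of per-modality likelihoods."""
    posterior = dict(belief)
    for lik in likelihoods:
        posterior = {s: p * lik[s] for s, p in posterior.items()}
    return normalize(posterior)


# Each hypothesis pairs a target identity with a 2D location (made-up values).
states = [("red mug", (2.0, 0.0)), ("blue mug", (0.0, 2.0)), ("book", (2.0, 0.2))]
prior = normalize({s: 1.0 for s in states})

# Hypothetical language likelihood for the instruction "the mug".
language = {s: (0.9 if "mug" in s[0] else 0.1) for s in states}
# Hypothetical gesture likelihood: the human at the origin points along +x.
gesture = {s: cone_likelihood((0.0, 0.0), (1.0, 0.0), s[1]) for s in states}

posterior = fuse(prior, language, gesture)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

In this toy setup, language alone cannot separate the two mugs and the gesture alone cannot separate the red mug from the nearby book, but the fused posterior concentrates on the red mug.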

Paper Structure

This paper contains 17 sections, 6 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Multimodal fusion with belief updates disambiguates human instructions and identifies the intended object among multiple candidates.
  • Figure 2: System diagram.
  • Figure 3: Example frame showing different vector- and cone-based models of the pointing direction, with the target marked in green.
  • Figure 4: Visual grounding comparison between SoM prompting and a detector baseline (GroundingDINO).
  • Figure 5: Belief convergence in the large environment. (Top) Max-belief traces show how certainty in the most likely state evolves over time. (Bottom) Target-belief traces show probability mass assigned to the true target.
  • ...and 3 more figures
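
The two quantities named in the Figure 5 caption can be computed from a sequence of beliefs as sketched below. This is an illustrative sketch only: the function name `belief_traces`, the dictionary representation of a belief, and the example history are assumptions, not data or code from the paper.

```python
def belief_traces(belief_history, true_target):
    """Return (max-belief trace, target-belief trace) for a list of beliefs.

    Each belief is a dict mapping a hypothesis to its probability.
    """
    # Probability of the most likely hypothesis at each step.
    max_trace = [max(b.values()) for b in belief_history]
    # Probability mass assigned to the true target at each step.
    target_trace = [b.get(true_target, 0.0) for b in belief_history]
    return max_trace, target_trace


# Made-up belief history over three candidate objects.
history = [
    {"mug": 0.4, "book": 0.4, "cup": 0.2},
    {"mug": 0.3, "book": 0.6, "cup": 0.1},
    {"mug": 0.8, "book": 0.15, "cup": 0.05},
]
max_trace, target_trace = belief_traces(history, "mug")
print(max_trace)
print(target_trace)
```

The two traces can diverge: at the second step of this made-up history the belief is more certain overall while the mass on the true target drops, which is why tracking both is informative.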