LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments

Ivy Xiao He; Stefanie Tellex; Jason Xinyu Liu

TL;DR

This work introduces LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search and explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location.

Abstract

To assist humans in open-world environments, robots must interpret ambiguous instructions to locate desired objects. Foundation model-based approaches excel at multimodal grounding, but they lack a principled mechanism for modeling uncertainty in long-horizon tasks. In contrast, Partially Observable Markov Decision Processes (POMDPs) provide a systematic framework for planning under uncertainty but are often limited in supported modalities and rely on restrictive environment assumptions. We introduce LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search. Unlike prior work, LEGS-POMDP explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location. In simulation, multimodal fusion significantly outperforms unimodal baselines, achieving an average success rate of 89% across challenging environments and object categories. Finally, we demonstrate the full system on a quadruped mobile manipulator, where real-world experiments qualitatively validate robust multimodal perception and uncertainty reduction under ambiguous instructions.
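As an illustration only, the idea of maintaining a belief over both the target's identity and its location, and sharpening it with several modalities, can be sketched as a discrete Bayes filter. This is not the paper's implementation: the object names, grid cells, likelihood values, and the assumption that modalities are conditionally independent given the hypothesis are all hypothetical choices made for the sketch.

```python
from itertools import product


def normalize(belief):
    """Rescale a dict of non-negative weights so they sum to 1."""
    total = sum(belief.values())
    if total <= 0:
        raise ValueError("all hypotheses have zero probability")
    return {h: p / total for h, p in belief.items()}


def uniform_belief(objects, cells):
    """Uniform prior over joint (object identity, location) hypotheses."""
    hypotheses = list(product(objects, cells))
    return {h: 1.0 / len(hypotheses) for h in hypotheses}


def update(belief, likelihood):
    """One Bayes step: posterior is proportional to likelihood times prior."""
    return normalize({h: likelihood(h) * p for h, p in belief.items()})


# Hypothetical likelihoods with made-up numbers, for illustration only.
def language_likelihood(hypothesis):
    """An utterance that most plausibly refers to a mug."""
    obj, _ = hypothesis
    return 0.8 if obj == "mug" else 0.2


def gesture_likelihood(hypothesis):
    """A pointing gesture that most plausibly indicates cell 2."""
    _, cell = hypothesis
    return 0.7 if cell == 2 else 0.15


objects = ["mug", "bottle"]
cells = [0, 1, 2]

belief = uniform_belief(objects, cells)
# Assumed conditional independence lets each modality be applied in turn.
belief = update(belief, language_likelihood)
belief = update(belief, gesture_likelihood)

best = max(belief, key=belief.get)
print(best, round(belief[best], 2))
```

In this toy setting, language alone leaves the location ambiguous and gesture alone leaves the identity ambiguous; applying both concentrates the belief on a single joint hypothesis, which is the qualitative effect the abstract attributes to multimodal fusion.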

Paper Structure (17 sections, 6 equations, 8 figures, 4 tables)

This paper contains 17 sections, 6 equations, 8 figures, 4 tables.

Introduction
Related Work
Technical Approach
POMDP Formulation
Multimodal Observation Model
Visual Observation
Language Observation
Gesture Observation
Evaluation of LEGS-POMDP
Modular Evaluation
Gesture Grounding
Visual Grounding
System Evaluation
Solver Comparison
Modality Evaluation
...and 2 more sections

Figures (8)

Figure 1: Multimodal fusion with belief updates disambiguates human instructions and identifies the intended object among multiple candidates.
Figure 2: System diagram.
Figure 3: Example frame showing different vector- and cone-based models of the pointing direction, with the target marked in green.
Figure 4: Visual grounding comparison between SoM prompting and a detector baseline (GroundingDINO).
Figure 5: Belief convergence in the large environment. (Top) Max-belief traces show how certainty in the most likely state evolves over time. (Bottom) Target-belief traces show probability mass assigned to the true target.
...and 3 more figures
