OVAMOS: A Framework for Open-Vocabulary Multi-Object Search in Unknown Environments

Qianwei Wang, Yifan Xu, Vineet Kamat, Carol Menassa

TL;DR

OVAMOS tackles open-vocabulary multi-object search in unknown environments by integrating vision-language model reasoning with frontier-based exploration and POMDP planning. The framework builds a multi-layer value map from VLM cues, applies a Bayesian-inspired decay to downweight regions after missed detections, clusters high-value regions with DBSCAN, and uses POUCT to plan actions that balance targeted search with exploration. It achieves robust recovery from occlusions and detector failures, demonstrated across 120 simulated HM3D episodes and a 50 m^2 real-world office experiment, with significant gains in both success rate and efficiency over strong baselines. Collectively, OVAMOS advances scalable, robust MOS in novel environments, offering practical improvements for indoor robotic search and retrieval tasks.
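
As an illustration of the decay step mentioned above, here is a minimal sketch of how a value could be down-weighted after a missed detection. It is not the paper's formula. It assumes each value can be read as a probability that the target lies in that cell, and it applies a plain Bayes update for a "nothing detected" observation. The names `decay_after_miss` and `update_value_map` and the rates `tpr` and `fpr` are hypothetical.

```python
def decay_after_miss(value, tpr=0.7, fpr=0.05):
    """Bayes update of P(object in cell) after the detector reports nothing.

    tpr and fpr are assumed detector true/false positive rates (illustrative).
    """
    miss_given_obj = 1.0 - tpr      # object present but not detected
    miss_given_empty = 1.0 - fpr    # cell empty and nothing detected
    num = miss_given_obj * value
    den = num + miss_given_empty * (1.0 - value)
    return num / den if den > 0 else 0.0


def update_value_map(value_map, observed_cells, tpr=0.7, fpr=0.05):
    """Return a copy of value_map with the observed cells down-weighted."""
    new_map = dict(value_map)
    for cell in observed_cells:
        if cell in new_map:
            new_map[cell] = decay_after_miss(new_map[cell], tpr, fpr)
    return new_map


# Hypothetical value map keyed by grid cell; two cells are viewed twice
# without a detection, the third is never observed.
value_map = {(0, 0): 0.8, (0, 1): 0.6, (5, 5): 0.3}
after_one = update_value_map(value_map, [(0, 0), (0, 1)])
after_two = update_value_map(after_one, [(0, 0), (0, 1)])
```

Under these assumptions, repeated misses lower a cell's value gradually instead of zeroing it, so a region that still looks promising can be revisited from another viewpoint before the robot moves on to new frontiers.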

Abstract

Object search is a fundamental task for robots deployed in indoor building environments, yet challenges arise due to observation instability, especially for open-vocabulary models. While foundation models (LLMs/VLMs) enable reasoning about object locations even without direct visibility, the ability to recover from failures and replan remains crucial. The Multi-Object Search (MOS) problem further increases complexity, requiring the tracking of multiple objects and thorough exploration in novel environments, making observation uncertainty a significant obstacle. To address these challenges, we propose a framework integrating VLM-based reasoning, frontier-based exploration, and a Partially Observable Markov Decision Process (POMDP) framework to solve the MOS problem in novel environments. The VLM enhances search efficiency by inferring object-environment relationships, frontier-based exploration guides navigation in unknown spaces, and the POMDP models observation uncertainty, allowing recovery from failures in occluded and cluttered environments. We evaluate our framework on 120 simulated scenarios across several Habitat-Matterport3D (HM3D) scenes and a real-world robot experiment in a 50-square-meter office, demonstrating significant improvements in both efficiency and success rate over baseline methods.

Paper Structure

This paper contains 24 sections, 13 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: A TurtleBot deployed with OVAMOS searching for multiple objects in a cluttered office environment. Initially, the detector fails to recognize a partially occluded water bottle. Instead of immediately exploring new frontiers, OVAMOS leverages VLM-guided information to prioritize candidate points, gathering additional viewpoints in high-value regions. This strategy ultimately enables the successful detection of the target object.
  • Figure 2: OVAMOS consists of a mapping module, a planning module, and a navigation controller. The mapping module processes RGB-D inputs and textual prompts to generate an object-value map, which integrates detected objects and estimated potential object locations. If a target object is found in the object map, the robot navigates directly to it; otherwise, it relies on the value map to guide exploration. Additionally, the module constructs an obstacle map for navigation constraints and a frontier map to identify unexplored areas. The planning module maintains a belief representation of object locations and employs the POUCT algorithm to simulate and evaluate action sequences, selecting the one with the highest expected reward for execution. The navigation controller receives a target location from the planning module and outputs discrete movement commands (move forward, turn left, turn right) to guide the robot toward its destination.
  • Figure 3: Qualitative comparison between OVAMOS and Finder in simulation.
  • Figure 4: Qualitative result of OVAMOS in real-world experiment.
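
The control flow described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. A greedy pick of the highest-value cell stands in for the POUCT planner. The function names, the map layouts, the counter-clockwise heading convention, and the 15-degree turn tolerance are all assumptions.

```python
import math


def select_goal(object_map, value_map, target):
    """Go straight to a detected target; otherwise fall back to the value map."""
    if target in object_map:
        return object_map[target]
    # Greedy stand-in for the planner: pick the highest-value cell.
    return max(value_map, key=value_map.get)


def discrete_command(pose, goal, turn_tol=math.radians(15)):
    """Map a pose (x, y, heading in radians) and a goal (x, y) to one command.

    Heading is assumed counter-clockwise positive, so a positive heading
    error means the goal lies to the robot's left.
    """
    x, y, heading = pose
    bearing = math.atan2(goal[1] - y, goal[0] - x)
    # Wrap the heading error into [-pi, pi).
    error = (bearing - heading + math.pi) % (2 * math.pi) - math.pi
    if error > turn_tol:
        return "turn left"
    if error < -turn_tol:
        return "turn right"
    return "move forward"


# Hypothetical maps: one detected object and two candidate cells.
object_map = {"cup": (2, 3)}
value_map = {(0, 0): 0.2, (4, 1): 0.9}

goal_found = select_goal(object_map, value_map, "cup")
goal_search = select_goal(object_map, value_map, "water bottle")
command = discrete_command((0.0, 0.0, 0.0), goal_search)
```

In this sketch the detected cup is approached directly, while the undetected water bottle sends the robot toward the most promising cell in the value map.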