OVExp: Open Vocabulary Exploration for Object-Oriented Navigation

Meng Wei, Tai Wang, Yilun Chen, Hanqing Wang, Jiangmiao Pang, Xihui Liu

TL;DR

This work tackles open-vocabulary object navigation by integrating Vision-Language Models into a modular framework that builds semantic top-down maps from RGB-D observations. OVExp trains a goal-conditioned exploration policy on language-based maps and switches to vision-based maps at inference, enabling robust generalization to unseen objects and multimodal goals while reducing training costs. A lightweight transformer encoder-decoder couples map features with goal embeddings, and an analytical local planner (Fast Marching Method) translates predictions into executable paths. Experiments demonstrate strong zero-shot, cross-dataset, and cross-modality performance across HM3D and MP3D benchmarks, highlighting practical potential for scalable, open-world navigation with limited supervision.
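To make the local planning step concrete, the sketch below computes a distance field from a predicted goal cell over an occupancy grid and then walks downhill toward it. It is only an illustration: the Fast Marching Method solves a continuous wavefront equation, whereas this example substitutes plain Dijkstra on a 4-connected grid as a simpler stand-in. The names `distance_field`, `next_step`, and `plan_path` and the toy grid are hypothetical and not taken from the paper.

```python
import heapq
import math

# 4-connected moves: up, down, left, right.
STEPS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def distance_field(free, goal):
    """Grid distance from every free cell to `goal` (Dijkstra, unit step cost).

    This stands in for the Fast Marching Method, which would give a
    smoother, continuous approximation of the same distance field.
    `free[r][c]` is True for traversable cells; `goal` is assumed free.
    """
    rows, cols = len(free), len(free[0])
    dist = [[math.inf] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0.0
    heap = [(0.0, goal)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale heap entry
        for dr, dc in STEPS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and free[nr][nc]:
                nd = d + 1.0
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return dist


def next_step(dist, position):
    """Move to the neighbouring cell with the smallest distance to the goal."""
    best = position
    for dr, dc in STEPS:
        nr, nc = position[0] + dr, position[1] + dc
        if 0 <= nr < len(dist) and 0 <= nc < len(dist[0]):
            if dist[nr][nc] < dist[best[0]][best[1]]:
                best = (nr, nc)
    return best


def plan_path(free, start, goal):
    """Follow the distance field downhill from `start` until `goal`."""
    dist = distance_field(free, goal)
    path = [start]
    while path[-1] != goal:
        nxt = next_step(dist, path[-1])
        if nxt == path[-1]:
            break  # goal unreachable from here
        path.append(nxt)
    return dist, path


# Toy occupancy grid: a wall in the middle column forces a detour.
free = [
    [True, True, True],
    [True, False, True],
    [True, False, True],
]
dist, path = plan_path(free, start=(2, 0), goal=(2, 2))
```

In this toy grid the agent has to go around the wall through the top row, so the path climbs the left column, crosses the top, and descends the right column.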

Abstract

Object-oriented embodied navigation aims to locate specific objects, defined by category or depicted in images. Existing methods often struggle to generalize to open-vocabulary goals without extensive training data. While recent advances in Vision-Language Models (VLMs) offer a promising solution by extending object recognition beyond predefined categories, efficient goal-oriented exploration becomes more challenging in an open-vocabulary setting. We introduce OVExp, a learning-based framework that integrates VLMs for Open-Vocabulary Exploration. OVExp constructs scene representations by encoding observations with VLMs and projecting them onto top-down maps for goal-conditioned exploration. Goals are encoded in the same VLM feature space, and a lightweight transformer-based decoder predicts target locations while maintaining versatile representation abilities. To address the impracticality of fusing dense pixel embeddings with full 3D scene reconstruction for training, we propose constructing maps using low-cost semantic categories and transforming them into CLIP's embedding space via the text encoder. The simple but effective design of OVExp significantly reduces computational costs and demonstrates strong generalization across various navigation settings. Experiments on established benchmarks show that OVExp outperforms previous zero-shot methods, generalizes to diverse scenes, and handles different goal modalities.
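The language-based map construction described in the abstract can be sketched roughly as follows: each cell of a top-down category map is replaced by the text embedding of its category, so the map and the goal live in the same feature space. This is a minimal sketch under several assumptions. `toy_text_encoder` is a hypothetical stand-in that returns random unit vectors instead of calling CLIP's text encoder, and the cosine-similarity readout is only a simple illustration in place of the learned transformer decoder that OVExp uses to predict target locations.

```python
import numpy as np


def toy_text_encoder(names, dim=16, seed=0):
    """Hypothetical stand-in for CLIP's text encoder: random unit vectors."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(names), dim))
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def build_language_map(category_map, category_embeddings):
    """Turn a top-down category map into a feature map.

    category_map: (H, W) integer array; -1 marks unobserved cells.
    category_embeddings: (C, D) array with one unit-norm row per category.
    Returns an (H, W, D) array; unobserved cells keep a zero vector.
    """
    h, w = category_map.shape
    d = category_embeddings.shape[1]
    feature_map = np.zeros((h, w, d), dtype=category_embeddings.dtype)
    observed = category_map >= 0
    feature_map[observed] = category_embeddings[category_map[observed]]
    return feature_map


def goal_similarity(feature_map, goal_embedding):
    """Cosine similarity of every cell with the goal (cells are unit-norm or zero)."""
    return feature_map @ goal_embedding


# Toy 3x3 map with three categories and some unobserved cells.
names = ["chair", "bed", "toilet"]
embeddings = toy_text_encoder(names)
category_map = np.array([
    [-1, 0, 0],
    [1, 1, -1],
    [-1, 2, -1],
])
feature_map = build_language_map(category_map, embeddings)

# The goal is encoded in the same space as the map cells.
goal = embeddings[names.index("toilet")]
scores = goal_similarity(feature_map, goal)
best_cell = np.unravel_index(np.argmax(scores), scores.shape)
```

Because only category indices and a small embedding table are needed, building such a map avoids fusing dense per-pixel embeddings during training, which is the cost saving the abstract points to.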

Paper Structure (20 sections, 6 equations, 4 figures, 5 tables)

This paper contains 20 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Our learning-based navigation framework enables Open Vocabulary Exploration. Trained with object goals, it generalizes effectively to unseen objects, image goals, and novel scenes, demonstrating robust versatility in diverse navigation tasks.
  • Figure 2: The overall framework of OVExp for open vocabulary object-oriented exploration. OVExp can accept either language-based or vision-based maps as input and accommodates textual and visual object goals. For simplicity, the goal identification model is omitted.
  • Figure 3: Qualitative results of Zero-Shot object navigation on HM3D-ObjectNav. First row: The navigation trajectory of FBE. Second row: The navigation trajectory of OVExp-ZS.
  • Figure 4: Qualitative results of Cross-Modality object navigation on HM3D-InstanceImageNav.