HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation

Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, Sehoon Ha

TL;DR

HM3D-OVON is a large-scale open-vocabulary ObjectNav benchmark built on HM3DSem. It covers 379 object categories in photo-realistic scans of real-world environments and is designed to assess generalization to goals given in free-form language. The authors compare end-to-end approaches (BC, DAgger, RL, DAgRL) with modular approaches (VLFM, OD-augmented variants). Among end-to-end methods, DAgRL with frontier exploration and RL fine-tuning performs best, while the modular VLFM generalizes better to unseen categories thanks to its open-vocabulary detector. Combining an open-vocabulary detector with a goal-directed navigation module (DAgRL+OD) yields the strongest overall results across all evaluation splits, including Val Seen Synonyms and Val Unseen, and is robust to test-time noise. The work also analyzes trajectory strategies, temporal encoders, and failure modes, offering practical guidance for building navigation agents that can locate objects described in free-form language in real indoor environments.

Abstract

We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON), a large-scale benchmark that broadens the scope and semantic range of prior Object Goal Navigation (ObjectNav) benchmarks. Leveraging the HM3DSem dataset, HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories, derived from photo-realistic 3D scans of real-world environments. In contrast to earlier ObjectNav datasets, which limit goal objects to a predefined set of 6-20 categories, HM3D-OVON facilitates the training and evaluation of models with an open set of goals defined through free-form language at test time. Through this open-vocabulary formulation, HM3D-OVON encourages progress towards learning visuo-semantic navigation behaviors capable of searching for any object specified by text. Additionally, we systematically evaluate and compare several different types of approaches on HM3D-OVON. We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that both achieves higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach. We hope that our benchmark and baseline results will drive interest in developing embodied agents that can navigate real-world spaces to find household objects specified through free-form language, taking a step towards more flexible and human-like semantic visual navigation. Code and videos are available at: naoki.io/ovon.

Paper Structure

This paper contains 15 sections, 2 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: We study the Open-Vocabulary ObjectNav (OVON) task, which involves an agent tasked with navigating to object goals in an open-set, specified through language. In the above example, an agent is tasked with navigating to an 'L-Shaped Couch'.
  • Figure 2: Our OVON policy encodes the current visual observation $I_t$, the goal object category $G$, and the previous action $a_{t-1}$ to form observation embedding $o_t$. At each step, the embedding sequence for the past 100 time steps is fed into a transformer, which uses an action head to sample an action $a_t$.
  • Figure 3: Examples of successes of our DAgRL policy for each evaluation split. DAgRL can efficiently explore the environment, avoid obstacles, and stop in front of the goal object when it is spotted, using only RGB observations. Videos can be found at http://naoki.io/ovon.
  • Figure 4: Failure analysis of DAgRL. As the goal object categories become less similar to those seen in training (i.e., Val Seen, Val Unseen), the agent more frequently fails from timeouts (never calling stop) and more frequently ignores the goal object.
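
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the action names, the feature dimensions, the use of concatenation to form $o_t$, and the random weights are all assumptions, and the transformer is replaced by a mean-pooling placeholder so that the example stays dependency-free. What it does show is the 100-step sliding window of observation embeddings and the sampling of $a_t$ from an action head.

```python
import math
import random
from collections import deque

CONTEXT_LEN = 100  # window of past time steps, as stated in the Figure 2 caption
# Assumed discrete action set (hypothetical names, for illustration only).
ACTIONS = ["stop", "move_forward", "turn_left", "turn_right", "look_up", "look_down"]


def make_observation_embedding(visual_feat, goal_feat, prev_action_idx):
    """Form o_t from I_t features, G features, and a one-hot a_{t-1}.

    Concatenation is an assumption; the caption only says the three are encoded.
    """
    prev_action = [0.0] * len(ACTIONS)
    if prev_action_idx is not None:
        prev_action[prev_action_idx] = 1.0
    return list(visual_feat) + list(goal_feat) + prev_action


def temporal_encoder(window):
    """Placeholder for the transformer: mean-pools the embeddings in the window."""
    n = len(window)
    dim = len(window[0])
    return [sum(o[i] for o in window) / n for i in range(dim)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_head(state, weights, rng):
    """Linear layer followed by softmax; samples an action index."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    probs = softmax(logits)
    idx = rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]
    return idx, probs


def run_episode(num_steps, visual_dim=4, goal_dim=3, seed=0):
    rng = random.Random(seed)
    embed_dim = visual_dim + goal_dim + len(ACTIONS)
    # Random, untrained weights: stand-ins for learned parameters.
    weights = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in ACTIONS]
    goal_feat = [rng.random() for _ in range(goal_dim)]  # stand-in for an encoding of G
    history = deque(maxlen=CONTEXT_LEN)  # drops the oldest o_t once 100 are stored
    prev_action = None
    actions = []
    for _ in range(num_steps):
        visual_feat = [rng.random() for _ in range(visual_dim)]  # stand-in for encoded I_t
        history.append(make_observation_embedding(visual_feat, goal_feat, prev_action))
        state = temporal_encoder(history)
        prev_action, _ = action_head(state, weights, rng)
        actions.append(prev_action)
    return actions, len(history)
```

Calling `run_episode(150)` returns 150 sampled action indices and a final history length of 100, since the bounded `deque` discards embeddings older than the context window. In a real policy the placeholder encoder would be a trained transformer over the full embedding sequence rather than a mean.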