HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation
Naoki Yokoyama, Ram Ramrakhya, Abhishek Das, Dhruv Batra, Sehoon Ha
TL;DR
HM3D-OVON is a large-scale open-vocabulary ObjectNav benchmark built on HM3DSem, covering 379 object categories in photo-realistic scans of real-world environments, designed to assess generalization to free-form language goals. The authors compare end-to-end approaches (BC, DAgger, RL, DAgRL) with modular ones (VLFM, OD-augmented variants), finding that DAgRL with frontier exploration and RL fine-tuning achieves the best end-to-end performance, while the modular VLFM generalizes better to unseen categories thanks to its open-vocabulary detector. Combining an open-vocabulary detector with a goal-directed navigation module (DAgRL+OD) yields the strongest overall results across all evaluation splits, including Val Seen Synonyms and Val Unseen, and is robust to test-time noise. The work also analyzes trajectory strategies, temporal encoders, and failure modes, offering practical guidance for building flexible, robust visual-semantic navigation agents that can locate objects described in free-form language in real indoor environments.
Abstract
We present the Habitat-Matterport 3D Open Vocabulary Object Goal Navigation dataset (HM3D-OVON), a large-scale benchmark that broadens the scope and semantic range of prior Object Goal Navigation (ObjectNav) benchmarks. Leveraging the HM3DSem dataset, HM3D-OVON incorporates over 15k annotated instances of household objects across 379 distinct categories, derived from photo-realistic 3D scans of real-world environments. In contrast to earlier ObjectNav datasets, which limit goal objects to a predefined set of 6-20 categories, HM3D-OVON facilitates the training and evaluation of models with an open set of goals defined through free-form language at test time. Through this open-vocabulary formulation, HM3D-OVON encourages progress towards learning visuo-semantic navigation behaviors that can search for any object specified by text. Additionally, we systematically evaluate and compare several different types of approaches on HM3D-OVON. We find that HM3D-OVON can be used to train an open-vocabulary ObjectNav agent that both achieves higher performance and is more robust to localization and actuation noise than the state-of-the-art ObjectNav approach. We hope that our benchmark and baseline results will drive interest in developing embodied agents that can navigate real-world spaces to find household objects specified through free-form language, taking a step towards more flexible and human-like semantic visual navigation. Code and videos are available at: naoki.io/ovon.
