SD-OVON: A Semantics-aware Dataset and Benchmark Generation Pipeline for Open-Vocabulary Object Navigation in Dynamic Scenes

Dicong Qiu, Jiadi You, Zeying Gong, Ronghe Qiu, Hui Xiong, Junwei Liang

TL;DR

SD-OVON tackles the challenge of open-vocabulary object navigation in dynamic environments by introducing a semantics-aware pipeline that generates infinite photo-realistic scene variants from real-world scans and movable object models, with Habitat integration for automatic episode generation. The approach combines pretrained multimodal models for scene synthesis, open-vocabulary instance extraction, receptacle plane detection, and region-receptacle semantics-driven object placement to create realistic dynamic benchmarks. It provides two pre-generated ObjectNav datasets (SD-OVON-3k and SD-OVON-10k) and establishes two baselines, demonstrating the value of semantics-aware planning in dynamic settings and highlighting the importance of detector quality and final navigation strategies. The work advances open-vocabulary navigation research toward real-world applicability by enabling real-to-sim and sim-to-real evaluation in dynamic spaces and offering publicly available data and code for the community.

Abstract

We present the Semantics-aware Dataset and Benchmark Generation Pipeline for Open-vocabulary Object Navigation in Dynamic Scenes (SD-OVON). It utilizes pretrained multimodal foundation models to generate infinite unique photo-realistic scene variants that adhere to real-world semantics and daily commonsense for the training and evaluation of navigation agents, accompanied by a plugin for generating object navigation task episodes compatible with the Habitat simulator. In addition, we offer two pre-generated object navigation task datasets, SD-OVON-3k and SD-OVON-10k, comprising about 3k and 10k episodes, respectively, of the open-vocabulary object navigation task. Both are derived from the SD-OVON-Scenes dataset, with 2.5k photo-realistic scans of real-world environments, and the SD-OVON-Objects dataset, with 0.9k manually inspected scanned and artist-created manipulatable object models. Unlike prior datasets limited to static environments, SD-OVON covers dynamic scenes and manipulatable objects, facilitating both real-to-sim and sim-to-real robotic applications. This approach enhances the realism of navigation tasks and benefits the training and evaluation of open-vocabulary object navigation agents in complex settings. To demonstrate the effectiveness of our pipeline and datasets, we propose two baselines and evaluate them along with state-of-the-art baselines on SD-OVON-3k. The datasets, benchmark and source code are publicly available.

Paper Structure

This paper contains 35 sections, 14 equations, 7 figures, 9 tables, 4 algorithms.

Figures (7)

  • Figure 1: Visualization of example scene variants generated by SD-OVON. Manipulatable objects are placed in accordance with daily commonsense, considering both receptacle types and regions.
  • Figure 2: An illustration of the SD-OVON pipeline. It (a) randomly samples RGB-D observations $o$ from a scene $s$, (b) extracts and merges open-vocabulary 3D semantic instances $I$ from the observations, (c) identifies receptacles and available planes $S$, and (d) generates scene variants by placing manipulatable objects $b$ at corresponding positions $p$ on appropriate receptacles, adhering to daily semantic commonsense.
  • Figure 3: Example trajectories of successful navigation with Semantic A* (left), Random A* (middle) and VLFM (Yokoyama et al., 2024) (right), respectively, for an ObjectNav task episode from SD-OVON-3k.
  • Figure 4: The complete statistics of object category appearance frequencies across the 363 scene variants from the SD-OVON-3k dataset.
  • Figure 5: The complete statistics of the navigation goal object category appearance frequencies across the 2897 ObjectNav task episodes from the SD-OVON-3k dataset.
  • ...and 2 more figures
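
Step (d) of the pipeline in Figure 2, placing objects $b$ at positions $p$ on receptacle planes $S$ according to region and receptacle semantics, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the `Plane` class, the `place_objects` function, and the hand-written `TOY_SCORES` table are hypothetical, and the table merely stands in for the semantic judgments that the paper obtains from pretrained multimodal models. The sketch also ignores collisions and object footprints.

```python
import random
from dataclasses import dataclass


@dataclass
class Plane:
    """Hypothetical axis-aligned receptacle plane S (not the paper's data structure)."""
    receptacle: str  # receptacle type, e.g. "table"
    region: str      # region the receptacle belongs to, e.g. "kitchen"
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    height: float


def place_objects(objects, planes, score, rng):
    """Pick a plane for each object b, weighted by score(b, region, receptacle),
    then sample a position p uniformly on that plane."""
    placements = {}
    for b in objects:
        weights = [score(b, s.region, s.receptacle) for s in planes]
        if sum(weights) <= 0:
            continue  # no semantically plausible receptacle: leave the object out
        s = rng.choices(planes, weights=weights, k=1)[0]
        p = (rng.uniform(s.x_min, s.x_max), rng.uniform(s.y_min, s.y_max), s.height)
        placements[b] = (s, p)
    return placements


# Toy compatibility table (assumed values, for illustration only).
TOY_SCORES = {
    ("mug", "kitchen", "table"): 1.0,
    ("pillow", "bedroom", "bed"): 1.0,
}


def toy_score(b, region, receptacle):
    # Unlisted combinations are treated as implausible.
    return TOY_SCORES.get((b, region, receptacle), 0.0)


planes = [
    Plane("table", "kitchen", 0.0, 1.0, 0.0, 1.0, 0.75),
    Plane("bed", "bedroom", 2.0, 4.0, 0.0, 2.0, 0.5),
]
variant = place_objects(["mug", "pillow", "unknown_item"], planes, toy_score, random.Random(0))
for name, (plane, position) in variant.items():
    print(name, "->", plane.region, plane.receptacle, position)
```

Calling `place_objects` again with a different random seed yields a different variant of the same scene, which mirrors how repeated sampling can produce many scene variants from one scan.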