O3N: Omnidirectional Open-Vocabulary Occupancy Prediction

Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang

Abstract

Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distributions, making them difficult to apply to embodied agents that require comprehensive and safe scene perception during open-world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on the QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.

Paper Structure (27 sections, 12 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: (a) Omnidirectional Open-vocabulary Occupancy Prediction. Given only an omnidirectional RGB image and the text of the required class names, the proposed O3N can predict semantic categories that were not labeled during training. For example, a closed-set semantic occupancy prediction model might misclassify a box as road or a dog as a bicycle; providing the corresponding class names enables accurate predictions. (b) Results on the QuadOcc benchmark. O3N reaches $16.54$ mIoU and $21.16$ Novel mIoU, achieving state-of-the-art performance.
  • Figure 2: (a) Omnidirectional open-vocabulary occupancy prediction aims to perform open-vocabulary 3D occupancy prediction solely based on pure omnidirectional visual perception; (b) Regions distant from the viewpoint occupy a smaller proportion of pixels in the image due to the perspective effect and latitude distortion inherent in the equirectangular (ERP) projection; (c) Projection of voxel centers in 3D space onto an omnidirectional image. The projection points become increasingly dense with growing distance from the voxel centers to the imaging viewpoint.
  • Figure 3: O3N architecture. O3N takes an equirectangular omnidirectional image as input and is trained fully end-to-end. The image features and text embeddings are pre-extracted via the language-image encoder. The 3D decoder, enhanced with the proposed Polar-spiral Mamba (PsM) module, captures both geometric and semantic dependencies across directions. The Occupancy Cost Aggregation (OCA) and Natural Modality Alignment (NMA) modules integrate pixel, voxel, and text modalities to achieve consistent open-vocabulary semantic reasoning in 3D space. We train O3N efficiently through semantic occupancy of visible classes, occupancy cost aggregation, and voxel-pixel consistency.
  • Figure 4: The Polar-spiral Mamba module uses a dual-branch architecture to effectively model the spatial structure of omnidirectional images. P-SMamba scans the space in an outward spiral pattern, precisely capturing variations in information density within polar regions. Voxel features are aggregated progressively between polar and Cartesian coordinates to generate comprehensive features that maintain geometric and semantic continuity.
  • Figure 5: Qualitative results. O3N more effectively maintains the clarity and continuity of global geometry and semantics, and achieves significant improvements over the baseline in terms of perception and generalization to unknown semantics.
  • ...and 9 more figures
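
The density effect described in the Figure 2(c) caption can be checked numerically with a standard equirectangular (ERP) projection. The sketch below is illustrative only and is not taken from the O3N code: the axis convention (x forward, y left, z up), the 1024×512 image size, the 0.4 m voxel spacing, and the helper names `project_to_erp` and `pixel_gap` are all assumptions made for the example.

```python
import math


def project_to_erp(point, width, height):
    """Project a 3D point onto an equirectangular image.

    Assumed convention (hypothetical, not from the paper):
    x forward, y left, z up, camera at the origin.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point coincides with the camera centre")
    lon = math.atan2(y, x)  # azimuth in [-pi, pi]
    lat = math.asin(z / r)  # elevation in [-pi/2, pi/2]
    u = (0.5 - lon / (2.0 * math.pi)) * width   # column coordinate
    v = (0.5 - lat / math.pi) * height          # row coordinate
    return u, v


def pixel_gap(distance, voxel_size, width=1024, height=512):
    """Pixel distance between the projections of two neighbouring voxel
    centres that lie `distance` metres ahead of the camera."""
    a = project_to_erp((distance, 0.0, 0.0), width, height)
    b = project_to_erp((distance, voxel_size, 0.0), width, height)
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Same voxel spacing, two different ranges (illustrative values).
near_gap = pixel_gap(2.0, 0.4)
far_gap = pixel_gap(20.0, 0.4)
print(f"gap at 2 m: {near_gap:.2f} px, gap at 20 m: {far_gap:.2f} px")
```

Under these assumptions, neighbouring voxel centres at the longer range land much closer together in the image than those at the shorter range, which is the crowding of projection points that the caption describes.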