Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Changqing Zhou; Yueru Luo; Han Zhang; Zeyu Jiang; Changhao Chen

TL;DR

This framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding, and introduces an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation.

Abstract

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
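To make the idea of a temperature decay schedule concrete, the following is a minimal sketch that contrasts a linear schedule with an exponential one that approaches its floor quickly. The function names, the default values of `t_max` and `t_min`, and the `rate` constant are all illustrative assumptions; the abstract does not specify the exact functional form, and how the temperature enters the opacity computation is not modeled here.

```python
import math


def linear_temperature(step, total_steps, t_max=1.0, t_min=0.1):
    """Hypothetical linear schedule: decreases uniformly from t_max to t_min."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_max + (t_min - t_max) * frac


def exponential_temperature(step, total_steps, t_max=1.0, t_min=0.1, rate=5.0):
    """Hypothetical exponential schedule: drops quickly toward t_min early on.

    `rate` is an assumed constant controlling how fast the floor is approached;
    the schedule stays slightly above t_min at the final step.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_min + (t_max - t_min) * math.exp(-rate * frac)


if __name__ == "__main__":
    total = 100
    for step in (0, 25, 50, 75, 100):
        lin = linear_temperature(step, total)
        exp = exponential_temperature(step, total)
        print(f"step={step:3d}  linear={lin:.3f}  exponential={exp:.3f}")
```

Under these assumed constants, the exponential schedule spends most of training close to the minimum temperature, which leaves more iterations in the low-temperature regime than the linear schedule does.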

Paper Structure (18 sections, 12 equations, 5 figures, 4 tables)

This paper contains 18 sections, 12 equations, 5 figures, 4 tables.

Introduction
Related works
Occupancy Prediction
Open Vocabulary Occupancy Prediction
Method
Problem Setting
LegoOcc Framework Overview
Poisson-based Gaussian-to-Occupancy
A Poisson approach.
Progressive Temperature Decay
Losses
Experiments
Datasets and Metrics
Implementation Details
Main Results
...and 3 more sections

Figures (5)

Figure 1: Closed- vs. open-vocabulary occupancy. Prior methods such as ISO and EmbodiedOcc, trained under a closed vocabulary, can label only the categories predefined at training time, which restricts real-world deployment. Our open-vocabulary approach aligns language with 3D occupancy and supports text queries for arbitrary categories. Right column (Random Class): text-conditioned per-voxel scores are visualized as heatmaps; darker red indicates higher likelihood for the queried category.
Figure 2: LegoOcc Framework Overview. From a monocular image, a feed-forward Gaussian model produces Language-Embedded Gaussians. Training proceeds along two coupled paths. For semantic learning, we differentiably render Gaussian features to the image with Progressive Temperature Decay and align them to a training-free open-vocabulary segmenter via a cosine objective $L_{\text{feat}}$. For geometry learning, we convert Gaussians to occupancy using an opacity-aware, Poisson-based Gaussian-to-Occupancy operator and supervise binary occupancy with $L_{\text{occ}}$. At inference, the Language-Embedded Occupancy supports text-driven queries by computing cosine similarity between voxel embeddings and prompt embeddings, yielding open-vocabulary semantic occupancy without dense voxel-level semantic labels during training.
Figure 3: Comparison of temperature schedules. Linear decay decreases $\tau$ uniformly, whereas our exponential schedule rapidly approaches $T_{\min}$, allocating more iterations for the model to adapt to the low-temperature regime.
Figure 4: Qualitative results on Occ-ScanNet. From top to bottom: (a) input images; (b) ground-truth semantic occupancy; (c) results from our re-implemented LOcc [yu2025language]; (d) our method. Both (c) and (d) are trained with geometry-only annotations and evaluated on the closed-vocabulary annotation of Occ-ScanNet.
Figure 5: Open-vocabulary qualitative results. Legends list the VLM-extracted object nouns used as text queries. (a) Input image. (b) Open-vocabulary 2D segmentation for queried nouns. (c) Our 3D open-vocabulary occupancy colored by the same categories.
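
The text-driven query described in the Figure 2 caption (cosine similarity between voxel embeddings and prompt embeddings) can be sketched as follows. This is a toy illustration under stated assumptions: the embeddings are plain Python lists rather than outputs of a real text encoder, the helper names `cosine_similarity` and `query_voxels` are hypothetical, and assigning each voxel to its highest-scoring prompt is one simple choice rather than a confirmed detail of the method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_voxels(voxel_embeddings, prompt_embeddings):
    """Score every voxel against every prompt and pick the best prompt per voxel.

    Returns (scores, labels), where scores[i][j] is the similarity between
    voxel i and prompt j, and labels[i] is the index of the best prompt.
    """
    scores = [
        [cosine_similarity(v, p) for p in prompt_embeddings]
        for v in voxel_embeddings
    ]
    labels = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return scores, labels


if __name__ == "__main__":
    # Toy 3-D embeddings standing in for language-aligned features (assumed values).
    voxels = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.9, 0.1, 0.0]]
    prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # e.g. two text queries
    scores, labels = query_voxels(voxels, prompts)
    print(labels)
```

The per-voxel score rows can also be used directly, for instance to render a heatmap for a single queried category as in the right column of Figure 1.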
