SGR-OCC: Evolving Monocular Priors for Embodied 3D Occupancy Prediction via Soft-Gating Lifting and Semantic-Adaptive Geometric Refinement

Yiran Guo, Simone Mentasti, Xiaofeng Jin, Matteo Frosi, Matteo Matteucci

Abstract

3D semantic occupancy prediction is a cornerstone for embodied AI, enabling agents to perceive dense scene geometry and semantics incrementally from monocular video streams. However, current online frameworks face two critical bottlenecks: the inherent depth ambiguity of monocular estimation that causes "feature bleeding" at object boundaries, and the "cold start" instability where uninitialized temporal fusion layers distort high-quality spatial priors during early training stages. In this paper, we propose SGR-OCC (Soft-Gating and Ray-refinement Occupancy), a unified framework driven by the philosophy of "Inheritance and Evolution". To perfectly inherit monocular spatial expertise, we introduce a Soft-Gating Feature Lifter that explicitly models depth uncertainty via a Gaussian gate to probabilistically suppress background noise. Furthermore, a Dynamic Ray-Constrained Anchor Refinement module simplifies complex 3D displacement searches into efficient 1D depth corrections along camera rays, ensuring sub-voxel adherence to physical surfaces. To ensure stable evolution toward temporal consistency, we employ a Two-Phase Progressive Training Strategy equipped with identity-initialized fusion, effectively resolving the cold start problem and shielding spatial priors from noisy early gradients. Extensive experiments on the EmbodiedOcc-ScanNet and Occ-ScanNet benchmarks demonstrate that SGR-OCC achieves state-of-the-art performance. In local prediction tasks, SGR-OCC achieves a completion IoU of 58.55% and a semantic mIoU of 49.89%, surpassing the previous best method, EmbodiedOcc++, by 3.65% and 3.69%, respectively. In challenging embodied prediction tasks, our model reaches 55.72% SC-IoU and 46.22% mIoU. Qualitative results further confirm our model's superior capability in preserving structural integrity and boundary sharpness in complex indoor environments.
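
The Gaussian gate mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `sigma` and `trunc` parameters, and the normalized weighted sum are all assumptions chosen for illustration. The idea shown is that each image sample is weighted by how well its predicted depth agrees with the depth of the projected 3D anchor, and samples that disagree strongly (likely background) are cut off entirely.

```python
import math


def soft_gate_weights(d_proj, d_pred_samples, sigma=0.2, trunc=3.0):
    """Gaussian weights on depth agreement (hypothetical parameters).

    d_proj:          depth of the projected 3D anchor
    d_pred_samples:  predicted depths of the pixels in the sampling window
    sigma:           assumed depth uncertainty (same unit as the depths)
    trunc:           samples farther than trunc * sigma get weight 0
    """
    weights = []
    for d_pred in d_pred_samples:
        z = (d_proj - d_pred) / sigma
        # Gaussian weight for consistent depths, hard cut for conflicting ones.
        weights.append(math.exp(-0.5 * z * z) if abs(z) <= trunc else 0.0)
    return weights


def gated_aggregate(features, weights, eps=1e-8):
    """Weight-normalized sum of per-sample feature vectors."""
    dim = len(features[0])
    total = sum(weights)
    if total < eps:
        # No sample agrees with the anchor depth: return a zero feature.
        return [0.0] * dim
    return [
        sum(w * f[c] for w, f in zip(weights, features)) / total
        for c in range(dim)
    ]


if __name__ == "__main__":
    # Two foreground samples near the anchor depth, one far background sample.
    w = soft_gate_weights(2.0, [2.0, 2.1, 4.5])
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(w)
    print(gated_aggregate(feats, w))
```

In this toy window the background sample at depth 4.5 receives zero weight, so its feature does not leak into the aggregated result. How the actual model estimates the uncertainty and where it truncates is not specified in this section.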

Paper Structure (33 sections, 10 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: The overall architecture of our proposed framework. The pipeline consists of a Soft-Gating Feature Lifter for robust 2D-to-3D projection and a Dynamic Ray-Constrained Anchor Refinement module for sub-voxel geometric correction. The Semantic-Adaptive GRM (right) enforces category-specific geometric constraints. The bottom-right panel illustrates our Two-Phase Progressive Training Strategy: Stage 1 freezes the monocular backbone to initialize temporal alignment (training only the Global Head), while Stage 2 unfreezes the entire network for global co-adaptation using both Local and Global supervision.
  • Figure 2: Overview of the core geometric modules. (a) The Soft-Gating mechanism evaluates depth consistency ($d_{proj}$ vs. $d_{pred}$) within a sampling window. It applies Gaussian weights to foreground regions while truncating conflicting background noise, ensuring robust feature aggregation. (b) The Refinement module restricts anchor optimization ($\boldsymbol{P}_{init} \to \boldsymbol{P}_{refined}$) to a 1D depth residual $\Delta d$ strictly along the camera ray $\vec{\boldsymbol{r}}$, adhering points to the physical surface and converting a complex 3D search into an efficient 1D correction.
  • Figure 3: Qualitative comparison of local occupancy prediction on Occ-ScanNet. From left to right: input RGB images, EmbodiedOcc [wu2025embodiedocc], EmbodiedOcc++ [wang2025embodiedocc++], our proposed method, and the Ground Truth. Our approach exhibits superior boundary sharpness and structural integrity, particularly in complex geometries like the sink (row 3) and table edges (row 2), thanks to the Soft-Gating mechanism and Ray-Constrained Refinement.
  • Figure 4: Qualitative results of embodied occupancy prediction on EmbodiedOcc-ScanNet. The rows display (top to bottom): the ground truth occupancy maps, our SGR-OCC predictions, and the underlying Gaussian Memory accumulation. The columns represent the progressive reconstruction of the global scene across a 30-frame sequence. Our model successfully integrates sequential observations into a coherent Gaussian Memory (bottom row), allowing for stable and high-fidelity global occupancy prediction (middle row) that demonstrates strong structural consistency with the ground truth.
  • Figure 5: Effectiveness of the proposed SGR-OCC components. (a) Qualitative comparison of anchor refinement: Our Ray-Constrained Refinement ($R^1$) reduces geometric drift compared to the unconstrained baseline ($R^3$), forcing anchors to adhere strictly to physical surfaces (indicated by cooler colors). (b) Quantitative analysis of training stability: Compared to E2E (End-to-End) baselines, our Two-Phase Progressive Training effectively bypasses the cold start drop, leading to faster convergence and a superior mIoU of 46.22%.
  • ...and 6 more figures
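
The ray-constrained update described in the Figure 2(b) caption, $\boldsymbol{P}_{refined} = \boldsymbol{P}_{init} + \Delta d \, \vec{\boldsymbol{r}}$, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the function names are hypothetical, $\vec{\boldsymbol{r}}$ is taken to be the unit vector from the camera center through the anchor, $\Delta d$ is treated as a signed distance along that ray, and the `max_step` clamp is an added safeguard that the source does not mention.

```python
import math


def unit_ray(cam_center, point):
    """Unit direction from the camera center to a 3D point."""
    d = [p - c for p, c in zip(point, cam_center)]
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0.0:
        raise ValueError("point coincides with the camera center")
    return [x / norm for x in d]


def refine_along_ray(cam_center, p_init, delta_d, max_step=0.5):
    """Move an anchor by a scalar residual along its camera ray.

    delta_d:  signed 1D correction (hypothetically predicted by a network)
    max_step: assumed clamp on the correction magnitude, for illustration
    """
    r = unit_ray(cam_center, p_init)
    step = max(-max_step, min(max_step, delta_d))
    # Only one degree of freedom changes: the distance along the ray.
    return [p + step * ri for p, ri in zip(p_init, r)]


if __name__ == "__main__":
    cam = [0.0, 0.0, 0.0]
    p_init = [3.0, 0.0, 4.0]  # 5 units from the camera
    p_refined = refine_along_ray(cam, p_init, -0.5)
    print(p_refined)
```

Because the correction is a single scalar, the refined anchor always projects to the same pixel as the initial one, which is what distinguishes this from an unconstrained 3D offset (the $R^3$ baseline in Figure 5).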