RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Bing Wang, Hongwei Xie, Li Liu, Shanghang Zhang

TL;DR

RenderOcc demonstrates that a vision-centric 3D occupancy model can be trained using only 2D labels: the network learns a NeRF-style semantic density field, and volume rendering turns that field into 2D semantic and depth maps that are supervised directly by 2D labels. The method introduces Auxiliary Rays from adjacent frames to compensate for sparse viewpoints and Weighted Ray Sampling to keep training efficient, which together enforce multi-view consistency in autonomous-driving scenes. On NuScenes and SemanticKITTI, RenderOcc performs competitively with baselines supervised by 3D labels and is particularly strong in static background regions. This suggests a practical route to reducing annotation burden while preserving accurate 3D scene understanding for robot perception and autonomous driving.

Abstract

3D occupancy prediction, which quantizes 3D scenes into grid cells with semantic labels, holds significant promise in the fields of robot perception and autonomous driving. Recent works mainly utilize complete occupancy labels in 3D voxel space for supervision. However, the expensive annotation process and sometimes ambiguous labels have severely constrained the usability and scalability of 3D occupancy models. To address this, we present RenderOcc, a novel paradigm for training 3D occupancy models using only 2D labels. Specifically, we extract a NeRF-style 3D volume representation from multi-view images and employ volume rendering techniques to establish 2D renderings, thus enabling direct 3D supervision from 2D semantic and depth labels. Additionally, we introduce an Auxiliary Ray method to tackle the issue of sparse viewpoints in autonomous driving scenarios, which leverages sequential frames to construct comprehensive 2D rendering for each object. To the best of our knowledge, RenderOcc is the first attempt to train multi-view 3D occupancy models using only 2D labels, reducing the dependence on costly 3D occupancy annotations. Extensive experiments demonstrate that RenderOcc achieves performance comparable to models fully supervised with 3D labels, underscoring the significance of this approach in real-world applications.
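
The rendering step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the standard NeRF-style discrete volume rendering rule (per-sample opacity from density, accumulated transmittance, weighted sum), which may differ in detail from the paper's own equations. The function name `render_ray` and its arguments are hypothetical and chosen only for this example.

```python
import numpy as np

def render_ray(sigma, sem_logits, t_vals):
    """Render expected depth and semantics for one ray (illustrative sketch).

    sigma:      (N,) non-negative densities at N samples along the ray
    sem_logits: (N, C) per-sample semantic scores
    t_vals:     (N,) increasing sample distances from the camera
    """
    # Interval length after each sample; the last interval reuses the previous one.
    deltas = np.diff(t_vals, append=t_vals[-1] + (t_vals[-1] - t_vals[-2]))
    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of the ray that reaches sample i unabsorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    depth = np.sum(weights * t_vals)   # rendered depth for this pixel
    semantics = weights @ sem_logits   # rendered per-class scores for this pixel
    return depth, semantics, weights

# Toy ray: empty space, then a dense surface at distance 3.
sigma = np.array([0.0, 0.0, 50.0, 0.0])
t_vals = np.array([1.0, 2.0, 3.0, 4.0])
sem_logits = np.eye(4)  # sample i votes for class i
depth, semantics, weights = render_ray(sigma, sem_logits, t_vals)
```

In this toy case nearly all rendering weight falls on the dense sample, so the rendered depth is close to 3 and the rendered semantics are dominated by that sample's class. Comparing such rendered values against 2D depth and semantic labels is what lets a 2D loss shape the 3D field.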

Paper Structure (15 sections, 9 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: RenderOcc represents a new training paradigm. Unlike previous works that focus on supervising with costly 3D occupancy labels, our proposed RenderOcc utilizes 2D labels to train the 3D occupancy network. Through 2D rendering supervision, the model benefits from fine-grained 2D pixel-level semantic and depth supervision.
  • Figure 2: Overall framework of RenderOcc. We extract volume features $V$ and predict density $\sigma$ and semantics $S$ for each voxel through a 2D-to-3D network. The result is the Semantic Density Field, from which volume rendering produces rendered 2D semantics and depth $\{S^{pix},D^{pix}\}$. To build the ray ground truth (Rays GT), we extract auxiliary rays from adjacent frames to supplement the rays of the current frame and filter them using the proposed Weighted Ray Sampling strategy. We then compute the loss between the Rays GT and $\{S^{pix},D^{pix}\}$, achieving rendering supervision with 2D labels.
  • Figure 3: Auxiliary Rays: Images from a single frame cannot capture multi-view information of objects well. There is only a small overlap area between two adjacent cameras, and the difference in perspective is limited. By introducing auxiliary rays from adjacent frames, the model benefits significantly from multi-view consistency constraints.
  • Figure 4: Qualitative results on NuScenes. Compared to the baseline that uses 3D labels for supervision, our proposed RenderOcc exhibits a more acute perception of object boundaries and small objects as shown in the red boxes. The crane’s arm in the image is finely perceived by RenderOcc, while BEVStereo supervised by 3D labels fails to perceive the arm floating in the air. At the same time, RenderOcc successfully identifies distant traffic cones that the baseline overlooks.
  • Figure 5: Ablation study for Auxiliary Rays. (a) mIoU rises as more adjacent frames are used. (b) Weighted Ray Sampling (WRS) effectively mitigates the additional training cost associated with auxiliary rays while improving performance.
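
The captions for Figures 2 and 5 describe Weighted Ray Sampling only at a high level, so the following is a hedged sketch of one way such a sampler could look, not the paper's procedure. It assumes, purely for illustration, that each candidate ray is weighted by an inverse class-frequency term (so rare classes are not drowned out) multiplied by an exponential decay in frame distance (so rays from far-away adjacent frames are kept less often). The names `ray_sampling_probs`, `sample_rays`, and the `decay` parameter are hypothetical.

```python
import numpy as np

def ray_sampling_probs(labels, frame_offsets, decay=0.5):
    """Sampling probability per candidate ray (assumed weighting, for illustration).

    labels:        (R,) integer semantic class of each ray's 2D label
    frame_offsets: (R,) frame distance to the current frame (0 = current frame)
    decay:         hypothetical temporal decay rate
    """
    labels = np.asarray(labels)
    offsets = np.abs(np.asarray(frame_offsets, dtype=float))
    counts = np.bincount(labels)
    w_class = 1.0 / counts[labels]       # rare classes get larger weight
    w_time = np.exp(-decay * offsets)    # distant frames get smaller weight
    probs = w_class * w_time
    return probs / probs.sum()

def sample_rays(labels, frame_offsets, num_keep, decay=0.5, rng=None):
    """Keep `num_keep` rays by weighted sampling without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    probs = ray_sampling_probs(labels, frame_offsets, decay)
    return rng.choice(len(probs), size=num_keep, replace=False, p=probs)

# 90 rays of a common class, 10 of a rare class; half come from an adjacent frame.
labels = np.array([0] * 90 + [1] * 10)
frame_offsets = np.array([0, 2] * 50)
kept = sample_rays(labels, frame_offsets, num_keep=20, rng=np.random.default_rng(0))
```

Sampling a fixed budget of rays keeps the rendering cost bounded even as auxiliary frames add candidates, which is the trade-off Figure 5(b) refers to.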