LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering

Simon Boeder; Fabian Gigengack; Benjamin Risse

TL;DR

LangOcc is a self-supervised framework that distills CLIP into a 3D occupancy model via differentiable volume rendering, enabling open vocabulary semantics without 3D labels. By predicting per-voxel occupancy $V_\sigma$ and vision-language features $V_\psi$ in a 3D voxel grid and supervising them through 2D feature rendering, it learns geometry and semantics jointly from images. A temporal rendering strategy and a feature subspace reduction allow robust training and efficient inference, achieving state-of-the-art results on open vocabulary occupancy and self-supervised semantic occupancy (Occ3D-nuScenes). The approach demonstrates that strong 3D scene understanding can be achieved with vision-language supervision alone, offering scalable, open-world perception without predefined semantic categories.

Abstract

The 3D occupancy estimation task has recently become an important challenge in vision-based autonomous driving. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability. Moreover, most methods are tied to a predefined set of classes that they can detect. In this work we present LangOcc, a novel approach for open vocabulary occupancy estimation that is trained only on camera images and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, allowing for a straightforward and powerful training method without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy by a large margin while relying solely on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, thus demonstrating the effectiveness of our proposed vision-language training.
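
To make the rendering-based supervision more concrete, the following is a minimal NumPy sketch of compositing per-sample features along a single camera ray and comparing the result to a 2D target feature. It is an illustration under stated assumptions, not the authors' implementation: the function names `render_ray_features` and `cosine_loss` are hypothetical, the per-sample occupancy probability is assumed to act directly as the alpha value in standard front-to-back compositing, and the cosine-distance loss is an assumed stand-in for whatever loss the paper actually uses.

```python
import numpy as np

def render_ray_features(alpha, feats):
    """Composite features along one ray (front-to-back alpha compositing).

    alpha: (N,) occupancy probability per sample in [0, 1], ordered near to far
           (assumption: used directly as the compositing alpha).
    feats: (N, C) feature vector at each sample.
    Returns the rendered (C,) feature and the (N,) per-sample weights.
    """
    alpha = np.asarray(alpha, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    weights = trans * alpha
    return weights @ feats, weights

def cosine_loss(rendered, target, eps=1e-8):
    """Assumed loss: 1 - cosine similarity between rendered and target feature."""
    r = rendered / (np.linalg.norm(rendered) + eps)
    t = target / (np.linalg.norm(target) + eps)
    return 1.0 - float(r @ t)

# Hypothetical ray: free space, then a fully occupied sample, then a hidden one.
alpha = [0.0, 1.0, 0.5]
feats = np.eye(3)
rendered, weights = render_ray_features(alpha, feats)
loss = cosine_loss(rendered, np.array([0.0, 2.0, 0.0]))
```

Because the weights depend on the occupancy values, a loss on the rendered feature also sends gradients to the geometry, which is how feature supervision alone can shape the occupancy estimate.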

Paper Structure (31 sections, 5 equations, 5 figures, 7 tables)

Introduction
Related Work
Camera-based 3D Object Detection
Camera-based 3D Occupancy Estimation
Open Vocabulary Perception
Methodology
Problem Definition
Model Architecture
2D-to-3D Encoder
3D Head
Volume Rendering Supervision
Loss Function
Temporal Rendering
Feature Subspace Learning
Inference
...and 16 more sections

Figures (5)

Figure 1: Architecture of the proposed model. A set of images is first transformed to 3D voxel features via BEVStereo (Li et al., 2023) and a 3D CNN decoder. Next, two separate heads estimate the density probabilities and the generic scene semantics as vision-language features. The model is trained via differentiable volume rendering, using a loss between rendered estimated features and precomputed 2D features from MaskCLIP (Zhou et al., 2022). Optionally, to increase training efficiency and performance at the cost of expressiveness, feature subspace learning can be applied using a predefined vocabulary.
Figure 2: Qualitative results showing open vocabulary retrieval on nuScenes (Caesar et al., 2020). Given a text query, we compute similarities between the text embedding and each estimated voxel embedding and highlight voxels with a high similarity score. Ego vehicle shown in white.
Figure B.1: Qualitative results showing zero-shot semantic occupancy estimations.
Figure B.2: Qualitative results showing zero-shot semantic occupancy estimations.
Figure B.3: Qualitative results depicting rendered estimated 3D features and ground truth features in 2D image space. As is visible, given just the input image, our model can replicate the original CLIP embeddings accurately. However, our model estimates them in full 3D space.
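
The retrieval step described in the Figure 2 caption, and the zero-shot labelling shown in Figures B.1 and B.2, can be sketched as follows. This is a hedged illustration rather than the paper's code: the function names, the similarity threshold of 0.5, and the convention of marking unoccupied voxels with -1 are all assumptions, and the embeddings here are toy arrays standing in for real CLIP text embeddings and estimated voxel features.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve_voxels(voxel_feats, text_emb, threshold=0.5):
    """Highlight voxels whose feature is similar to a text query.

    voxel_feats: (X, Y, Z, C) estimated vision-language features.
    text_emb:    (C,) embedding of the text query.
    threshold:   hypothetical cut-off on cosine similarity.
    Returns a boolean (X, Y, Z) mask and the similarity volume.
    """
    sims = l2_normalize(voxel_feats) @ l2_normalize(text_emb)
    return sims > threshold, sims

def zero_shot_labels(voxel_feats, class_embs, occupied):
    """Assign each occupied voxel the class with the most similar text embedding.

    class_embs: (K, C) one text embedding per class name.
    occupied:   boolean (X, Y, Z) occupancy mask; free voxels get label -1.
    """
    sims = l2_normalize(voxel_feats) @ l2_normalize(class_embs).T  # (X, Y, Z, K)
    return np.where(occupied, sims.argmax(axis=-1), -1)

# Toy 1x1x2 grid with 2-dimensional features.
voxel_feats = np.array([[[[1.0, 0.0], [0.0, 1.0]]]])
mask, sims = retrieve_voxels(voxel_feats, np.array([2.0, 0.0]))
labels = zero_shot_labels(
    voxel_feats,
    class_embs=np.array([[0.0, 1.0], [1.0, 0.0]]),
    occupied=np.array([[[True, False]]]),
)
```

Normalizing both sides makes the dot product a cosine similarity, so the query magnitude does not affect which voxels are highlighted.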
