LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering

Simon Boeder, Fabian Gigengack, Benjamin Risse

TL;DR

LangOcc presents a self-supervised framework that distills CLIP into a 3D occupancy model via differentiable volume rendering, enabling open vocabulary semantics without 3D labels. By predicting per-voxel occupancy $V_\sigma$ and vision-language features $V_\psi$ in a 3D voxel grid and supervising through 2D feature rendering, it learns geometry and semantics jointly from images. A temporal rendering strategy and a feature subspace reduction allow robust training and efficient inference, achieving state-of-the-art results on open vocabulary occupancy and self-supervised semantic occupancy (Occ3D-nuScenes). The approach demonstrates that strong 3D scene understanding can be achieved with vision-language supervision alone, offering scalable, open-world perception without predefined semantic categories.

Abstract

The 3D occupancy estimation task has recently become an important challenge in vision-based autonomous driving. However, most existing camera-based methods rely on costly 3D voxel labels or LiDAR scans for training, limiting their practicality and scalability. Moreover, most methods are tied to a predefined set of classes that they can detect. In this work we present a novel approach for open vocabulary occupancy estimation called LangOcc, which is trained only on camera images and can detect arbitrary semantics via vision-language alignment. In particular, we distill the knowledge of the strong vision-language aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering our estimations back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, allowing for a straightforward and powerful training method without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy by a large margin, solely relying on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, thus demonstrating the effectiveness of our proposed vision-language training.
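
The idea of rendering 3D estimates back to 2D can be illustrated with a minimal sketch of feature compositing along a single camera ray. It assumes the standard NeRF-style formulation (per-sample opacity from density, accumulated transmittance, weighted feature sum); the paper's exact rendering equations and loss may differ. The function names `render_ray_features` and `cosine_distance`, the uniform `step` size, and the choice of a cosine-based loss are all assumptions made for illustration, not taken from the source.

```python
import math


def render_ray_features(densities, features, step):
    """Composite per-sample features along one ray (NeRF-style sketch).

    densities: non-negative density at each sample, ordered near to far.
    features:  one feature vector (list of floats) per sample.
    step:      distance between consecutive samples (assumed uniform).
    Returns the rendered feature vector and the per-sample weights.
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # probability that the ray reaches the current sample
    for sigma, feat in zip(densities, features):
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this sample
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return rendered, weights


def cosine_distance(a, b):
    """1 - cosine similarity; a hypothetical stand-in for the 2D feature loss."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0.0 else 1.0


# Usage: an empty sample in front of a nearly opaque one.
rendered, weights = render_ray_features(
    densities=[0.0, 50.0],
    features=[[1.0, 0.0], [0.0, 1.0]],
    step=1.0,
)
target_2d_feature = [0.0, 1.0]  # stands in for a precomputed 2D feature
loss = cosine_distance(rendered, target_2d_feature)
```

Because the weights depend on the densities, a loss computed only on rendered features still sends gradients to the geometry: the rendered feature matches the 2D target only if opacity is placed at the right depth. This is one way to read the abstract's claim that the training mechanism supervises scene geometry without explicit geometry labels.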

Paper Structure (31 sections, 5 equations, 5 figures, 7 tables)

This paper contains 31 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Architecture of the proposed model. A set of images is first transformed to 3D voxel features via BEVStereo [li2023bevstereo] and a 3D CNN decoder. Next, two separate heads estimate the density probabilities and the generic scene semantics as vision-language features. The model is trained via differentiable volume rendering, using a loss between rendered estimated features and precomputed 2D features from MaskCLIP [zhou2022extract]. Optionally, to increase training efficiency and performance at the cost of expressiveness, feature subspace learning can be applied using a predefined vocabulary.
  • Figure 2: Qualitative results showing open vocabulary retrieval on nuScenes [caesar2020nuscenes]. Given a text query, we compute similarities between the text embedding and each estimated voxel embedding and highlight voxels with a high similarity score. Ego vehicle shown in white.
  • Figure B.1: Qualitative results showing zero-shot semantic occupancy estimations.
  • Figure B.2: Qualitative results showing zero-shot semantic occupancy estimations.
  • Figure B.3: Qualitative results depicting rendered estimated 3D features and ground truth features in 2D image space. As is visible, given just the input image, our model can replicate the original CLIP embeddings accurately. However, our model estimates them in full 3D space.
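
The retrieval step described in the Figure 2 caption (comparing a text embedding against each voxel embedding and keeping high-similarity voxels) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the use of cosine similarity, and the `threshold` value are assumptions. The `classify_voxels` helper shows one plausible way to obtain zero-shot semantic labels, as in Figures B.1 and B.2, by picking the most similar class prompt per voxel; the source does not spell out this procedure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def retrieve_voxels(text_embedding, voxel_embeddings, threshold=0.5):
    """Indices of voxels whose embedding is similar enough to the text query.

    threshold is a hypothetical cut-off chosen for this example.
    """
    return [
        i
        for i, voxel in enumerate(voxel_embeddings)
        if cosine_similarity(text_embedding, voxel) >= threshold
    ]


def classify_voxels(class_embeddings, voxel_embeddings):
    """Assign each voxel the index of its most similar class text embedding."""
    return [
        max(
            range(len(class_embeddings)),
            key=lambda c: cosine_similarity(class_embeddings[c], voxel),
        )
        for voxel in voxel_embeddings
    ]


# Usage with toy 2D embeddings (real CLIP embeddings have far more dimensions).
voxels = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.0, 0.0]]
query = [1.0, 0.0]  # stands in for the embedding of a text query
hits = retrieve_voxels(query, voxels, threshold=0.9)
labels = classify_voxels([[1.0, 0.0], [0.0, 1.0]], voxels[:2])
```

In practice the voxel embeddings would come from the model's feature head and the query embedding from the CLIP text encoder; a full pipeline would likely also restrict retrieval to voxels predicted as occupied.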