POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

Antonin Vobecky; Oriane Siméoni; David Hurych; Spyros Gidaris; Andrei Bursuc; Patrick Pérez; Josef Sivic

TL;DR

POP-3D tackles open-vocabulary 3D occupancy prediction from images with a tri-modal self-supervised framework that leverages images, LiDAR, and a pre-trained image-language model. It introduces a two-head architecture: a class-agnostic occupancy head and a 3D-language head that produces text-aligned voxel embeddings. The heads are trained with an occupancy loss $\mathcal{L}_\text{occ}$ and a feature loss $\mathcal{L}_\text{ft}$, combined as $\mathcal{L} = \mathcal{L}_\text{occ} + \lambda \mathcal{L}_\text{ft}$. The model performs zero-shot 3D semantic segmentation and language-driven 3D grounding without 3D language labels and without LiDAR at test time. It is evaluated on nuScenes, including a new retrieval benchmark, where it is competitive with fully supervised baselines and improves over MaskCLIP+. This work enables scalable 3D scene understanding using natural language queries for autonomous systems.
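To make the combined objective concrete, the following is a minimal sketch of how an occupancy loss and a feature loss could be summed with the weight $\lambda$. Only the weighted sum itself comes from the summary above. The specific forms used here (binary cross-entropy for the occupancy term, mean squared L2 distance for the feature term) and all function and parameter names are assumptions made for illustration, not the authors' implementation.

```python
import math


def occupancy_loss(occ_probs, occ_labels, eps=1e-7):
    """Binary cross-entropy averaged over voxels (ASSUMED form of the occupancy loss)."""
    total = 0.0
    for p, y in zip(occ_probs, occ_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(occ_probs)


def feature_loss(pred_feats, target_feats):
    """Mean squared L2 distance between predicted and target features
    (ASSUMED form of the feature loss; targets are hypothetical text-aligned features)."""
    total = 0.0
    for pred, target in zip(pred_feats, target_feats):
        total += sum((a - b) ** 2 for a, b in zip(pred, target))
    return total / len(pred_feats)


def total_loss(occ_probs, occ_labels, pred_feats, target_feats, lam=1.0):
    """Combined objective: occupancy loss + lam * feature loss."""
    return occupancy_loss(occ_probs, occ_labels) + lam * feature_loss(pred_feats, target_feats)


if __name__ == "__main__":
    # Toy example: two voxels for occupancy, one supervised feature vector.
    loss = total_loss([0.5, 0.5], [1, 0], [[1.0, 0.0]], [[0.0, 0.0]], lam=0.5)
    print(f"total loss: {loss:.4f}")
```

In this sketch, setting `lam` to zero trains occupancy only, while larger values put more emphasis on aligning voxel features with the language-aligned targets.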

Abstract

We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images, with the objective of enabling 3D grounding, segmentation, and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language, and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets, and 3D grounding and retrieval of free-form language queries using a small dataset that we propose as an extension of nuScenes. The project page is available at https://vobecant.github.io/POP3D.
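As an illustration of how a dense voxel map of language-aligned embeddings can support zero-shot segmentation, the sketch below labels each voxel predicted as occupied with the class whose text embedding is most similar to the voxel feature. The cosine-similarity rule, the 0.5 occupancy threshold, the `"free"` label, and the hand-written toy vectors are all assumptions. In practice the class embeddings would come from the text encoder of the image-language model, which is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_segment(voxel_feats, occ_probs, class_embeds, class_names, threshold=0.5):
    """Assign a class name to each occupied voxel, and "free" to the rest.

    voxel_feats:  one feature vector per voxel (hypothetical 3D-language head output)
    occ_probs:    one occupancy probability per voxel (hypothetical occupancy head output)
    class_embeds: one text embedding per class prompt (toy vectors in this sketch)
    threshold:    ASSUMED occupancy cut-off
    """
    labels = []
    for feat, prob in zip(voxel_feats, occ_probs):
        if prob < threshold:
            labels.append("free")
            continue
        sims = [cosine(feat, emb) for emb in class_embeds]
        labels.append(class_names[sims.index(max(sims))])
    return labels


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    probs = [0.9, 0.8, 0.1]
    embeds = [[1.0, 0.1], [0.1, 1.0]]
    print(zero_shot_segment(feats, probs, embeds, ["car", "road"]))
```

Because the class list is only a set of text embeddings, changing the vocabulary in this sketch means swapping `class_embeds` and `class_names`, with no change to the voxel features.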

Paper Structure (23 sections, 10 equations, 12 figures, 4 tables)

Introduction
Related work
Open-vocabulary 3D occupancy prediction
Architecture for open-vocabulary 3D occupancy prediction
Tri-modal self-supervised training
3D open-vocabulary test-time inference
Experiments
Experimental setup
Comparison to the state of the art
Sensitivity analysis
Loss weight $\lambda$.
Input resolution and image backbone.
Depth of prediction head.
Demonstration of open-vocabulary capabilities
Limitations.
...and 8 more sections

Figures (12)

Figure 1: Overview of the proposed method. Provided only with surround-view images as input, our model called POP-3D produces a voxel grid of 3D text-aligned features that support open-vocabulary downstream tasks such as zero-shot occupancy segmentation or text-based grounding and retrieval.
Figure 2: Proposed approach. In (a), we show the architecture of the proposed method. Given only surround-view images as input, the model first extracts a dense voxel feature grid that is then fed to two parallel heads: the occupancy head $g$, which produces voxel-level occupancy predictions, and the 3D-language feature head $h$, which outputs features aligned with text representations. In (b), we show how we train our approach, namely the occupancy loss $\mathcal{L}_\text{occ}$ used to train class-agnostic occupancy predictions, and the feature loss $\mathcal{L}_\text{ft}$ that trains the 3D-language head $h$ to output features aligned with text representations.
Figure 3: Validation labels: blue = free, red = occupied, and gray = ignored voxels.
Figure 4: Comparison to the state of the art. We compare our POP-3D approach to different baselines using (a) the LiDAR-based evaluation, (b) occupancy evaluation, and (c) open-vocabulary language-driven retrieval. In (a), our zero-shot approach POP-3D outperforms the strong MaskCLIP+ [zhou2022maskclip] (M.CLIP+) baseline while closing the gap to the fully supervised model. Other recent methods that use supervision and require LiDAR points during inference (ODISE [xu2023open] and OpenScene [Peng2023OpenScene]) perform even better. All methods that require manual annotations during training are denoted by striped bars. In (b), our zero-shot approach POP-3D surpasses the fully supervised model [huang2023tri] on occupancy prediction (IoU) while reaching $78\%$ of its performance on semantic occupancy segmentation (mIoU). Finally, in (c), we present results of open-vocabulary language-driven retrieval on our newly composed dataset, where we compare our approach to the MaskCLIP+ baseline. We measure mAP on manually annotated LiDAR 3D points in the scene. Our POP-3D outperforms the MaskCLIP+ approach on this task by $3.5$ mAP points.
Figure 5: Qualitative results of zero-shot semantic 3D occupancy prediction on the 16 classes in the nuScenes [caesar2020nuscenes] validation split. Note that our method localizes and segments objects in 3D quite accurately, including road (magenta), vegetation (dark green), cars (blue), and buildings (gray), from only input 2D images and in a zero-shot manner, i.e., only by providing natural language prompts for the target classes. Visualizations are shown on an interpolated $300 \times 300 \times 24$ voxel grid.
...and 7 more figures
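The retrieval results in Figure 4(c) are reported as mAP over annotated LiDAR points. As a rough sketch of how such a score can be computed, the code below ranks points by their similarity to a query and averages the precision at each rank where a relevant point appears, then averages over queries. This is one common definition of average precision. The exact variant used in the paper is not stated in the text above, and the function names and toy scores are assumptions.

```python
def average_precision(scores, relevant):
    """Average precision for one query.

    scores:   similarity of each point to the query (higher ranks first)
    relevant: booleans, True where the point is annotated as matching the query
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of per-query average precision; queries is a list of (scores, relevant) pairs."""
    aps = [average_precision(scores, relevant) for scores, relevant in queries]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    toy_queries = [
        ([0.9, 0.8, 0.7, 0.1], [True, False, True, False]),
        ([0.2, 0.9], [True, False]),
    ]
    print(f"mAP: {mean_average_precision(toy_queries):.4f}")
```

In the first toy query the relevant points land at ranks 1 and 3, giving precisions of 1 and 2/3. In the second, the only relevant point lands at rank 2, giving 1/2.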
