POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Antonin Vobecky, Oriane Siméoni, David Hurych, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic
TL;DR
POP-3D tackles open-vocabulary 3D occupancy prediction from images using a tri-modal self-supervised framework that leverages images, LiDAR, and a pre-trained image-language model. It introduces a two-head architecture: a class-agnostic occupancy head and a 3D-language head that produces text-aligned voxel embeddings. The two heads are trained with the losses $L_{occ}$ and $L_{ft}$, respectively, combined as $L = L_{occ} + \lambda L_{ft}$. The model performs zero-shot 3D semantic segmentation and language-driven 3D grounding, and it requires neither manual 3D language annotations for training nor LiDAR at test time. It is evaluated on nuScenes, including a new retrieval benchmark, where it is competitive with fully supervised baselines and improves over MaskCLIP+. This work enables scalable 3D scene understanding using natural language queries for autonomous systems.
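The combined objective $L = L_{occ} + \lambda L_{ft}$ can be illustrated with a minimal NumPy sketch. The summary above does not specify the form of either term, so the sketch assumes a binary cross-entropy for $L_{occ}$ and a cosine distance for $L_{ft}$, applied only to voxels that have a target feature. The function names, the `mask` argument, and the default `lam=1.0` are hypothetical and chosen for illustration only.

```python
import numpy as np

def occupancy_loss(occ_logits, occ_target):
    """Assumed form of L_occ: binary cross-entropy on per-voxel logits."""
    # Numerically stable BCE-with-logits: max(x, 0) - x*z + log(1 + exp(-|x|))
    x, z = occ_logits, occ_target
    loss = np.maximum(x, 0.0) - x * z + np.log1p(np.exp(-np.abs(x)))
    return float(loss.mean())

def feature_loss(voxel_feats, target_feats, mask, eps=1e-8):
    """Assumed form of L_ft: mean cosine distance on supervised voxels."""
    if not mask.any():
        return 0.0  # no voxel has a target feature in this sample
    v = voxel_feats[mask]
    t = target_feats[mask]
    v = v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=-1, keepdims=True) + eps)
    return float((1.0 - (v * t).sum(axis=-1)).mean())

def total_loss(occ_logits, occ_target, voxel_feats, target_feats, mask, lam=1.0):
    """L = L_occ + lambda * L_ft, with lam as a hypothetical weight."""
    return occupancy_loss(occ_logits, occ_target) + lam * feature_loss(
        voxel_feats, target_feats, mask
    )
```

In this sketch, `mask` marks the voxels for which a target feature from the image-language model is available; the other voxels contribute only to the occupancy term.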
Abstract
We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images, with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any 3D manual language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets, and 3D grounding and retrieval of free-form language queries using a small dataset that we propose as an extension of nuScenes. The project page is available at https://vobecant.github.io/POP3D.
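To illustrate how a dense voxel map of language-aligned embeddings supports zero-shot segmentation, the following minimal sketch labels each occupied voxel with the class whose text embedding is most similar. It is a sketch under stated assumptions, not the authors' implementation: the cosine-similarity matching, the `occ_threshold` value of 0.5, the `-1` label for free space, and all function names are hypothetical. In practice the text embeddings would come from the text encoder of the pre-trained vision-language model; here they are plain arrays.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_segment(voxel_emb, occ_prob, text_emb, occ_threshold=0.5):
    """Assign each voxel the index of its most similar text prompt.

    voxel_emb: (N, D) embeddings from the 3D-language head
    occ_prob:  (N,)   occupancy probabilities from the occupancy head
    text_emb:  (C, D) one embedding per class prompt or query
    Returns (labels, sim): labels is (N,) with -1 for voxels treated as
    free space (an assumed convention); sim is the (N, C) similarity matrix.
    """
    sim = l2_normalize(voxel_emb) @ l2_normalize(text_emb).T
    labels = sim.argmax(axis=1)
    labels[occ_prob < occ_threshold] = -1  # hypothetical free-space label
    return labels, sim
```

The same similarity matrix can serve retrieval: for a single free-form query, ranking the occupied voxels by their column of `sim` gives the voxels that best match the query.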
