VEON: Vocabulary-Enhanced Occupancy Prediction

Jilai Zheng, Pin Tang, Zhongdao Wang, Guoqing Wang, Xiangxuan Ren, Bailan Feng, Chao Ma

TL;DR

VEON tackles open-vocabulary 3D occupancy prediction by reusing two powerful 2D foundation models, MiDaS for depth and CLIP for semantics, and bridging them into the 3D voxel space through a depth adaptor and a High-resolution Side Adaptor. The method proceeds in two stages: (1) depth pretraining to obtain metric-bin depth, and (2) 3D occupancy prediction with lifted CLIP features, pseudo supervision from SAN, and a tail-aware loss. VEON reaches 15.14 mIoU on Occ3D-nuScenes with about 46M trainable parameters and no manual semantic labels, and it retrieves open-vocabulary categories, which indicates effective 2D-to-3D knowledge transfer. The work highlights the value of leveraging large 2D foundation priors for label-efficient, open-world 3D perception in autonomous driving, while noting limitations from fixed foundation models and suggesting future improvements with more advanced vision-language models.
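The lifting step mentioned above can be illustrated with a minimal sketch of the general LSS idea: a pixel's 2D feature is spread along its camera ray, weighted by a probability distribution over depth bins. This is not the VEON implementation; the function names `softmax` and `lift_pixel`, the toy bin count, and the feature size are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(depth_logits, feature):
    """Lift one pixel's feature along its camera ray (hypothetical helper).

    depth_logits: D scores, one per depth bin (assumed output of a depth head).
    feature: C-dimensional 2D semantic feature for the same pixel.
    Returns a D x C nested list: the feature scaled by each bin's probability.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]

# Toy example: 3 depth bins, 2 feature channels; the middle bin is most likely.
frustum = lift_pixel([0.0, 2.0, 0.0], [1.0, -1.0])
```

Because the bin probabilities sum to one, summing the lifted features over all bins recovers the original 2D feature; most of its mass lands in the bin the depth head considers most likely.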

Abstract

Perceiving the world as 3D occupancy helps embodied agents avoid collisions with any type of obstacle. While open-vocabulary image understanding has prospered recently, how to bind predicted 3D occupancy grids to open-world semantics remains under-explored because open-world annotations are limited. Hence, instead of building our model from scratch, we blend 2D foundation models, specifically the depth model MiDaS and the semantic model CLIP, to lift semantics into 3D space and thus obtain 3D occupancy. However, building upon these foundation models is not trivial. First, MiDaS faces the depth ambiguity problem, i.e., it produces only relative depth and cannot estimate the bin depth needed for feature lifting. Second, CLIP image features lack high-resolution pixel-level information, which limits 3D occupancy accuracy. Third, open-vocabulary recognition often suffers from the long-tail problem. To address these issues, we propose VEON, for Vocabulary-Enhanced Occupancy predictioN, which not only assembles but also adapts these foundation models. We first equip MiDaS with a ZoeDepth head and low-rank adaptation (LoRA) for relative-metric-bin depth transformation while preserving the beneficial depth prior. Then, a lightweight side adaptor network is attached to the CLIP vision encoder to generate high-resolution features for fine-grained 3D occupancy prediction. Moreover, we design a class reweighting strategy to give priority to the tail classes. With only 46M trainable parameters and zero manual semantic labels, VEON achieves 15.14 mIoU on Occ3D-nuScenes and can recognize objects from open-vocabulary categories, showing that VEON is label-efficient, parameter-efficient, and sufficiently precise.
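To make the low-rank adaptation mentioned in the abstract concrete, the following sketch shows the generic LoRA forward pass: a frozen weight matrix W is augmented by a trainable low-rank product, giving y = W x + (alpha / r) · B (A x). It illustrates the general technique only; which MiDaS layers VEON adapts, and with what rank or scaling, is not stated here, so the shapes, the `alpha` value, and the helper names `matvec` and `lora_forward` are assumptions.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """Generic LoRA forward pass (hypothetical helper, not VEON's code).

    W: frozen weight, shape out x in.
    A: trainable down-projection, shape r x in.
    B: trainable up-projection, shape out x r.
    """
    r = len(A)                        # rank of the update
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen identity weight (toy)
A = [[1.0, 1.0]]                      # rank r = 1
x = [2.0, 3.0]

# With B initialised to zero, the adapted layer equals the frozen layer.
y_init = lora_forward(W, A, [[0.0], [0.0]], x)
# After B has moved away from zero, only the low-rank term changes the output.
y_tuned = lora_forward(W, A, [[0.5], [0.0]], x)
```

Initialising B to zero means training starts exactly from the pretrained model's behaviour, which is how a low-rank update can adapt a model to a new domain while leaving its prior intact at the outset.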

Paper Structure

This paper contains 24 sections, 8 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Main idea of our VEON. Left: Drawing on the strong data priors in 2D foundation models, we unleash their power for 3D open-vocabulary tasks. Right: Compared with the conventional practice of training a unified 2D backbone from scratch, we design a decoupled pipeline that assembles and adapts a depth model, MiDaS [tpami20-midas], and a semantic model, CLIP [icml21-clip], for 3D open-vocabulary occupancy.
  • Figure 2: Framework overview. Our VEON consists of two training stages: depth pretraining and occupancy prediction. Left: In stage $1$, we adapt the MiDaS [tpami20-midas] backbone with a relative-metric-bin depth transformation adaptor to estimate the bin depth for LSS feature lifting [eccv20-lss]. Low-rank adaptation (LoRA) [iclr21-lora] is integrated for enhanced domain transfer. Right: In stage $2$, we unleash the power of CLIP [icml21-clip] by equipping it with a High-resolution Side Adaptor (HSA). The refined high-resolution CLIP semantic feature is lifted via LSS and passes through 3D convolutions to produce 3D occupancy. The network retains the capability of recognizing open-vocabulary objects by aligning the 3D representation with the CLIP language embeddings of certain classes, which are determined by the off-the-shelf 2D open-vocabulary segmentor SAN [cvpr23-san].
  • Figure 3: Detailed network architecture of the High-resolution Side Adaptor (HSA). Top: Adaptor architecture. We maintain a series of residual convolution blocks beside the CLIP backbone to extract high-resolution spatial features. The adaptor fuses visual tokens from early CLIP layers and outputs: (1) an attention bias ($\mathbf{A}$) for refining ViT feature extraction, and (2) a supplementary matrix ($\mathbf{S}$) that supplies the missing high-resolution information. Bottom: The attention bias $\mathbf{A}$ modulates the attention of the transformer layers in the ViT, and $\mathbf{S}$ is fused in before the 2D semantic feature $\mathbf{F^{sem}}$ is output for LSS lifting.
  • Figure 4: Visualization of occupancy prediction (VEON-L) on the Occ3D-nuScenes occupancy benchmark [cvpr20-nuscenes, arxiv23-occ3d] (validation set). We visualize the surrounding images (column $1$), ground truth and predicted occupancy (columns $2$-$3$), and the retrieval results of certain open-vocabulary classes (columns $4$-$5$). Our VEON-L demonstrates the capability of recognizing unseen objects (in orange), such as stairs, gravel, and road signs.
  • Figure A: Positions for adding attention bias (blank squares).
  • ...and 1 more figure
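The open-vocabulary recognition and retrieval described in the captions of Figures 2 and 4 rest on comparing 3D features with language embeddings. The sketch below shows one common way to do this, cosine similarity between per-voxel features and per-class text embeddings. It is illustrative only: the 3-dimensional toy vectors stand in for real CLIP embeddings, and the function names `cosine`, `classify_voxel`, and `retrieve`, as well as the threshold value, are assumptions rather than details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def classify_voxel(voxel_feature, text_embeddings):
    """Return the class name whose text embedding is most similar to the voxel."""
    return max(text_embeddings,
               key=lambda name: cosine(voxel_feature, text_embeddings[name]))

def retrieve(voxel_features, query_embedding, threshold=0.5):
    """Return indices of voxels whose similarity to the query meets the threshold."""
    return [i for i, f in enumerate(voxel_features)
            if cosine(f, query_embedding) >= threshold]

# Toy stand-ins for language embeddings of two class names (assumed values).
text_embeddings = {"road": [1.0, 0.0, 0.0], "stairs": [0.0, 1.0, 0.0]}
# Toy stand-ins for lifted 3D voxel features.
voxels = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.0, 1.0]]

labels = [classify_voxel(v, text_embeddings) for v in voxels[:2]]
hits = retrieve(voxels, text_embeddings["stairs"])
```

Since class names enter only through their text embeddings, querying a new category requires a new embedding rather than retraining, which is what makes retrieval of classes such as stairs or road signs possible in principle.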