Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation

Sihao Lin, Daqi Liu, Ruochong Fu, Dongrui Liu, Andy Song, Hongwei Xie, Zhihui Li, Bing Wang, Xiaojun Chang

TL;DR

This work tackles 3D occupancy estimation from monocular images without 3D labels by decoupling 3D supervision into image-level semantics and geometry derived from vision foundation models. It leverages zero-shot semantics via CLIP-based generation and converts relative depth from 2D VFMs to metric depth through a two-stage, self-supervised calibration using novel view synthesis. Across nuScenes and SemanticKITTI, the method achieves strong occupancy performance, often surpassing zero-shot metric-depth baselines and approaching supervised methods, particularly in BEV/TPV representations. The results demonstrate a scalable, label-free pathway for vision-centric 3D understanding, with broad implications for reducing annotation burdens and enabling robust 3D perception from 2D data.

Abstract

Estimating the 3D world from 2D monocular images is a fundamental yet challenging task due to the labour-intensive nature of 3D annotations. To simplify label acquisition, this work proposes a novel approach that bridges 2D vision foundation models (VFMs) with 3D tasks by decoupling 3D supervision into an ensemble of image-level primitives, e.g., semantic and geometric components. As a key motivator, we leverage the zero-shot capabilities of vision-language models for image semantics. However, monocular depth estimation is notoriously ill-posed: multiple distinct 3D scenes can produce identical 2D projections, so directly inferring metric depth from a monocular image in a zero-shot manner is unsuitable. In contrast, 2D VFMs provide promising sources of relative depth, which theoretically aligns with metric depth when properly scaled and offset. Thus, we adapt the relative depth derived from VFMs into metric depth by optimising the scale and offset using temporal consistency, also known as novel view synthesis, without access to ground-truth metric depth. We then project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision. Extensive experiments on nuScenes and SemanticKITTI demonstrate the effectiveness of our framework. For instance, the proposed method surpasses the current state-of-the-art by 3.34% mIoU on nuScenes for voxel occupancy prediction.
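
The projection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the relative-to-metric mapping is a plain affine map (scale times relative depth plus offset), a pinhole camera with intrinsics K, and a hypothetical axis-aligned voxel grid. The function names to_metric_depth, backproject and voxelize_labels are illustrative only.

```python
import numpy as np

def to_metric_depth(rel_depth, scale, offset):
    """Assumed affine calibration: metric = scale * relative + offset."""
    return scale * rel_depth + offset

def backproject(depth, K):
    """Lift every pixel to a 3D point in the camera frame (pinhole model).

    Points are returned in row-major pixel order, shape (H*W, 3).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # K^-1 [u, v, 1]^T for each pixel
    return rays * depth.reshape(-1, 1)       # scale each ray by its depth

def voxelize_labels(points, labels, origin, voxel_size, grid_shape, empty=-1):
    """Write per-pixel semantic labels into a voxel grid (hypothetical layout).

    Voxels that receive no point keep the value `empty`; if several points
    fall into one voxel, the last one written wins.
    """
    idx = np.floor((points - np.asarray(origin)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.full(grid_shape, empty, dtype=int)
    grid[tuple(idx[inside].T)] = labels[inside]
    return grid
```

In this sketch the labels array is expected in the same row-major pixel order that backproject uses, e.g. a per-pixel semantic map flattened with reshape(-1). The resulting grid plays the role of the pseudo 3D supervision; how the paper resolves conflicts within a voxel or handles free space along each ray is not stated in this section.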

Paper Structure

This paper contains 24 sections, 10 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Adapting relative depth into metric depth. The existing VFM [yang2024depth] delivers promising relative depth but is less capable of predicting metric depth due to ill-posedness. Without access to ground truth, our method leverages novel view synthesis (Sec. \ref{sec:scale}) to calibrate the relative depth (zoom out if necessary) into metric depth, which aligns well with the ground-truth depth.
  • Figure 2: Decoupling 3D signal as image primitives.
  • Figure 3: Scheme of the proposed method. We propose decoupling the 3D signals into image primitives, allowing the connection between 2D VFMs and 3D tasks. Given label-free images, their semantic and geometric information is derived from 2D VFMs via self-supervised adaptation (Sec. \ref{sec:decouple}). The ensemble of the image primitives serves as the alternative to 3D supervision.
  • Figure 4: Illustration of novel view synthesis. We aim to reconstruct the target view from the source view by identifying the pixel correspondence between the two views (e.g., the tree indicated by the green mask) via Eq. \ref{eq:ttos}. Consequently, we can optimise the depth scale by minimising the photometric loss in Eq. \ref{eq:totalloss}.
  • Figure 5: Comparison to fine-tuning scheme. We fine-tune DepthAnything [yang2024depth] on in-domain (nuScenes [caesar2020nuscenes]) and out-of-domain (SemanticKITTI [behley2019semantickitti]) depth sources to obtain metric depth. To evaluate the quality of the resulting metric depth, we use it to train an occupancy network on nuScenes. Our zero-shot method can surpass them by a clear margin.
  • ...and 2 more figures
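
The novel view synthesis idea in the Figure 4 caption (find pixel correspondences between target and source views, then pick the depth scale that minimises a photometric loss) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's equations: it uses a pinhole camera, a known relative pose T_t2s (4x4, target frame to source frame), nearest-neighbour sampling, a plain L1 loss, and a grid search over the scale only (no offset) in place of whatever optimiser the authors use. All function names are hypothetical.

```python
import numpy as np

def reproject(depth_t, K, T_t2s):
    """Map each target pixel to source-view pixel coordinates.

    Returns (uv, in_front): uv has shape (H, W, 2); in_front flags points
    with positive depth in the source camera.
    """
    h, w = depth_t.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    pts_t = (pix @ np.linalg.inv(K).T) * depth_t.reshape(-1, 1)   # target-frame points
    pts_s = pts_t @ T_t2s[:3, :3].T + T_t2s[:3, 3]                # move to source frame
    proj = pts_s @ K.T                                            # project with K
    z = proj[:, 2:3]
    uv = proj[:, :2] / np.clip(z, 1e-6, None)
    return uv.reshape(h, w, 2), (z > 1e-6).reshape(h, w)

def photometric_loss(img_t, img_s, depth_t, K, T_t2s):
    """L1 error between the target image and the view synthesised from the source."""
    h, w = img_t.shape[:2]
    uv, in_front = reproject(depth_t, K, T_t2s)
    us = np.rint(uv[..., 0]).astype(int)      # nearest-neighbour sampling (simplification)
    vs = np.rint(uv[..., 1]).astype(int)
    valid = in_front & (us >= 0) & (us < w) & (vs >= 0) & (vs < h)
    if not valid.any():
        return float("nan")
    synth = img_s[vs[valid], us[valid]]
    return float(np.abs(synth - img_t[valid]).mean())

def fit_scale(img_t, img_s, rel_depth, K, T_t2s, candidates):
    """Grid search stand-in: return the candidate scale with the lowest loss."""
    losses = [photometric_loss(img_t, img_s, s * rel_depth, K, T_t2s) for s in candidates]
    return candidates[int(np.nanargmin(losses))]
```

The key property the sketch relies on is that only the correct metric depth makes reprojected source pixels line up with the target image for a known camera motion, so the photometric error acts as a supervision signal for the scale without any ground-truth depth. A practical version would typically use differentiable bilinear sampling and gradient-based optimisation rather than rounding and a grid search.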