ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images

Xiaoshuai Zhang, Zhicheng Wang, Howard Zhou, Soham Ghosh, Danushen Gnanapragasam, Varun Jampani, Hao Su, Leonidas Guibas

TL;DR

ConDense is a 3D pre-training framework that leverages existing pre-trained 2D networks and large-scale multi-view datasets. The pre-trained model provides a good initialization for various 3D tasks, including 3D classification and segmentation, and outperforms other 3D pre-training methods by a significant margin.

Abstract

To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline, where 2D-3D feature consistency is enforced through a NeRF-like volume rendering (ray marching) process. Using dense per-pixel features, we are able to 1) directly distill the learned priors from 2D models to 3D models and create useful 3D backbones, 2) extract more consistent and less noisy 2D features, and 3) formulate a consistent embedding space where 2D, 3D, and other modalities of data (e.g., natural language prompts) can be jointly queried. Furthermore, besides dense features, ConDense can be trained to extract sparse features (e.g., key points), also with 2D-3D consistency -- condensing 3D NeRF representations into compact sets of decorated key points. We demonstrate that our pre-trained model provides good initialization for various 3D tasks including 3D classification and segmentation, outperforming other 3D pre-training methods by a significant margin. By exploiting our sparse features, it also enables additional useful downstream tasks, such as matching 2D images to 3D scenes, detecting duplicate 3D scenes, and querying a repository of 3D scenes through natural language -- all quite efficiently and without any per-scene fine-tuning.

Paper Structure (40 sections, 10 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: ConDense extracts co-embedded features for 2D or 3D inputs. The model not only improves performance over previous pre-training methods but also enables efficient cross-modality, cross-scale queries such as 3D retrieval and duplicate detection.
  • Figure 2: Dense feature encoding: the 3D encoding module $\mathcal{G}_{\texttt{3D}}$ is composed of a swappable input processing head $\mathcal{J}_\texttt{3D}$ and a common 3D reasoning backbone $\mathcal{H}_{\texttt{3D}}$. $\mathcal{J}_\texttt{3D}$ maps input 3D scenes of various formats into a feature $\mathbf{J}^s$ in a unified 3D embedding space. $\mathcal{H}_{\texttt{3D}}$ turns $\mathbf{J}^s$ into a 3D feature grid $\mathbf{F}^s$. Through interpolation on $\mathbf{F}^s$ and volume rendering, a 3D-projected feature map $\mathbf{F}_{\texttt{3D}}$ can be obtained and compared with a 2D dense feature map $\mathbf{F}_{\texttt{2D}}$, extracted from the 2D encoding module $\mathcal{G}_{\texttt{2D}}$. The resulting 2D-3D consensus loss $\mathcal{L}_\texttt{2D3D}$ is used as a self-supervision signal. An additional 2D fidelity loss $\mathcal{L}_\texttt{fid}$ is introduced to make sure that the 2D-3D consensus optimized 2D feature $\mathbf{F}_{\texttt{2D}}$ does not deviate too much from the original 2D feature in order to retain some of its semantics and visual richness.
  • Figure 3: Key point prediction: key points are detected in both 2D ($\mathbf{P}_\texttt{2D}$) and 3D ($\mathbf{P}_\texttt{3D}$) based on the existing feature backbones. The 2D-3D key point loss $\mathcal{L}_\texttt{p}$ is used as a self-supervision signal.
  • Figure 4: Visualization of using different types of input to query the target scene repository (ScanNet). Within each pair are query inputs (left) and top-1 query results (right).
  • Figure 5: Visualization of our 2D dense features, which are more consistent across multi-view images than the original DINOv2 features. We also visualize the sparse feature locations identified by our key point detector.
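
The dense-feature pipeline described in the Figure 2 caption (volume rendering of 3D features, a 2D-3D consensus loss, and a 2D fidelity loss) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes standard NeRF-style alpha compositing along a single ray, a mean squared error as the distance for both losses, and a hypothetical `fidelity_weight`. This summary specifies none of these choices, and the function names are illustrative only.

```python
import math


def render_feature_along_ray(densities, features, deltas):
    """Alpha-composite per-sample feature vectors along one ray.

    densities: density at each sample (assumed to be available per sample)
    features:  feature vector at each sample (e.g. interpolated from F^s)
    deltas:    distance between consecutive samples
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, feat, delta in zip(densities, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return rendered


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(f_3d, f_2d, f_2d_original, fidelity_weight=0.1):
    """Consensus loss plus weighted fidelity loss (assumed MSE form).

    f_3d:          3D-projected feature for a pixel (stands in for F_3D)
    f_2d:          2D feature being optimized for the same pixel (F_2D)
    f_2d_original: feature from the frozen original 2D network
    """
    consensus = mean_squared_error(f_3d, f_2d)          # L_2D3D
    fidelity = mean_squared_error(f_2d, f_2d_original)  # L_fid
    return consensus + fidelity_weight * fidelity


# Usage: one ray with three samples and 2-dimensional features.
f_3d = render_feature_along_ray(
    densities=[0.0, 0.5, 2.0],
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    deltas=[0.1, 0.1, 0.1],
)
loss = total_loss(f_3d, f_2d=[0.2, 0.3], f_2d_original=[0.25, 0.25])
```

In this sketch the fidelity term only pulls the optimized 2D feature back toward the original one, matching the caption's stated purpose of retaining the original feature's semantics. How the two terms are balanced in ConDense is not given here.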