Table of Contents
Fetching ...

Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D

Mukund Varma T, Peihao Wang, Zhiwen Fan, Zhangyang Wang, Hao Su, Ravi Ramamoorthi

TL;DR

The question of whether any 2D vision model can be lifted to make 3D consistent predictions is asked and answered in the affirmative; the new Lift3D method trains to predict unseen views onfeature spaces generated by afew visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks.

Abstract

In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.

Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D

TL;DR

The question of whether any 2D vision model can be lifted to make 3D consistent predictions is asked and answered in the affirmative; the new Lift3D method trains to predict unseen views onfeature spaces generated by afew visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks.

Abstract

In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.
Paper Structure (33 sections, 4 equations, 9 figures, 5 tables)

This paper contains 33 sections, 4 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Imagine we are using a 2D vision operator, such as semantic segmentation or scene editing, on multiple-view input images. This often leads to inconsistent predictions across different views (as shown in the middle column). To address this, we introduce Lift3D, a framework designed to transform these inconsistent 2D outputs into view-consistent 3D predictions (illustrated in the right column). Our approach is both scene and operator-agnostic, meaning it can adapt to any downstream task or scene without additional adjustments. We demonstrate how Lift3D effectively resolves inconsistencies in multi-view predictions across open vocabulary segmentation and text-driven scene editing. Notice the color discrepancies in the same rightmost chair across two views (varying from reddish to greenish) in the 2D results at the bottom row, and the inconsistencies in facial and hair color. For a clearer comparison between the 2D and 3D outcomes, we recommend zooming into the electronic version of this image.
  • Figure 2: Overview of Lift3D: 1) Given multi-view images of a scene, we first extract an intermediate feature map using an encoder for each view independently, 2) Using the source view features, we estimate the target view feature map via an extended generalizable novel view synthesis pipeline that learns to correct feature space inconsistency and fuse information according to 3D geometry priors, and 3) Directly use the decoder from the pre-trained visual model to process the estimated feature map and synthesize the final prediction for downstream tasks.
  • Figure 3: (a) Target View, (b) We visualize the PCA (3-channel) map of the interpolated feature maps from our baseline that naively shares blending and ray marching weights between color rendering and 2D encoder features. (c) A similar PCA map is estimated by directly encoding the target view with the 2D encoder. We can see that naively sharing weights without accounting for the inconsistencies in the source view features results in noisy feature estimation, undesirable for several fine-grained downstream tasks.
  • Figure 4: Qualitative results for semantic segmentation using user-provided stroke input. In the chess table scene (row 1), our method can derive fine-grained semantic masks, especially around the stem of the table. In the shoe rack scene (row 2), our method can successfully discover the shoe laces and shoe sole better than other methods.
  • Figure 5: Qualitative results for 3D scene style transfer. Our method achieves better retention of the original geometry (row 1) and is more coherent to the style image (row 2).
  • ...and 4 more figures