
Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding

Pengkun Liu, Yikai Wang, Fuchun Sun, Jiafang Li, Hang Xiao, Hongxiang Xue, Xinzhou Wang

TL;DR

This paper tackles single-image to 3D generation by removing hard L2 supervision and conditioning on a single CLIP embedding. It introduces Isotropic3D, a two-stage diffusion fine-tuning framework that first converts a text-to-3D model into an image-conditioned model and then applies Explicit Multi-view Attention to fuse noisy multi-view renders with a clean reference as an explicit condition, while discarding the reference image during inference. NeRF optimization with Score Distillation Sampling and an orientation loss yields multi-view-consistent geometry and rich texture, achieving improvements in both novel view synthesis and 3D quality over prior SDS-based methods. The approach demonstrates strong geometry, texture, and view-consistency advantages, with practical impact for efficient, reference-free 3D content generation from a single CLIP embedding.

Abstract

Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation by leveraging Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models, which usually take the reference image as a condition, with hard L2 image supervision at the reference view. Yet adhering too heavily to the image is prone to corrupting the inductive knowledge of the 2D diffusion model, frequently leading to flat or distorted 3D generation. In this work, we reexamine image-to-3D from a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by resting solely on the SDS loss. The core of our framework lies in a two-stage diffusion model fine-tuning. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, by which the model preliminarily acquires image-to-image capabilities. Second, we perform fine-tuning using our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. The CLIP embedding is sent to the diffusion model throughout the whole process, while reference images are discarded after fine-tuning. As a result, with a single image CLIP embedding, Isotropic3D is capable of generating mutually consistent multi-view images and also a 3D model with more symmetrical and neat content, well-proportioned geometry, richly colored texture, and less distortion compared with existing image-to-3D methods, while still preserving the similarity to the reference image to a large extent. The project page is available at https://isotropic3d.github.io/. The code and models are available at https://github.com/pkunliu/Isotropic3D.
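
For orientation, the SDS loss that the abstract relies on is commonly written in the generic form introduced by DreamFusion. The block below is that generic form, not necessarily the paper's own equation, whose conditioning and weighting may differ. The symbols are assumptions for illustration: $\theta$ denotes the NeRF parameters, $x = g(\theta)$ a rendered view, $x_t = \alpha_t x + \sigma_t \epsilon$ its noised version at timestep $t$, $\epsilon_\phi$ the diffusion model's noise prediction, $y$ the conditioning signal (here presumably the CLIP image embedding together with the camera pose), and $w(t)$ a timestep-dependent weight.

```latex
\nabla_\theta \mathcal{L}_{SDS}\bigl(\phi, x = g(\theta)\bigr)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x_t = \alpha_t x + \sigma_t \epsilon,\quad \epsilon \sim \mathcal{N}(0, I)
```

Because no per-pixel L2 term against the reference view appears in this gradient, every rendered azimuth is treated the same way, which matches the "isotropic" reading in the abstract.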

Paper Structure (19 sections, 10 equations, 17 figures, 1 table)


Figures (17)

  • Figure 1: Isotropic3D is a novel framework to generate multiview-consistent and high-quality 3D content from a single CLIP embedding of the reference image. Our method is proficient in generating multi-view images that maintain mutual consistency, as well as producing a 3D model characterized by symmetrical and neat content, regular geometry, rich colored texture, and less distortion, all while preserving similarity.
  • Figure 2: The pipeline of Isotropic3D. Neural Radiance Field (NeRF) utilizes volume rendering to extract four orthogonal views, which are subsequently augmented with random Gaussian noise. These views, along with noise-free reference images, are then passed to a multi-view diffusion model to predict the added noise. Note that we set the timestep $t$ to zero at the position corresponding to the noise-free reference images. The framework, which generates consistent multi-view images from only a single CLIP embedding, can be aligned with the input view while retaining the consistency of the output target views. Finally, NeRF yields high-quality 3D content optimized by rendered images via Score Distillation Sampling (SDS). For $\mathcal{L}_{SDS}$, see the SDS equation (eq:sds).
  • Figure 3: View-Conditioned Multi-view Diffusion pipeline. Our training process is divided into two stages. In the first stage (Stage1), we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, by which the model preliminarily acquires image-to-image capabilities. Stage1-a and Stage1-b are the single-view diffusion branch and the multi-view diffusion branch for the first stage, respectively. In the second stage (Stage2), we fine-tune the multi-view diffusion model with Explicit Multi-view Attention (EMA) integrated. EMA combines noisy multi-view images with the noise-free reference image as an explicit condition. Stage2-a and Stage2-b are the diffusion branches for the second stage. During inference, we only need to send the CLIP embedding of the reference image and the camera pose to generate consistent high-quality images from multiple perspectives.
  • Figure 4: Illustration of the Explicit Multi-view Attention (EMA). "View-Input" is a feature map of the noise-free reference image. "View 1" and "View 1 $\sim$ 4" are feature maps of noisy rendered views. "Alternative" means a 30% chance of using single-view diffusion (Stage2-a) and a 70% chance of training with the multi-view diffusion branch (Stage2-b).
  • Figure 5: Qualitative comparison of synthesizing novel views with baseline models [liu2023zero, liu2023syncdreamer] on GSO [downs2022google] and randomly collected images.
  • ...and 12 more figures
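
The captions for Figure 2 and Figure 4 describe two mechanical details of Stage2 training: the branch is chosen at random (30% single-view, 70% multi-view), and the noise-free reference image is given timestep zero while the noisy views share a sampled timestep $t$. A minimal Python sketch of that bookkeeping follows. The function names, the branch labels, and the choice to place the reference slot first are assumptions made for illustration and are not taken from the authors' code.

```python
import random

# Values stated in the figure captions.
SINGLE_VIEW_PROB = 0.3    # Figure 4: 30% chance of the single-view branch (Stage2-a)
REFERENCE_TIMESTEP = 0    # Figure 2: t is set to zero for the noise-free reference


def choose_branch(rng: random.Random) -> str:
    """Pick the Stage2 branch for one training step (hypothetical helper)."""
    return "stage2-a" if rng.random() < SINGLE_VIEW_PROB else "stage2-b"


def build_timesteps(branch: str, t: int) -> list[int]:
    """Return per-image timesteps for one sample.

    Assumed layout: the reference image comes first with timestep 0,
    followed by one noisy view (Stage2-a) or four noisy views (Stage2-b),
    all sharing the sampled timestep t.
    """
    num_noisy_views = 1 if branch == "stage2-a" else 4
    return [REFERENCE_TIMESTEP] + [t] * num_noisy_views


if __name__ == "__main__":
    rng = random.Random(0)
    branch = choose_branch(rng)
    print(branch, build_timesteps(branch, t=500))
```

Giving the reference slot a timestep of zero tells the denoiser that this input carries no noise, so it can act as an explicit condition during fine-tuning. At inference the reference slot is dropped and only the CLIP embedding and camera pose are supplied, as the Figure 3 caption notes.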