Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images

Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, Josh Susskind

TL;DR

Control3Diff addresses 3D-aware image synthesis from single-view inputs by coupling diffusion models with a 3D GAN prior (EG3D tri-planes). It learns a latent diffusion over tri-planes to enable controllable generation across diverse conditioning signals, including images, edges, segmentation, and text, without requiring 3D ground truth. The framework supports both conditioning and guidance, including joint camera-pose prediction and Langevin corrections, and demonstrates strong results on FFHQ, AFHQ-cat, and ShapeNet. This approach broadens 3D diffusion to single-view scenarios with flexible conditioning, enabling scalable, controllable 3D synthesis.

Abstract

Diffusion models have recently become the de facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulty of acquiring 3D ground-truth data for training. On the other hand, 3D GANs that integrate implicit 3D representations into GANs have shown remarkable 3D-aware generation when trained only on single-view image datasets. Yet 3D GANs do not provide straightforward ways to precisely control image synthesis. To address these challenges, we present Control3Diff, a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis on single-view datasets. Control3Diff explicitly models the underlying latent distribution (optionally conditioned on external inputs), thus enabling direct control during the diffusion process. Moreover, our approach is general and applicable to any type of controlling input, allowing us to train it with the same diffusion objective without any auxiliary supervision. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various conditioning inputs such as images, sketches, and text prompts. Please see the project website (https://jiataogu.me/control3diff) for video comparisons.
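
As a rough illustration of what "the same diffusion objective" over a latent (optionally conditioned on an external input) can look like, the sketch below noises a tri-plane-shaped tensor and scores a denoiser with a standard DDPM noise-prediction loss. Everything here is an assumption made for illustration and is not taken from the paper: the linear noise schedule, the noise-prediction parameterization, the tensor shape (3, C, H, W), and the names linear_alpha_bar, noise_triplane, diffusion_loss, and zero_denoiser are all hypothetical.

```python
import numpy as np


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for an (assumed) linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def noise_triplane(z0, t, alpha_bar, rng):
    """Forward process: z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t]
    zt = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return zt, eps


def diffusion_loss(denoiser, z0, cond, t, alpha_bar, rng):
    """Noise-prediction MSE; cond=None stands for the unconditional case."""
    zt, eps = noise_triplane(z0, t, alpha_bar, rng)
    eps_hat = denoiser(zt, t, cond)
    return float(np.mean((eps_hat - eps) ** 2))


def zero_denoiser(zt, t, cond):
    """Placeholder network (hypothetical): always predicts zero noise."""
    return np.zeros_like(zt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = linear_alpha_bar(1000)
    # Hypothetical tri-plane latent: 3 planes, 8 channels, 16x16 resolution.
    z0 = rng.standard_normal((3, 8, 16, 16))
    # Hypothetical conditioning input, e.g. a low-resolution image.
    cond = rng.standard_normal((3, 16, 16))
    print(diffusion_loss(zero_denoiser, z0, cond, 500, alpha_bar, rng))
```

The only difference between the conditional and unconditional cases in this sketch is whether cond is passed to the denoiser; the loss itself is unchanged, which is the property the abstract highlights.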

Paper Structure

This paper contains 64 sections, 8 equations, 17 figures, and 4 tables.

Figures (17)

  • Figure 1: Left: the generation process, in which a diffusion model samples a tri-plane that can then be used for image rendering. Right: examples of controllable generation given various conditioning inputs, showing frontal and side views generated by Control3Diff. Except for the input in (a), all faces shown are model-generated and do not depict real identities, owing to concerns about individual consent.
  • Figure 2: Pipeline of Control3Diff. (a) 3D GAN training; (b) a diffusion model is trained on the extracted tri-planes, with or without input conditioning; (c) controllable 3D generation with the learned diffusion model, optionally with guidance. The tri-plane features are shown as three color planes, and the camera poses are omitted for visual clarity.
  • Figure 3: Comparison for 3D inversion of in-the-wild images. We compare the proposed approach against (i) direct prediction of the GAN latent $\mathcal{W}$ and of the tri-plane with a learned encoder, and (ii) optimization-based inference of the latent $\mathcal{W}$, the expanded latent $\mathcal{W}+$, and the tri-plane, following Image2StyleGAN++ (Abdal et al., 2020). Our method achieves better view consistency and higher output image quality than the baselines.
  • Figure 4: Comparison on the SR+inversion task. By learning a proper prior with diffusion models, Control3Diff is able to synthesize realistic and faithful cat faces from low-resolution inputs, while optimization-based approaches fail completely due to the lack of a proper 3D prior.
  • Figure 5: Comparison on Seg-to-3D generation. All faces are model-generated and do not depict real identities. Our proposed method generates images that align better with the segmentation map and show greater 3D consistency.
  • ...and 12 more figures