ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner

TL;DR

ViewDiff is a 3D-consistent image generation framework that repurposes a pretrained 2D text-to-image diffusion model as a 3D-aware prior. It augments the U-Net with cross-frame-attention layers and a projection layer that encodes 3D structure, so that multiple views are denoised jointly and NeRF-like volume rendering happens inside the network's forward pass. An autoregressive scheme renders additional views from novel viewpoints. The model is trained on real multi-view datasets and produces 3D-consistent objects in authentic backgrounds. Compared to existing methods, the results are photorealistic and diverse, with lower FID (-30%) and KID (-37%), and they are suitable for real-world object rendering and downstream 3D reconstruction.

Abstract

3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).

Paper Structure (47 sections, 5 equations, 19 figures, 4 tables)

Figures (19)

  • Figure 1: Multi-view consistent image generation. Our method takes as input a text description, or any number of posed input images, and generates high-quality, multi-view consistent images of a real-world 3D object in authentic surroundings from any desired camera poses.
  • Figure 2: Method Overview. We augment the U-Net architecture of pretrained text-to-image models with new layers in every U-Net block. These layers facilitate communication between multi-view images in a batch, resulting in a denoising process that jointly produces 3D-consistent images. First, we replace self-attention with cross-frame-attention (yellow) which compares the spatial features of all views. We condition all attention layers on pose ($RT$), intrinsics ($K$), and intensity ($I$) of each image. Second, we add a projection layer (green) into the inner blocks of the U-Net. It creates a 3D representation from multi-view features and renders them into 3D-consistent features. We fine-tune the U-Net using the diffusion denoising objective (Eq. \ref{eq:ddpm-eps-loss}) at timestep $t$, supervised from captioned multi-view images.
  • Figure 3: Architecture of the projection layer. We produce 3D-consistent output features from posed input features. First, we unproject the compressed image features into 3D and aggregate them into a joint voxel grid with an MLP. Then we refine the voxel grid with a 3D CNN. A volume renderer similar to NeRF [mildenhall2021nerf] renders 3D-consistent features from the grid. Finally, we apply a learned scale function and expand the feature dimension.
  • Figure 4: Unconditional image generation of our method and baselines. We show renderings from different viewpoints for multiple objects and categories. Our method produces consistent objects and backgrounds. Our textures are sharper in comparison to baselines. Please see the supplemental material for more examples and animations.
  • Figure 5: Multi-view consistency of unconditional image generation. HoloFusion (HF) [karnewar2023holofusion] has view-dependent floating artifacts (the base in first row). ViewsetDiffusion (VD) [szymanowicz23viewset_diffusion] has blurrier renderings (second row). Without the projection layer, our method has no precise control over viewpoints (third row). Without cross-frame-attention, our method suffers from identity changes of the object (fourth row). Our full method produces detailed images that are 3D-consistent (fifth row).
  • ...and 14 more figures
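
The captions for Figures 2 and 3 describe two mechanisms that a small example may make more concrete: cross-frame attention, in which every view attends to the spatial features of all views, and NeRF-style volume rendering of features along a ray. The NumPy sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names (`cross_frame_attention`, `composite_along_ray`), the single-head formulation, and the plain projection matrices `w_q`, `w_k`, `w_v` are hypothetical. The conditioning on pose, intrinsics, and intensity is omitted, as are the voxel grid, MLP, and 3D CNN of the projection layer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(feats, w_q, w_k, w_v):
    """Single-head attention across views (hypothetical helper).

    feats: (N, T, C) array with N views and T spatial tokens per view.
    Queries stay per view; keys and values are pooled over all N views,
    so each token can compare itself with every token of every view.
    """
    n, t, _ = feats.shape
    q = feats @ w_q                              # (N, T, D)
    k = (feats @ w_k).reshape(1, n * t, -1)      # (1, N*T, D)
    v = (feats @ w_v).reshape(1, n * t, -1)      # (1, N*T, D)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (N, T, N*T)
    return softmax(scores, axis=-1) @ v              # (N, T, D)


def composite_along_ray(sigma, feats, deltas):
    """NeRF-style alpha compositing of per-sample features on one ray.

    sigma:  (S,) non-negative densities at S samples along the ray.
    feats:  (S, C) feature vectors at those samples.
    deltas: (S,) distances between consecutive samples.
    Returns the rendered (C,) feature and the (S,) compositing weights.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return weights @ feats, weights
```

Pooling keys and values over all views is what distinguishes this from per-image self-attention: with a single view the function reduces to ordinary self-attention, which is one way to read the caption's statement that self-attention is replaced rather than supplemented.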