Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, Fitsum Reda, Jiaming Song, Ting-Chun Wang, Fangyin Wei, Xiaohui Zeng, Yu Zeng, Qinsheng Zhang

TL;DR

Edify Image presents a novel pixel-space diffusion framework based on multi-scale Laplacian decomposition to enable high-fidelity, controllable image generation at 1K and 4K resolutions. By introducing a dimension-varying diffusion process and a two-stage cascaded architecture (256-base and 1K-upsampler), the method delivers photorealistic outputs with long prompts, diverse aspect ratios, and camera controls. The work further extends capabilities to 4K upsampling, ControlNet-augmented conditioning, 360° HDR panorama generation, and lightweight finetuning for personalization, achieving compatibility with pre-trained ControlNets and demonstrating fairness and style-transfer across subjects and styles. Collectively, Edify Image offers scalable, controllable image synthesis across multiple applications, including panoramic HDR, customization, and high-resolution upsampling, with practical implications for content creation and synthetic data generation.

Abstract

We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360° HDR panorama generation, and finetuning for image customization.

Paper Structure

This paper contains 55 sections, 18 equations, 26 figures, 2 tables.

Figures (26)

  • Figure 1: Edify Image can generate photorealistic high-resolution images from text prompts. Our models support a range of capabilities, including (a) Text-to-image generation, (b) Finetuning, (c) Generation with additional control, and (d) Panorama generation. For (b) and (c), an example of a finetuning image and the control input are provided in the bottom left corner, respectively.
  • Figure 2: Laplacian diffusion for multi-resolution image generation. (Top) Image Laplacian Decomposition. Each image sample ${\mathbf{x}}$ can be decomposed into a set of components. The example shows three components, ${\mathbf{x}} = {\mathbf{x}}^{(1)} + \text{up}({\mathbf{x}}^{(2)}) + \text{up}(\text{up}({\mathbf{x}}^{(3)}))$. This decomposition is implemented using basic upsampling and downsampling operations, where each component corresponds to different frequency bands. The function $\mu({\mathbf{x}}, t)$ represents a weighted sum of these components across different frequency spaces. (Middle) Forward Noising Process. Components are attenuated at different rates, with higher frequencies attenuated more rapidly than lower ones. We use the decaying background color in the top part of the figure to illustrate the attenuation factors. As a result, the signal-to-noise ratio (SNR) diminishes faster in the high-frequency components, allowing them to be discarded without significant loss of information once their attenuation coefficients approach zero. (Bottom) Backward Sampling Process. Denoisers are trained at multiple stages to generate images at various resolutions. We decompose the noise into a noise Laplacian pyramid. The Laplacian Diffusion process synthesizes higher-resolution images by first upsampling a lower-resolution noisy sample and then denoising it, with random noise injected into the corresponding components during upsampling. When operating solely at the lowest resolution, the process reduces to standard EDM.
  • Figure 3: Model architecture. As shown in the left panel, our diffusion models use a U-Net based architecture consisting of a sequence of residual blocks with skip connections. A wavelet transform at the beginning of the network reduces the spatial resolution of the images, and an inverse wavelet transform at the end restores it. In the right panel, we show how the $256$- and $1K$-resolution models are combined in a two-stage cascade to generate the $1024$-resolution image.
  • Figure 4: Samples generated by our text-to-image model with 16:9, 1:1 and 9:16 aspect ratios.
  • Figure 5: Long prompt generation. Edify Image can faithfully generate images from long descriptive prompts.
  • ...and 21 more figures
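
The decomposition and forward noising described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The caption only says that "basic upsampling and downsampling operations" are used, so 2x2 average pooling for `down` and nearest-neighbour repetition for `up` are assumptions. The exponential `attenuation` schedule, its `rates`, and the `sigma` argument are hypothetical choices made only to show higher-frequency components fading faster than lower ones. All function names are illustrative.

```python
import numpy as np


def down(x: np.ndarray) -> np.ndarray:
    """Halve resolution by 2x2 average pooling (assumed operator; needs even sides)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def up(x: np.ndarray) -> np.ndarray:
    """Double resolution by nearest-neighbour repetition (assumed operator)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def laplacian_decompose(x: np.ndarray, levels: int = 3) -> list:
    """Split x into [x1, ..., xL]; x1 is the finest band, xL the coarsest image."""
    comps = []
    cur = x
    for _ in range(levels - 1):
        low = down(cur)
        comps.append(cur - up(low))  # detail lost by downsampling at this scale
        cur = low
    comps.append(cur)  # remaining low-frequency image at the lowest resolution
    return comps


def weighted_sum(comps, weights) -> np.ndarray:
    """Compute sum_k w_k * up^(k-1)(x^(k)), accumulating from coarse to fine."""
    weights = list(weights)
    out = weights[-1] * comps[-1]
    for c, w in zip(reversed(comps[:-1]), reversed(weights[:-1])):
        out = up(out) + w * c
    return out


def attenuation(t: float, rates=(8.0, 2.0, 0.5)) -> np.ndarray:
    """Hypothetical schedule: band k is scaled by exp(-rate_k * t), finest band first."""
    return np.exp(-np.asarray(rates) * t)


def forward_noise(x, t, sigma, rng, levels: int = 3) -> np.ndarray:
    """Noisy sample: attenuated weighted sum of the bands plus Gaussian noise."""
    comps = laplacian_decompose(x, levels)
    mu = weighted_sum(comps, attenuation(t)[:levels])
    return mu + sigma * rng.standard_normal(x.shape)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
comps = laplacian_decompose(x, levels=3)

# With all weights equal to 1 the sum recovers the image:
# x = x1 + up(x2) + up(up(x3)).
assert np.allclose(weighted_sum(comps, [1.0, 1.0, 1.0]), x)

x_t = forward_noise(x, t=0.5, sigma=0.3, rng=rng)
print([c.shape for c in comps], x_t.shape)
```

Once the weights of the finer bands are close to zero, `weighted_sum` depends almost entirely on the coarsest component. This is the property the caption relies on when it says those components can be discarded without significant loss of information.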