Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model

Jiajie Yang

TL;DR

This paper introduces the Pyramid Diffusion Model, a novel architecture for ultra-high-resolution image synthesis. It synthesizes images at 2K resolution for the first time, demonstrated on two new datasets of 2048x2048-pixel and 2048x1024-pixel images respectively.

Abstract

We introduce the Pyramid Diffusion Model (PDM), a novel architecture designed for ultra-high-resolution image synthesis. PDM uses a pyramid latent representation, which provides a broader design space and enables more flexible, structured, and efficient perceptual compression. This in turn allows the AutoEncoder and the diffusion network to be equipped with branches and deeper layers. To strengthen PDM on generative tasks, we propose integrating Spatial-Channel Attention and Res-Skip Connection, along with applying Spectral Norm and a Decreasing Dropout Strategy to the diffusion network and the AutoEncoder. In summary, PDM synthesizes images at 2K resolution for the first time, demonstrated on two new datasets of 2048x2048-pixel and 2048x1024-pixel images respectively. We believe this work offers an alternative approach to designing scalable image generative models, while also providing incremental improvements that reinforce existing frameworks.

Paper Structure (17 sections, 8 equations, 13 figures, 1 table)

Introduction
Related Work
Method
Rectified Flow
Latent Rectified Flow Models
Pyramid Diffusion Models
Experiment
Dataset
Unconditioned Image Generation
Visualization of Pyramid Latent
Conclusion
Details on Implementation
Unconditional Samples
More samples on SCAPES2K
More samples on People2K
...and 2 more sections

Figures (13)

Figure 1: The Pyramid Diffusion Model can generate ultra-high-resolution images unconditionally. The displayed image contains 2048x1024 pixels.
Figure 2: Spatial-Channel attention computes spatial attention and channel attention concurrently. After obtaining the two weighted features, we re-weight them by a ratio of importance measured by the scale of pixels or channels. Because it handles channel features and image features simultaneously, Spatial-Channel attention can adapt to sharp changes in resolution and channel count.
Figure 3: (Left) The figure illustrates the ratio of importance of spatial attention and channel attention across different layers in the neural network. Following the conventional CNN design, down-sampling is accompanied by an increase in channel dimensions. A noticeable fluctuation in importance between channel and spatial attention is observed in deeper layers of the network compared to the initial layers. (Right) The table presents the data corresponding to the left figure.
Figure 4: The Residual and Skip Connection design is illustrated in the up-sampling stream (above the gray dotted line) and the down-sampling stream (below the gray dotted line). The images in the left column depict the structure of Input/Output skips, the middle column illustrates the structure of Residual nets, and the right column describes the structure of Res-Skip connections. We implement the Up-sampling and Down-sampling techniques as outlined in song2020score in the Pyramid NCSNpp. Meanwhile, in the Decoder, we employ nearest neighbor interpolation for up-sampling, and in the Encoder, we use pooling (max-pooling or average pooling) for down-sampling. Additionally, tRGB and fRGB modules are utilized to convert between RGB and high-dimensional per-pixel data.
Figure 5: The main architecture of the Pyramid AutoEncoder, which consists of several components. The components enclosed in blue boxes represent the Input/Output Skip mechanism proposed in styleganplus. The input image is sent to an Input-Skip stream and a down-sampling stream, where it is encoded into a latent representation. The latent representation is then forwarded to the Decoder for image reconstruction. Note that our AutoEncoder design assigns different importance to the Encoder and the Decoder: the Encoder adopts lightweight components, heavy regularization, and a non-branch design, while the Decoder uses heavyweight components, light regularization, and a branch design.
...and 8 more figures
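
To make the Figure 2 caption more concrete, the following is a minimal NumPy sketch of one way spatial and channel attention could be computed concurrently and then blended. It is an illustration under stated assumptions, not the paper's implementation: the function name `spatial_channel_attention`, the absence of learned query/key/value projections, and in particular the mixing rule (weights proportional to the pixel count N = H*W and the channel count C) are all hypothetical choices made here, since the caption does not give the exact formula.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_channel_attention(x):
    """Blend spatial and channel self-attention over a (C, H, W) feature map.

    Hypothetical sketch: no learned projections, and the blending ratio
    below is an assumption, not the formula used in the paper.
    Returns (output, spatial_ratio, channel_ratio).
    """
    c, h, w = x.shape
    n = h * w
    tokens = x.reshape(c, n)  # (C, N): one column per pixel

    # Spatial attention: every pixel attends to every other pixel.
    spatial_scores = tokens.T @ tokens / np.sqrt(c)                 # (N, N)
    spatial_out = (softmax(spatial_scores, axis=-1) @ tokens.T).T   # (C, N)

    # Channel attention: every channel attends to every other channel.
    channel_scores = tokens @ tokens.T / np.sqrt(n)                 # (C, C)
    channel_out = softmax(channel_scores, axis=-1) @ tokens         # (C, N)

    # ASSUMPTION: importance ratio taken from the relative scale of
    # pixels (N) versus channels (C); the two weights sum to 1.
    spatial_ratio = n / (n + c)
    channel_ratio = c / (n + c)

    mixed = spatial_ratio * spatial_out + channel_ratio * channel_out
    return mixed.reshape(c, h, w), spatial_ratio, channel_ratio


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shallow = rng.standard_normal((4, 8, 8))   # few channels, many pixels
    deep = rng.standard_normal((64, 2, 2))     # many channels, few pixels
    for name, feat in (("shallow", shallow), ("deep", deep)):
        out, s_ratio, c_ratio = spatial_channel_attention(feat)
        print(name, out.shape, round(s_ratio, 3), round(c_ratio, 3))
```

Under this assumed rule, down-sampling (smaller N) combined with wider layers (larger C) shifts weight from the spatial branch toward the channel branch, which is the kind of layer-dependent balance Figure 3 reports; the actual ratios in the paper may be computed differently.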
