Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation

Feng Zhou, Pu Cao, Yiyang Ma, Lu Yang, Jianqin Yin

TL;DR

This work identifies inconsistent position encoding as the root cause of repetitive and disordered patterns when generating high-resolution images with a pre-trained diffusion U-Net. It introduces Progressive Boundary Complement (PBC), a training-free approach that inserts hierarchical virtual boundaries and employs valued-padding to enhance the propagation of position information from feature-map edges to central regions. Through quantitative analyses and extensive experiments on SD-XL, PBC yields superior high-resolution image quality and content richness, including non-square outputs, with modest computational overhead. The method provides a simple, architecture-agnostic route to improve high-resolution diffusion-based image generation with enriched content and structural coherence.

Abstract

Denoising higher-resolution latents via a pre-trained U-Net leads to repetitive and disordered image patterns. Although recent studies try to improve generative quality by aligning the denoising process across the original and higher resolutions, the root cause of this suboptimal generation remains underexplored. Through comprehensive analysis of position encoding in U-Net, we attribute the problem to inconsistent position encoding, caused by inadequate propagation of position information from zero-padding to latent features in convolution layers as resolution increases. To address this issue, we propose a novel training-free approach, the Progressive Boundary Complement (PBC) method. This method creates dynamic virtual image boundaries inside the feature map to enhance position information propagation, enabling high-quality, content-rich high-resolution image synthesis. Extensive experiments demonstrate the superiority of our method.
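
The propagation argument above can be illustrated with a toy model. The following is a minimal sketch, not the paper's analysis: it assumes a stack of plain 3×3 all-ones convolutions with zero-padding (no downsampling, attention, or learned weights, all of which a real U-Net has) applied to a constant feature map. A cell whose value differs from the padding-free value has been reached by information from the zero-padded border. The helper names `conv3x3_zero_pad` and `border_aware_fraction` are hypothetical.

```python
def conv3x3_zero_pad(grid):
    """Apply a 3x3 all-ones convolution with zero-padding (stride 1, same size)."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    # Cells outside the grid are zero-padding and add nothing.
                    if 0 <= yy < h and 0 <= xx < w:
                        total += grid[yy][xx]
            out[y][x] = total
    return out


def border_aware_fraction(size, num_layers):
    """Fraction of cells whose value was influenced by the zero-padded border."""
    grid = [[1] * size for _ in range(size)]
    for _ in range(num_layers):
        grid = conv3x3_zero_pad(grid)
    # Value a cell would have if it never saw any padding.
    untouched = 9 ** num_layers
    aware = sum(1 for row in grid for value in row if value != untouched)
    return aware / (size * size)


if __name__ == "__main__":
    for size in (8, 16, 32):
        print(size, border_aware_fraction(size, num_layers=2))
```

With the depth held fixed, the share of cells reached by border information shrinks as the feature map grows, which is the toy analogue of position information failing to reach central regions at higher resolutions.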

Paper Structure

This paper contains 29 sections, 5 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Directly generating high-resolution images using a pre-trained Latent Diffusion Model results in repetitive (right) and disordered (left) patterns.
  • Figure 2: Trench-style Zero-padding Technique. The left-side graph diagram illustrates the process of applying bi-directional zero-padding to the feature map in the convolution operation. The three images on the right show 1024×1024 resolution outputs using this technique, with the corresponding split sketches displayed below.
  • Figure 3: Padding Type Analysis. We evaluate the effect of different padding types.
  • Figure 4: Position Information Correction. We applied unidirectional zero-padding to the central region of the feature map to facilitate faster propagation of position information. The images were generated at a resolution of 1024×1024, with the central region measuring 512×512. Prompt: A photo of a teddy bear riding a bike in Times Square.
  • Figure 5: Position Information Quantification. We extract the feature from the last layer of U-Net and map it to a target position map with a trainable linear layer. The loss of the linear layer reflects how much position information the feature contains.
  • ...and 8 more figures
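
The probing idea in the Figure 5 caption can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's setup: it uses a single scalar feature per position and a closed-form least-squares fit in place of a trained linear layer, and the function name `linear_probe_loss` is hypothetical. A low residual loss means the feature linearly encodes position; a high loss means it does not.

```python
def linear_probe_loss(features, targets):
    """Fit targets ~ w * feature + b by least squares and return the MSE."""
    n = len(features)
    mean_f = sum(features) / n
    mean_t = sum(targets) / n
    var_f = sum((f - mean_f) ** 2 for f in features)
    cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    # A constant feature carries no information, so the best fit is the mean.
    w = cov / var_f if var_f > 0 else 0.0
    b = mean_t - w * mean_f
    return sum((w * f + b - t) ** 2 for f, t in zip(features, targets)) / n


if __name__ == "__main__":
    positions = [float(p) for p in range(8)]        # target position map (1D)
    informative = [2.0 * p + 1.0 for p in positions]  # feature that encodes position
    uninformative = [1.0] * len(positions)            # feature with no position signal
    print(linear_probe_loss(informative, positions))
    print(linear_probe_loss(uninformative, positions))
```

In this toy case the uninformative feature leaves the probe with a loss equal to the variance of the target positions, since the best it can do is predict their mean.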