AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

Zhihang Lin, Mingbao Lin, Meng Zhao, Rongrong Ji

TL;DR

The paper tackles object repetition in patch-wise higher-resolution image generation with diffusion models. It introduces AccDiffusion, which decouples the image-level prompt into patch-content-aware prompts derived from cross-attention maps and adds dilated sampling with window interaction to improve global consistency. In training-free resolution-extrapolation experiments, AccDiffusion reports state-of-the-art metrics and less object repetition than baselines such as MultiDiffusion, ScaleCrafter, and DemoFusion. The approach enables higher-resolution generation without additional training cost, which matters for applications that need detailed, coherent imagery at large scales.

Abstract

This paper attempts to address the object repetition issue in patch-wise higher-resolution image generation. We propose AccDiffusion, an accurate method for patch-wise higher-resolution image generation without training. An in-depth analysis in this paper reveals that using an identical text prompt for different patches causes repeated object generation, while using no prompt compromises image details. Therefore, AccDiffusion, for the first time, proposes to decouple the vanilla image-content-aware prompt into a set of patch-content-aware prompts, each of which serves as a more precise description of an image patch. In addition, AccDiffusion introduces dilated sampling with window interaction for better global consistency in higher-resolution image generation. Experimental comparison with existing methods demonstrates that AccDiffusion effectively addresses the issue of repeated object generation and leads to better performance in higher-resolution image generation.
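The prompt decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `patch_prompts`, the mean-based threshold for "highly responsive" pixels, and the coverage threshold `c` are assumptions made for illustration, and the morphological cleanup mentioned in the Figure 4 caption is omitted for brevity.

```python
import numpy as np

def patch_prompts(words, attn_maps, patch, stride, c=0.3):
    """Build one prompt per patch from per-word cross-attention maps.

    words     : list of prompt words
    attn_maps : array of shape (len(words), H, W), one averaged map per word
    patch     : patch side length (in attention-map pixels)
    stride    : step between neighbouring patches
    c         : hypothetical minimum fraction of responsive pixels to keep a word
    """
    attn_maps = np.asarray(attn_maps, dtype=float)
    # Assumption: a pixel is "highly responsive" if it exceeds its map's mean.
    masks = attn_maps > attn_maps.mean(axis=(1, 2), keepdims=True)
    _, height, width = masks.shape
    prompts = {}
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            window = masks[:, top:top + patch, left:left + patch]
            coverage = window.mean(axis=(1, 2))  # responsive fraction per word
            kept = [w for w, frac in zip(words, coverage) if frac > c]
            prompts[(top, left)] = " ".join(kept)
    return prompts

# Toy example: "astronaut" responds in the top-left corner, "moon" in the bottom half.
astronaut = np.zeros((4, 4))
astronaut[:2, :2] = 1.0
moon = np.zeros((4, 4))
moon[2:, :] = 1.0
toy_prompts = patch_prompts(["astronaut", "moon"], [astronaut, moon], patch=2, stride=2)
```

In this toy setup each patch keeps only the words whose responsive region overlaps it, so a patch that shows no astronaut is no longer told to draw one.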

Paper Structure

This paper contains 28 sections, 12 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Comparison of image quality and GPU overhead for existing higher-resolution generation methods. The GPU memory of Attn-SF [jin2023logn] and ScaleCrafter [he2023scalecrafter] increases significantly with resolution, while patch-wise denoising methods, e.g., MultiDiffusion [bar2023multidiffusion] and DemoFusion [du2023demofusion], suffer from object repetition. Best viewed zoomed in.
  • Figure 2: Image-content-aware prompt vs. patch-content-aware prompt.
  • Figure 3: Results of higher-resolution image generation. (a) The result of DemoFusion without a text prompt. (b) The result of DemoFusion without residual connection and dilated sampling. (c) The result of dilated sampling without window interaction. (d) The result of our dilated sampling with window interaction.
  • Figure 4: Visualization of the averaged attention map from the up blocks and down blocks in U-Net. We reshape the attention map into a 2D shape before visualization. (a) Cross-attention map visualization using the open-source code of [hertz2022prompt-to-prompt]. (b) Highly responsive regions of each word. (c) Illustration of the patch-level prompt generation process, including morphological operations to eliminate small connected areas. Here we use the word "Astronaut" as an example. All words in the prompt go through the same process.
  • Figure 5: Illustration of dilated sampling with window interaction: $8 \times 8$ higher-resolution and $4 \times 4$ low-resolution. The numbers $\{1,2,3,4\}$ represent the different positions within the same window (same color). The interaction operation is performed within each window.
  • ...and 8 more figures
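
The Figure 5 setting (an $8 \times 8$ grid split into $2 \times 2$ windows, giving four $4 \times 4$ dilated samples) can be mimicked with a short sketch. This is one reading of the caption, not the authors' code: treating "window interaction" as a random permutation of the positions inside each window is an assumption, and all function names here are hypothetical.

```python
import numpy as np

def to_windows(x, s):
    """Group an (H, W) array into (H/s, W/s, s*s): one row of values per window."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).transpose(0, 2, 1, 3).reshape(h // s, w // s, s * s)

def from_windows(win, s):
    """Inverse of to_windows."""
    h, w, _ = win.shape
    return win.reshape(h, w, s, s).transpose(0, 2, 1, 3).reshape(h * s, w * s)

def window_shuffle(x, s, rng):
    """Assumed interaction step: permute the s*s positions inside every window."""
    win = to_windows(x, s)
    perm = np.argsort(rng.random(win.shape), axis=-1)  # independent permutation per window
    return from_windows(np.take_along_axis(win, perm, axis=-1), s), perm

def window_unshuffle(x, perm, s):
    """Undo window_shuffle using the stored permutations."""
    inverse = np.argsort(perm, axis=-1)
    return from_windows(np.take_along_axis(to_windows(x, s), inverse, axis=-1), s)

def dilated_samples(x, s):
    """Split x into s*s global samples, each taking every s-th pixel."""
    return [x[i::s, j::s] for i in range(s) for j in range(s)]

rng = np.random.default_rng(0)
latent = np.arange(64, dtype=float).reshape(8, 8)  # stand-in for a real latent
shuffled, perm = window_shuffle(latent, 2, rng)
samples = dilated_samples(shuffled, 2)              # four 4 x 4 global samples
restored = window_unshuffle(shuffled, perm, 2)
```

Without the shuffle, each dilated sample always sees the same position of every window; with it, the positions are mixed across samples, and the stored permutation restores the original layout afterwards. In a real pipeline the denoising of each sample would happen between the shuffle and the unshuffle.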