InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information

Guohui Zhang, Jiangtong Tan, Linjiang Huang, Zhonghang Yuan, Mingde Yao, Jie Huang, Feng Zhao

TL;DR

Diffusion models excel at image synthesis but struggle with variable-scale generation due to differences in information content across resolutions. The authors propose InfoScale, a training-free framework comprising Progressive Frequency Compensation, Adaptive Information Aggregation via Dual-Scaled Attention, and Noise Adaptation to address information loss, inflexible aggregation, and distribution misalignment. Through extensive experiments on multiple diffusion models and unseen resolutions, InfoScale demonstrates improved quality, detail, and consistency with competitive or faster inference. This work provides a unified, information-centric view and practical, plug-and-play tools for robust variable-scale diffusion-based generation.

Abstract

Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. The key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires the information conversion procedures to vary with the target scale. In this paper, we investigate three critical aspects of DMs for a unified analysis of variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs loses high-frequency information in higher-resolution generation; 2) attention struggles to adjust information aggregation adaptively in variable-scaled image generation; 3) the spatial distribution of information in the initial noise is misaligned with the variable-scaled image. To solve these problems, we propose InfoScale, an information-centric framework for variable-scaled image generation that effectively utilizes information in each of these three aspects. For the information loss in 1), we introduce the Progressive Frequency Compensation module to compensate for the high-frequency information lost by dilated convolution in higher-resolution generation. For the inflexible information aggregation in 2), we introduce the Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and to achieve an effective balance between local and global information in higher-resolution generation. For the information distribution misalignment in 3), we design the Noise Adaptation module to redistribute information in the initial noise for variable-scaled generation. Our method is plug-and-play for DMs, and extensive experiments demonstrate its effectiveness in variable-scaled image generation.
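The aggregation issue in 2) can be made concrete with a small sketch. The abstract does not give the exact form of the dual-scaled attention, so the code below only illustrates a general property of softmax attention: multiplying the logits by a factor below 1 flattens the weights (each query aggregates from more keys), while a factor above 1 sharpens them. The function names and the single `scale_multiplier` parameter are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def attention_weights(q, k, scale_multiplier=1.0):
    """Softmax attention weights with an extra multiplier on the logits.

    q: (num_queries, dim), k: (num_keys, dim).
    scale_multiplier is a hypothetical knob: 1.0 gives the standard
    1/sqrt(dim) scaling, smaller values flatten the distribution.
    """
    dim = q.shape[-1]
    logits = (q @ k.T) * scale_multiplier / np.sqrt(dim)
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def mean_entropy(weights):
    """Average Shannon entropy (in nats) of the attention rows.

    Higher entropy means each query spreads its attention over more keys.
    """
    return float(-(weights * np.log(weights + 1e-12)).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((8, 16))
    k = rng.standard_normal((32, 16))
    for m in (0.5, 1.0, 2.0):
        print(m, mean_entropy(attention_weights(q, k, m)))
```

For fixed logits, the entropy of the softmax does not increase as the multiplier grows, which is why a scaling factor is a natural handle for widening or narrowing aggregation at resolutions the model was not trained on.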

Paper Structure

This paper contains 16 sections, 7 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The left and right figures illustrate that higher-resolution images contain a greater proportion of high-frequency components and a larger amount of information.
  • Figure 2: Information loss in dilated convolution. During the steps that use dilated convolution, the information amount decreases significantly, indicating that dilated convolution reduces redundant information, while frequency analysis shows that this information includes some high-frequency components. Vanilla refers to no dilated convolution.
  • Figure 3: The top and bottom figures illustrate the model's inflexible aggregation ability for lower- and higher-resolution generation, respectively. We use dual scaled factors to achieve wider aggregation for lower-resolution generation, and vice versa for higher-resolution generation.
  • Figure 4: The left figure illustrates that increasing the variance to adjust the information distribution of the initial noise promotes information aggregation in lower-resolution generation. The higher-resolution generation in the right figure is the opposite.
  • Figure 5: Overall framework of InfoScale. (a) In higher-resolution generation, the Noise Adaptation (NA) module first modulates the initial noise according to resolution. Then, the Progressive Frequency Compensation (PFC) module extracts high-frequency components from the cached noise of the previous timestep to compensate for the predicted noise at the current timestep when applying dilated convolution. The Adaptive Information Aggregation module further fuses local (blue) and global (red) information. (b) In lower-resolution generation, we also use the NA module and replace the original self-attention layer with DSAttn.
  • ...and 2 more figures
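
The compensation step described in the Figure 5 caption (take high-frequency components from the cached noise of the previous timestep and use them to compensate the current prediction) can be sketched with a Fourier-domain split. This is a minimal illustration under stated assumptions, not the paper's implementation: the ideal circular low-pass mask, the `cutoff` and `weight` parameters, and the choice to add the cached high-frequency band to the current prediction are all hypothetical.

```python
import numpy as np


def split_frequencies(x, cutoff):
    """Split a 2D array into low- and high-frequency parts.

    cutoff is a hypothetical parameter in (0, 1): the fraction of the
    Nyquist frequency below which content counts as "low frequency".
    """
    h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]  # cycles per sample, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_mask = radius <= cutoff * 0.5  # ideal (brick-wall) low-pass mask
    low = np.fft.ifft2(np.fft.fft2(x) * low_mask).real
    high = x - low
    return low, high


def compensate(current_pred, cached_pred, cutoff=0.5, weight=1.0):
    """Add the high-frequency band of a cached prediction to the current one.

    Assumption: compensation is additive. The caption only says the cached
    high-frequency components "compensate" the current predicted noise.
    """
    _, cached_high = split_frequencies(cached_pred, cutoff)
    return current_pred + weight * cached_high
```

Because the mask is an ideal projection, the low-frequency band of the result is exactly that of the current prediction, and only the high-frequency band is affected by the cached one. A smoother filter (for example a Gaussian) would trade this clean separation for fewer ringing artifacts.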