ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers

Yiyang Ma, Feng Zhou, Xuedan Yin, Pu Cao, Yonghao Dang, Jianqin Yin

TL;DR

ResDiT tackles high-resolution image synthesis with pre-trained Diffusion Transformers by showing that position embeddings (PEs) govern spatial layout while attention range affects detail quality. It introduces a training-free framework that splits attention into a global branch with scaled PEs for layout and a local patch-based branch for texture, complemented by minimum-overlap partitioning, Gaussian splicing, and patch-wise spectral fusion to combine the two outputs. The approach reports competitive performance at 3072×3072 without base-resolution guidance, integrates with control mechanisms such as ControlNet, and supports arbitrary aspect ratios while preserving high-quality local details. Ablations indicate that the PES, PIPE, and PSF components are each necessary, supporting ResDiT as a simple, training-free approach to high-resolution diffusion generation.

Abstract

Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.
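The PE scaling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes rotary-style embeddings computed from token positions, and it assumes that "scaling" means linearly compressing high-resolution token coordinates back into the coordinate range seen during training. The function names, the grid sizes, and the `theta` and `dim` values are all hypothetical.

```python
import numpy as np

def scaled_positions(hr_len: int, base_len: int) -> np.ndarray:
    """Map hr_len token indices onto the trained range [0, base_len - 1] (assumed linear scaling)."""
    if hr_len == 1:
        return np.zeros(1)
    return np.arange(hr_len) * (base_len - 1) / (hr_len - 1)

def rope_angles(positions: np.ndarray, dim: int, theta: float = 10000.0) -> np.ndarray:
    """Rotary angles for 1-D positions; result has shape (len(positions), dim // 2)."""
    freqs = theta ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, freqs)

base_len, hr_len = 64, 192  # hypothetical token-grid sizes along one axis

# Vanilla extrapolation: positions run up to 191, far outside the trained range.
vanilla = rope_angles(np.arange(hr_len), dim=16)

# Scaled variant: same number of tokens, but coordinates stay within [0, 63].
scaled = rope_angles(scaled_positions(hr_len, base_len), dim=16)
```

Under these assumptions, the scaled variant never asks the model to interpret a position it has not seen, which is one plausible reading of how rectified PEs could avoid layout collapse.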

Paper Structure

This paper contains 16 sections, 10 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Qualitative examples of the proposed ResDiT, which enables pre-trained T2I models to generate images at resolutions higher than their training resolution, without any training or fine-tuning. Best viewed zoomed in.
  • Figure 2: Disentangling PE and attention range in high-resolution DiT synthesis. Systematic interventions on positional embeddings (PEs) and attention range across resolutions. (a) At base resolution, a DiT with global attention and vanilla PE produces coherent layouts and fine details. (b) When directly applied to high resolution, layout collapse occurs as the subject becomes shrunken and misplaced due to a mismatch between PE and the attention field. (c) Using a scaled PE restores spatial arrangement but yields blurred details. (d) Applying patch-wise base-resolution PEs ensures correct local structure within each patch, yet details remain degraded. (e) Introducing patch-level local attention further enhances fine details. These results show that positional embeddings determine spatial arrangement, while the attention receptive-field scale governs detail fidelity in DiTs.
  • Figure 3: Overview of ResDiT. ResDiT restructures the vanilla attention mechanism in Diffusion Transformers (DiTs) into two complementary branches to enable training-free resolution scaling. Specifically, the global branch performs global attention with scaled positional embeddings to preserve the overall spatial layout, while the local branch applies patch-level attention to enhance fine-grained details. To maintain continuity across patches, we propose a Minimum-Overlap Partitioning strategy that ensures contextual consistency at patch boundaries and a Gaussian Weighting Splicing scheme that smoothly fuses overlapping regions without introducing grid artifacts. Finally, a Patch-Wise Spectral Fusion module combines the outputs of both branches in the frequency domain, merging low-frequency structural information from the global branch with high-frequency detail components from the local branch, resulting in high-fidelity and high-resolution generation.
  • Figure 4: Qualitative comparison with baselines. ResDiT achieves a coherent global structure without relying on base-resolution image information, while delivering richer and more delicate local details in high-resolution outputs than existing methods. We further compare ResDiT with state-of-the-art methods on their capacity to generate fine-grained local details. Best viewed zoomed in.
  • Figure 5: ResDiT integrates seamlessly with ControlNet, enabling precise structure-controlled generation at a resolution of 3072 × 3072. ResDiT also supports arbitrary aspect ratios, shown here with images at 2048 × 4096 and 4096 × 2048.
  • ...and 3 more figures
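
The local-branch components named in the Figure 3 caption (minimum-overlap partitioning, Gaussian-weighted splicing, and patch-wise spectral fusion) can be sketched in NumPy as follows. This is an illustrative guess at how such pieces might fit together, not the paper's code: the `min_overlap`, `sigma_frac`, and `cutoff` parameters, the hard radial frequency mask, and all function names are assumptions, and the sketch works on plain 2-D arrays rather than on attention outputs.

```python
import numpy as np

def patch_starts(length: int, patch: int, min_overlap: int = 0) -> list[int]:
    """Fewest evenly spread offsets that cover [0, length) with about min_overlap of overlap."""
    if length <= patch:
        return [0]
    span = length - patch
    n = -(-span // (patch - min_overlap)) + 1  # ceil division, plus the first patch
    return [(i * span) // (n - 1) for i in range(n)]

def gaussian_window(patch: int, sigma_frac: float = 0.25) -> np.ndarray:
    """2-D Gaussian weight map that peaks at the patch centre."""
    x = np.arange(patch) - (patch - 1) / 2
    g = np.exp(-0.5 * (x / (sigma_frac * patch)) ** 2)
    return np.outer(g, g)

def splice(tiles: dict, height: int, width: int, patch: int) -> np.ndarray:
    """Blend overlapping (patch, patch) tiles keyed by their (top, left) offsets."""
    acc = np.zeros((height, width))
    norm = np.zeros((height, width))
    w = gaussian_window(patch)
    for (top, left), tile in tiles.items():
        acc[top:top + patch, left:left + patch] += w * tile
        norm[top:top + patch, left:left + patch] += w
    return acc / norm  # weights are strictly positive wherever a tile covers

def spectral_fuse(global_out: np.ndarray, local_out: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Keep low frequencies of global_out and high frequencies of local_out for one patch."""
    fy = np.fft.fftfreq(global_out.shape[0])[:, None]
    fx = np.fft.fftfreq(global_out.shape[1])[None, :]
    low = (np.sqrt(fy ** 2 + fx ** 2) <= cutoff * 0.5).astype(float)  # 0.5 is Nyquist
    fused = low * np.fft.fft2(global_out) + (1.0 - low) * np.fft.fft2(local_out)
    return np.fft.ifft2(fused).real

# Usage on a hypothetical 192 x 192 grid with 64 x 64 patches.
starts = patch_starts(192, 64, min_overlap=8)
rng = np.random.default_rng(0)
image = rng.standard_normal((192, 192))
tiles = {(t, l): image[t:t + 64, l:l + 64] for t in starts for l in starts}
rebuilt = splice(tiles, 192, 192, 64)
fused = spectral_fuse(image[:64, :64], image[:64, :64])
```

Because every tile here is cut from the same array, splicing reproduces the input, which is a convenient sanity check; in a real pipeline the tiles would come from independently computed local attention, and the Gaussian weights would down-weight each tile near its border, where neighbouring tiles are most likely to disagree.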