Zoomed In, Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution

Brian B. Moser, Stanislav Frolov, Tobias C. Nauen, Federico Raue, Andreas Dengel

TL;DR

A training-free approach that enables T2I diffusion models to generate 2K, 4K, and even 8K images, so that they can be applied to image SR tasks without a fixed resolution limit.

Abstract

Large-scale, pre-trained Text-to-Image (T2I) diffusion models have gained significant popularity in image generation tasks and have shown unexpected potential in image Super-Resolution (SR). However, most existing T2I diffusion models are trained with a resolution limit of 512x512, making scaling beyond this resolution an unresolved but necessary challenge for image SR. In this work, we introduce a novel approach that, for the first time, enables these models to generate 2K, 4K, and even 8K images without any additional training. Our method leverages MultiDiffusion, which distributes the generation across multiple diffusion paths to ensure global coherence at larger scales, and local degradation-aware prompt extraction, which guides the T2I model to reconstruct fine local structures according to its low-resolution input. These innovations unlock higher resolutions, allowing T2I diffusion models to be applied to image SR tasks without limitation on resolution.
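
The tiling idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that overlapping tile outputs are fused by plain averaging, and `denoise_tile` is a hypothetical stand-in for one denoising update of the T2I model conditioned on a tile-specific prompt. The names `tile_starts`, `make_tiles`, and `fuse_step`, as well as the tile size and stride, are illustrative choices only.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]
Tile = Tuple[int, int, int, int]  # (top, left, height, width)


def tile_starts(size: int, tile: int, stride: int) -> List[int]:
    """Start offsets of tiles covering [0, size); the last tile is flush with the edge."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # make sure the border is always covered
    return starts


def make_tiles(height: int, width: int, tile: int, stride: int) -> List[Tile]:
    """Overlapping tiles in row-major order (assumed layout, not taken from the paper)."""
    return [
        (top, left, min(tile, height), min(tile, width))
        for top in tile_starts(height, tile, stride)
        for left in tile_starts(width, tile, stride)
    ]


def fuse_step(
    latent: Grid,
    tiles: List[Tile],
    denoise_tile: Callable[[Grid, int], Grid],
) -> Grid:
    """Run the hypothetical denoise_tile on every crop and average overlapping outputs."""
    h, w = len(latent), len(latent[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for index, (top, left, th, tw) in enumerate(tiles):
        crop = [row[left:left + tw] for row in latent[top:top + th]]
        # The tile index lets the callback look up a prompt extracted for this tile only.
        out = denoise_tile(crop, index)
        for i in range(th):
            for j in range(tw):
                total[top + i][left + j] += out[i][j]
                count[top + i][left + j] += 1
    return [[total[i][j] / count[i][j] for j in range(w)] for i in range(h)]
```

In this sketch, a pixel covered by several tiles receives the mean of their outputs, which is what keeps neighbouring tiles consistent, while each tile can still be guided by its own local prompt through the index passed to `denoise_tile`.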

Paper Structure

This paper contains 18 sections, 1 equation, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of 4x Super-Resolution (SR) predictions using SeeSR + MultiDiffusion (MD) and our proposed method. While MD unlocks higher resolutions beyond 512$\times$512, our proposed strategy of extracting local degradation-aware prompts ensures local detail awareness, improving fine-grained structure restoration, as demonstrated in the stone regions.
  • Figure 2: Illustration of SeeSR (top) compared to our local degradation-aware method (bottom). While SeeSR is limited to a fixed image size of 512$\times$512, our method can technically upscale to any resolution due to two components: MultiDiffusion (MD) and local degradation-aware prompt extraction. The MD process is applied to overlapping tiles. Without local degradation-aware prompt extraction, the classifier guidance generates hallucinated details based on global prompts that describe the entire image, leading to inconsistencies in local tile content. Our approach incorporates local tag extraction and, thereby, provides tile-specific prompts, ensuring more accurate and coherent detail generation across the entire image.
  • Figure 3: Qualitative comparison of a 2K image (899; DIV2K Val) between LR, SeeSR+MD (PSNR$\uparrow$:25.273, SSIM$\uparrow$:0.769, LPIPS$\downarrow$:0.130), our method (PSNR$\uparrow$:26.217, SSIM$\uparrow$:0.794, LPIPS$\downarrow$:0.103), and HR. In general, we observe that our method reconstructs details in background objects better than SeeSR+MD (see light patterns in the lower left corner).
  • Figure 4: Qualitative comparison (886; DIV2K Val) between SeeSR+MD (PSNR$\uparrow$:26.252, SSIM$\uparrow$:0.766, LPIPS$\downarrow$:0.123) and our approach (PSNR$\uparrow$:28.115, SSIM$\uparrow$:0.802, LPIPS$\downarrow$:0.091). The global tags were "balustrade, bird, blue, fence, green, macaw, parrot, perch, pole, rail, sit, stand, yellow". While SeeSR+MD hallucinates bird patterns on the leaves in the background due to global prompt guidance, our approach preserves local coherence by reconstructing leaves more naturally. However, although the background is accurate in content, our method introduces finer, blur-free details that are not present in the original HR image.
  • Figure 5: Qualitative comparison (809; DIV2K Val) between SeeSR+MD (PSNR$\uparrow$:25.107, SSIM$\uparrow$:0.598, LPIPS$\downarrow$:0.076) and our approach (PSNR$\uparrow$:25.627, SSIM$\uparrow$:0.621, LPIPS$\downarrow$:0.069). The global tags were "animal, break, floor, grass, green, lay, lion, lush, man, mane, mouth, relax, tree". Similarly, SeeSR+MD hallucinates fur-like patterns in the brown dirt, leading to artifacts that degrade the visual quality and contribute to its inferior performance compared to our approach, which preserves the natural texture of the dirt more effectively. Once again, our method generates finer, sharper details that surpass the level found in the HR image.
  • ...and 3 more figures
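
The PSNR values quoted in the captions above follow the standard definition, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it for single-channel images stored as nested lists. The captions do not state which channel, colour space, or value range was used for evaluation, so the 8-bit range (`max_value=255.0`) and the function name `psnr` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def psnr(
    reference: Sequence[Sequence[float]],
    estimate: Sequence[Sequence[float]],
    max_value: float = 255.0,  # assumed 8-bit range; not stated in the captions
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_ref: List[float] = [v for row in reference for v in row]
    flat_est: List[float] = [v for row in estimate for v in row]
    if not flat_ref or len(flat_ref) != len(flat_est):
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means a smaller mean squared error against the HR reference, which is why the captions mark it with an upward arrow.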