ZipIR: Latent Pyramid Diffusion Transformer for High-Resolution Image Restoration

Yongsheng Yu, Haitian Zheng, Zhifei Zhang, Jianming Zhang, Yuqian Zhou, Connelly Barnes, Yuchen Liu, Wei Xiong, Zhe Lin, Jiebo Luo

TL;DR

ZipIR tackles ultra-high-resolution image restoration by combining a Latent Pyramid VAE (LP-VAE) with a high-capacity Diffusion Transformer (DiT) that operates on a compact latent representation of the full image. By encoding the input into a latent space downsampled by $32\times$, ZipIR reduces the token count and the quadratic cost of attention, which enables training on full 2K images with a 3B-parameter DiT and a pixel-aware decoder that preserves fine details. Empirical results on 16× and 8× restoration of degraded inputs show that ZipIR achieves faster inference (about 6.9 s per image at $2048^2$ px) and better perceptual quality (lower LPIPS/FID) than prior diffusion-based IR methods, while remaining robust on real-world low-quality (LQ) data. These findings suggest that ZipIR offers a practical, scalable path to high-fidelity restoration at ultra-high resolutions and lays the groundwork for larger models and higher compression in future work.

Abstract

Recent progress in generative models has significantly improved image restoration capabilities, particularly through powerful diffusion models that offer remarkable recovery of semantic details and local fidelity. However, deploying these models at ultra-high resolutions faces a critical trade-off between quality and efficiency due to the computational demands of long-range attention mechanisms. To address this, we introduce ZipIR, a novel framework that enhances efficiency, scalability, and long-range modeling for high-resolution image restoration. ZipIR employs a highly compressed latent representation that compresses images by 32×, effectively reducing the number of spatial tokens and enabling the use of high-capacity models such as the Diffusion Transformer (DiT). Toward this goal, we propose a Latent Pyramid VAE (LP-VAE) design that structures the latent space into sub-bands to ease diffusion training. Trained on full images up to 2K resolution, ZipIR surpasses existing diffusion-based methods, offering unmatched speed and quality in restoring high-resolution images from severely degraded inputs.
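
To make the token-count argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: it assumes a square $2048^2$ input, one token per latent position (patch size 1), and an 8× latent as the comparison point, which is the downsampling factor used by common latent diffusion VAEs rather than a baseline named in the text above. The helper names `latent_tokens` and `attention_pairs` are hypothetical.

```python
def latent_tokens(image_side: int, downsample: int, patch: int = 1) -> int:
    """Spatial tokens for a square image after VAE downsampling.

    Assumes one token per (patch x patch) block of latent positions;
    patch=1 is an illustrative choice, not a value from the paper.
    """
    side = image_side // (downsample * patch)
    return side * side


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by full self-attention (quadratic in tokens)."""
    return tokens * tokens


IMAGE_SIDE = 2048  # 2K resolution, as in the abstract

tokens_8x = latent_tokens(IMAGE_SIDE, 8)    # assumed comparison point
tokens_32x = latent_tokens(IMAGE_SIDE, 32)  # compression stated for ZipIR

token_ratio = tokens_8x // tokens_32x
pair_ratio = attention_pairs(tokens_8x) // attention_pairs(tokens_32x)

print(f"8x latent:  {tokens_8x} tokens")
print(f"32x latent: {tokens_32x} tokens")
print(f"token reduction: {token_ratio}x, attention-pair reduction: {pair_ratio}x")
```

Under these assumptions the 32× latent has 16 times fewer tokens than an 8× latent, and full self-attention scores 256 times fewer token pairs. Actual savings depend on the patch size and attention implementation, which this excerpt does not specify.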

Paper Structure

This paper contains 21 sections, 26 figures, 6 tables.

Figures (26)

  • Figure 1: 20$\times$ super-resolution at $2048^2$ px resolution.
  • Figure 2: 16$\times$ super-resolution at $2048^2$ px resolution.
  • Figure 3: 8$\times$ restoration at $2048^2$ px resolution.
  • Figure 4: Inference time at $2048^2$ px resolution (in seconds).
  • Figure 5: Model scalability measured by the diffusion model parameters (in Millions).
  • ...and 21 more figures