Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models

Jinho Jeong, Sangmin Han, Jinwoo Kim, Seon Joo Kim

TL;DR

This work tackles the challenge of generating very high-resolution images with diffusion models by addressing two key bottlenecks: manifold deviation during latent-space upsampling and insufficient texture in RGB upsampling. It introduces Latent Space Super-Resolution (LSR) to align low- and high-resolution latent manifolds and Region-wise Noise Addition (RNA) to inject detail in high-frequency regions, forming the LSRNA framework. Empirical results demonstrate that LSRNA improves both latent- and RGB-based reference methods (e.g., DemoFusion and Pixelsmith), achieving state-of-the-art scores across multiple resolutions with faster inference due to reduced denoising steps. The approach advances practical high-resolution diffusion-based generation and offers robust, edge-guided texture enhancement for real-world, megapixel-scale outputs.

Abstract

In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. Existing diffusion models struggle with scaling beyond their training resolutions, often leading to structural distortions or content repetition. Reference-based methods address these issues by upsampling a low-resolution reference to guide higher-resolution generation. However, they face significant challenges: upsampling in latent space often causes manifold deviation, which degrades output quality, while upsampling in RGB space tends to produce overly smoothed outputs. To overcome these limitations, LSRNA combines Latent space Super-Resolution (LSR) for manifold alignment and Region-wise Noise Addition (RNA) to enhance high-frequency details. Our extensive experiments demonstrate that integrating LSRNA outperforms state-of-the-art reference-based methods across various resolutions and metrics, while showing the critical role of latent space upsampling in preserving detail and sharpness. The code is available at https://github.com/3587jjh/LSRNA.
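To make the idea of region-wise noise addition more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the linear mapping from edge strength to noise level, the bounds `sigma_min` and `sigma_max`, and the helper names are all hypothetical. The edge map is assumed to be already computed (for example by a Canny detector) and resized to the latent's spatial size.

```python
import random

def normalize(edge_map):
    """Rescale a 2D edge-strength map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in edge_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in edge_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in edge_map]

def region_wise_noise(latent, edge_map, sigma_min=0.0, sigma_max=1.0, seed=0):
    """Add zero-mean Gaussian noise whose standard deviation grows with edge strength.

    latent: list of channels, each a 2D list with the same size as edge_map.
    sigma_min / sigma_max: hypothetical bounds, not values taken from the paper.
    """
    rng = random.Random(seed)
    weights = normalize(edge_map)
    noisy = []
    for channel in latent:
        out = []
        for row, w_row in zip(channel, weights):
            out.append([
                x + (sigma_min + (sigma_max - sigma_min) * w) * rng.gauss(0.0, 1.0)
                for x, w in zip(row, w_row)
            ])
        noisy.append(out)
    return noisy

# Toy example: one latent channel of size 2x3, with edges on the right-hand side.
edges = [[0.0, 0.0, 1.0],
         [0.0, 0.5, 1.0]]
latent = [[[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]]
noisy = region_wise_noise(latent, edges)
```

With `sigma_min=0.0`, flat regions (edge strength 0) pass through unchanged, while positions on strong edges receive the largest perturbation. This mirrors the stated goal of injecting detail only where high-frequency content is expected.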

Paper Structure

This paper contains 28 sections, 5 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Comparisons of 16$\times$ image generation with and without the LSRNA framework. Our proposed LSRNA framework improves reference-based higher-resolution image generation, enhancing detail and sharpness beyond the native resolution of SDXL (Podell et al., 2023) ($1024^2$) while achieving faster generation speeds.
  • Figure 2: Comparison of DemoFusion with different upsampling strategies. All methods are directly upsampled to 16$\times$ resolution. (a) Latent space bicubic upsampling causes manifold deviation, degrading output quality. (b) RGB space bicubic upsampling produces outputs with reduced detail and sharpness. (c) Our learned latent-space upsampling aligns the manifold, resulting in sharp and detailed outputs. Best viewed ZOOMED-IN.
  • Figure 3: Framework comparison. (a) Existing latent upsampling frameworks rely on progressive upsampling to address manifold deviation. (b) Existing RGB upsampling frameworks can upsample directly (optionally progressively), but produce smooth outputs. (c) Our framework enables latent upsampling without progressive upscaling and with far fewer denoising steps ($T_c<T$) while producing detailed outputs (RNA omitted for simplicity). LR, MR, HR: low/mid/high resolution; DM: Diffusion Model.
  • Figure 4: Overview of LSRNA. The proposed LSRNA enhances reference upsampling with Latent space Super-Resolution (LSR) and Region-wise Noise Addition (RNA). LSR directly maps the low-resolution reference latent onto the high-resolution manifold. RNA then injects region-adaptive noise into the mapped reference, guided by a Canny edge map, which facilitates detail generation in the higher-resolution generation stage.
  • Figure 5: Qualitative comparisons across reference-based methods at 2K and 4K resolutions.
  • ...and 8 more figures
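
The captions above mention 16$\times$ generation from SDXL's native $1024^2$ resolution as well as 2K and 4K outputs. The short sketch below works through the corresponding tensor sizes. It assumes that the scale factor counts pixels (so 16$\times$ means 4$\times$ per side) and that the latent comes from a VAE with 8$\times$ spatial downsampling and 4 channels, which is the usual SDXL configuration but is not stated in this summary. The function names are hypothetical.

```python
import math

def target_side(base_side, area_scale):
    """Side length after multiplying the pixel count by area_scale (a perfect square)."""
    side_scale = math.isqrt(area_scale)
    if side_scale * side_scale != area_scale:
        raise ValueError("area_scale must be a perfect square")
    return base_side * side_scale

def latent_shape(side, vae_factor=8, channels=4):
    """Latent tensor shape (C, H, W) for a square image, under the assumed VAE settings."""
    if side % vae_factor != 0:
        raise ValueError("side must be divisible by vae_factor")
    return (channels, side // vae_factor, side // vae_factor)

for scale in (1, 4, 16):
    side = target_side(1024, scale)
    print(f"{scale}x -> {side}x{side} pixels, latent {latent_shape(side)}")
```

Under these assumptions, upsampling the reference in latent space for 16$\times$ generation means mapping a $128 \times 128$ latent to $512 \times 512$, instead of resizing a $1024 \times 1024$ RGB image to $4096 \times 4096$.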