One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Aleksandr Razin, Danil Kazantsev, Ilya Makarov

TL;DR

Diffusion models struggle to synthesize beyond their training resolutions: direct high-resolution sampling is slow, and post-hoc super-resolution (SR) introduces artifacts and extra latency. The authors propose the Latent Upscaler Adapter (LUA), a lightweight, drop-in module that upscales the generator's latent by a factor of $2$ or $4$ before decoding with a frozen VAE, enabling single-pass, high-resolution output without retraining the generator. Trained with a three-stage curriculum (latent-domain alignment, joint latent–pixel consistency, and edge-aware refinement) on a SwinIR-style backbone, LUA generalizes across VAEs (e.g., FLUX, SD3, SDXL) with minimal fine-tuning and achieves favorable latency-quality trade-offs compared to native high-resolution diffusion and pixel-space SR. Empirical results on OpenImages show state-of-the-art single-decode fidelity at $2048^2$ and $4096^2$ resolutions with the fastest runtimes among comparable approaches, while maintaining robust cross-model and multi-scale generalization. Overall, LUA offers a practical, deployment-friendly path to scalable, high-fidelity image synthesis in modern diffusion pipelines.

Abstract

Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator's latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
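As a rough illustration of the resolution bookkeeping behind this pipeline, the sketch below relates the base image size, the latent size, and the final decoded size. It assumes an 8x VAE downsampling factor, inferred from the $64{\times}64$ latent for a $512$ px image reported in the Figure 1 caption; the function names are hypothetical and not part of the paper's code.

```python
# Assumption: the VAE downsamples by 8 per side (512 px -> 64x64 latent, per Figure 1).
VAE_DOWNSAMPLE = 8


def latent_side(pixels: int, factor: int = VAE_DOWNSAMPLE) -> int:
    """Side length of the latent grid for a square image of `pixels` per side."""
    if pixels <= 0 or pixels % factor != 0:
        raise ValueError(f"image side {pixels} must be a positive multiple of {factor}")
    return pixels // factor


def upscaled_output(base_pixels: int, scale: int, factor: int = VAE_DOWNSAMPLE):
    """Return (base latent side, upscaled latent side, decoded image side).

    `scale` is the latent upscaling factor; only 2 and 4 are described in the text.
    """
    if scale not in (2, 4):
        raise ValueError("only x2 and x4 latent upscaling are described")
    base = latent_side(base_pixels, factor)
    upscaled = base * scale          # upscaling happens in latent space
    return base, upscaled, upscaled * factor  # a single decode maps back to pixels


if __name__ == "__main__":
    for base_px, s in [(512, 2), (512, 4), (1024, 2)]:
        print(base_px, s, upscaled_output(base_px, s))
```

Under these assumptions, a $512$ px generation with $\times2$ upscaling goes from a $64{\times}64$ latent to $128{\times}128$ and decodes to $1024$ px, which matches the sizes quoted in the figure captions.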

Paper Structure

This paper contains 47 sections, 21 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Our proposed lightweight Latent Upscaler Adapter (LUA) integrates into diffusion pipelines without retraining the generator/decoder and without an extra diffusion stage. The example uses a FLUX [batifol2025flux] generator: it produces a $64{\times}64$ latent for a $512$ px image (red dashed path decodes directly). Our path (green dashed) upsamples the same latent to $128{\times}128$ ($\times2$) or $256{\times}256$ ($\times4$) and decodes once to $1024$ px or $2048$ px, adding only $+0.42$ s (1K) and $+2.21$ s (2K) on an NVIDIA L40S GPU. LUA outperforms multi-stage high-resolution pipelines while avoiding their extra diffusion passes, and achieves efficiency competitive with image-space SR at comparable perceptual quality, all via a single final decode.
  • Figure 2: Upscaling FLUX [batifol2025flux] outputs from $1024^2{\rightarrow}2048^2$. Columns: (1) base decode, (2) bicubic latent, (3) SwinIR image-space SR, (4) LUA latent-space SR. Top: runtime overhead vs. (1). Middle ($8\times$ crops): bicubic blurs/aliases; SwinIR sharpens but adds noise/texture drift; LUA preserves eyelashes and skin with stable edges. Bottom: Laplacian-variance maps (darker = less noise) with means; LUA attains the lowest residual noise and the smallest overhead via single-decode latent upscaling.
  • Figure 3: Cross-model $2\times$ latent upscaling with a single adapter. For SDXL [podell2023sdxlimprovinglatentdiffusion], SD3 [esser2024scaling], and FLUX [batifol2025flux], a $128{\times}128$ latent is upscaled to $256{\times}256$ by the same LUA and decoded once by each model’s native VAE to yield $2048^2$ images. SD3 and FLUX share $C{=}16$ latents; SDXL ($C{=}4$) is supported by changing only the first convolution. Insets show artifact-free detail preservation; green boxes mark $\times8$ zooms.
  • Figure 4: Architecture of the Latent Upscaler Adapter (LUA). A SwinIR-style backbone [liang2021swinir] is shared across scales; a $1{\times}1$ input conv adapts the VAE latent width ($C{=}16$ for FLUX/SD3; $C{=}4$ for SDXL). Scale-specific pixel-shuffle heads output $\times2$ or $\times4$ latents. At inference, the path selects the input adapter, runs the shared backbone, and activates the requested head. The schematic shows FLUX/SD3 $\times2$ and SDXL $\times4$.
  • Figure 5: Effect of the three-stage curriculum on latent reconstruction and decoded appearance (FLUX backbone). The $2{\times}4$ grid shows top: latent feature maps (channel 10, min–max normalized); bottom: corresponding $8{\times}$ zoomed decodes. Columns: (1) original low-resolution latent ($128^2$) and decode; (2–4) LUA upscaled latents to $256^2$ after Stage I–III with their decodes. Yellow boxes mark the zoomed region. From (2) to (4), decodes become less noisy and more structured; Stage III concentrates high-frequency energy around details, indicating that controlled latent noise aids faithful VAE decoding.
  • ...and 8 more figures
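
The head selection described in the Figure 4 caption can be sketched at the shape level as follows. This is a minimal, pure-Python illustration rather than the authors' implementation: `pixel_shuffle` follows the standard sub-pixel rearrangement (the same channel ordering as PyTorch's `nn.PixelShuffle`), the Swin backbone is omitted entirely, and `LATENT_CHANNELS` and `lua_output_shape` are hypothetical names. Whether the $\times4$ head uses one shuffle with $r{=}4$ or two $r{=}2$ stages is not stated here, so only the resulting shape is modeled.

```python
# Latent widths taken from the Figure 4 caption; the dict itself is a hypothetical helper.
LATENT_CHANNELS = {"flux": 16, "sd3": 16, "sdxl": 4}


def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Output pixel (c, y*r + i, x*r + j) is read from input channel c*r*r + i*r + j.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for col in range(w):
                        out[c][y * r + i][col * r + j] = src[y][col]
    return out


def lua_output_shape(model, height, width, scale):
    """Shape of the upscaled latent for a given VAE family and requested head."""
    if scale not in (2, 4):
        raise ValueError("only x2 and x4 heads are described")
    channels = LATENT_CHANNELS[model]  # selects the matching input adapter width
    # The shared backbone is assumed to preserve spatial size; the head expands it.
    return (channels, height * scale, width * scale)


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 channel under r=2.
    print(pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2))
    print(lua_output_shape("flux", 64, 64, 2))
    print(lua_output_shape("sdxl", 128, 128, 4))
```

The tiny `pixel_shuffle` example shows why such heads need `C * r * r` channels before the rearrangement: spatial resolution is gained by spending channels, so each scale needs its own final projection even though the backbone is shared.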