PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

Hong-Phuc Lai, Phong Nguyen, Anh Tran

TL;DR

PixelRush is the first tuning-free framework for practical high-resolution text-to-image generation. It builds on the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles.

Abstract

Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, a 10$\times$ to 35$\times$ speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.

Paper Structure

This paper contains 23 sections, 6 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Unlocking High Resolution with PixelRush. Our tuning-free method leverages pretrained text-to-image models to generate high-fidelity, high-resolution images. In these examples, we extend the base model SDXL [podell2023sdxl] to generate 4K-resolution images. Each image was produced on a single A100-40GB GPU in under 20 seconds, demonstrating state-of-the-art quality and efficiency. Best viewed ZOOMED-IN.
  • Figure 2: An overview of the two-stage system for high-resolution generation with Cascade Upsampling. (a) Two-Stage System. A base diffusion model generates a low-resolution base image. This image then enters a cascade upsampling process that progressively upscales it to the target resolution. (b) The Cascade Step. Each cascade step doubles the height and width of an input image. First, the initial image at resolution $R$ is upscaled to $4R$ via interpolation in pixel space, creating a coarse image. This coarse image is then encoded by the VAE encoder to obtain a coarse latent. The coarse latent is enhanced with high-frequency details synthesized in our Refinement Stage, yielding a high-quality latent. Finally, this refined latent is decoded back to pixel space, producing a sharp, high-fidelity image at resolution $4R$.
  • Figure 3: The PixelRush Refinement Stage. Our refinement stage takes a coarse latent as input and first divides it into overlapping patches. These patches pass through our proposed partial inversion few-step pipeline (Sec. \ref{subsec:speed-up}), where DDIM inversion maps each patch to an intermediate noisy latent. A few-step diffusion model then refines these latents, synthesizing high-frequency details. Finally, the refined patches are processed by our Gaussian Filter Patches Blending & Noise Injection module (Sec. \ref{subsec:feathering} and Sec. \ref{subsec:noise_inject}) to produce a seamless, high-quality latent.
  • Figure 4: Training-free high-resolution pipelines synthesize images hierarchically.
  • Figure 5: Detail enhancement and artifact emergence. Compared to the original 1K image, the 2K image produced by our partial inversion few-step pipeline (Sec. \ref{subsec:speed-up}) successfully synthesizes high-fidelity details (green box). However, this process also introduces checkerboard (red box) and over-smoothing (yellow box) artifacts. Best viewed ZOOMED-IN.
  • ...and 3 more figures
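
The patch-based refinement described in the Figure 3 caption (split a coarse latent into overlapping patches, refine each patch, blend the results) can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the helper names `patch_starts`, `gaussian_window`, and `blend_patches`, the `sigma_frac` parameter, and the choice of a normalized Gaussian-weighted average as the blending rule are all hypothetical. The `refine` callback stands in for the partial DDIM inversion and few-step denoising applied to each patch, and noise injection is not modeled.

```python
import numpy as np


def patch_starts(size: int, patch: int, stride: int) -> list[int]:
    """Start offsets of overlapping windows that together cover [0, size)."""
    if size == patch:
        return [0]
    starts = list(range(0, size - patch, stride))
    starts.append(size - patch)  # last window sits flush with the border
    return starts


def gaussian_window(patch: int, sigma_frac: float = 0.3) -> np.ndarray:
    """2D Gaussian weight map peaked at the patch centre.

    sigma_frac is a hypothetical knob, not a value taken from the paper.
    """
    ax = np.arange(patch) - (patch - 1) / 2.0
    g = np.exp(-0.5 * (ax / (sigma_frac * patch)) ** 2)
    return np.outer(g, g)


def blend_patches(latent: np.ndarray, patch: int, stride: int, refine) -> np.ndarray:
    """Refine each overlapping patch of a (C, H, W) latent and merge the
    results with a normalized Gaussian-weighted average."""
    _, h, w = latent.shape
    if patch > min(h, w):
        raise ValueError("patch must not exceed the latent height or width")
    weight = gaussian_window(patch)
    acc = np.zeros_like(latent, dtype=np.float64)   # weighted sum of patches
    norm = np.zeros((h, w), dtype=np.float64)       # sum of weights per pixel
    for y in patch_starts(h, patch, stride):
        for x in patch_starts(w, patch, stride):
            tile = refine(latent[:, y:y + patch, x:x + patch])
            acc[:, y:y + patch, x:x + patch] += tile * weight
            norm[y:y + patch, x:x + patch] += weight
    return acc / norm


# Usage: with an identity "refiner", blending reproduces the input latent.
coarse = np.random.default_rng(0).normal(size=(4, 16, 16))
blended = blend_patches(coarse, patch=8, stride=4, refine=lambda t: t)
```

Because every pixel's weights are normalized to sum to one, patches that agree in their overlap are reproduced unchanged, while disagreements are averaged with more trust placed in patch centres than in patch borders. That is one common way to soften the seams that a hard patch boundary would otherwise leave.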