Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models

Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang

TL;DR

Diffusion-4K introduces a dedicated Aesthetic-4K benchmark and a wavelet-based fine-tuning approach to enable direct 4K image synthesis with latent diffusion models. By pairing a high-quality 4K dataset (Aesthetic-4K) and novel fine-grained metrics (GLCM Score, Compression Ratio) with a memory-efficient training strategy (Partitioned VAE and Haar-wavelet latent enhancement), the method achieves superior 4K fidelity and prompt adherence on SD3-2B and Flux-12B. The work demonstrates substantial improvements over prior 4K attempts and provides a practical framework for scalable, high-frequency detail-rich 4K generation, with broad implications for high-resolution content creation and benchmarking. Overall, Diffusion-4K offers a robust path to realistic 4K synthesis and a public benchmark to accelerate future research in ultra-high-resolution generative modeling.

Abstract

In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
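To make the wavelet idea concrete, the following is a minimal sketch of a single-level 2D Haar decomposition in pure Python. It is an illustration only: the function name `haar_dwt2`, the sub-band names, and the orthonormal scaling are assumptions made here, and the abstract does not specify how the sub-bands enter the fine-tuning objective.

```python
def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four sub-bands, each half the input size in both dimensions:
      low    - local average (low-frequency content)
      d_rows - top-minus-bottom difference within each 2x2 block
      d_cols - left-minus-right difference within each 2x2 block
      d_diag - diagonal difference within each 2x2 block
    The 1/2 factor makes the transform orthonormal (energy preserving).
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    low, d_rows, d_cols, d_diag = [], [], [], []
    for i in range(0, h, 2):
        r_low, r_rows, r_cols, r_diag = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_low.append((a + b + c + d) / 2)
            r_rows.append((a + b - c - d) / 2)
            r_cols.append((a - b + c - d) / 2)
            r_diag.append((a - b - c + d) / 2)
        low.append(r_low)
        d_rows.append(r_rows)
        d_cols.append(r_cols)
        d_diag.append(r_diag)
    return low, d_rows, d_cols, d_diag


# Hypothetical toy input standing in for one channel of a latent.
low, d_rows, d_cols, d_diag = haar_dwt2([[1, 2], [3, 4]])
```

The three detail sub-bands carry the high-frequency content that a detail-oriented training signal would presumably emphasize, while `low` holds a coarse, downsampled version of the input.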

Paper Structure

This paper contains 14 sections, 4 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Example results synthesized by our Diffusion-4K, emphasizing exceptional fine details in generated 4K images.
  • Figure 2: Analysis of GLCM Score$\uparrow$ / Compression Ratio$\downarrow$. Our indicators demonstrate strong alignment with human-centric perceptual cognition at the level of local patches.
  • Figure 3: Illustration of image-text samples in the Aesthetic-4K dataset, which includes high-quality images and precise text prompts generated by GPT-4o, distinguished by high aesthetics and fine details.
  • Figure 4: Reconstruction results of 4K images with partitioned VAEs of $F=16$.
  • Figure 5: Qualitative 4K image synthesis results of Diffusion-4K. Prompts are from Sora [Sora:2024:Online].
  • ...and 11 more figures
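
The two fine-detail indicators named in Figure 2 can be illustrated with a rough sketch. The paper's exact definitions are not given in this section, so the code below makes loud assumptions: it uses the entropy of a gray-level co-occurrence matrix (GLCM) as a stand-in texture statistic, and it measures compression ratio as raw size divided by compressed size, with `zlib` standing in for whatever codec the authors use. All function names are hypothetical.

```python
import math
import random
import zlib
from collections import Counter


def glcm(img, dx=1, dy=0):
    """Normalized co-occurrence probabilities of gray-level pairs at offset (dx, dy)."""
    h, w = len(img), len(img[0])
    counts = Counter()
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[(img[y][x], img[y2][x2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def glcm_entropy(img, dx=1, dy=0):
    """Shannon entropy (bits) of the GLCM; richer texture gives a higher value."""
    return -sum(p * math.log2(p) for p in glcm(img, dx, dy).values())


def compression_ratio(img):
    """Raw byte count over zlib-compressed byte count (assumed definition).

    Pixel values must be integers in 0..255.
    """
    raw = bytes(v for row in img for v in row)
    return len(raw) / len(zlib.compress(raw, 9))


# Two synthetic 64x64 grayscale patches: one flat, one full of fine detail.
flat = [[128] * 64 for _ in range(64)]
rng = random.Random(0)
noisy = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]

for name, patch in (("flat", flat), ("noisy", noisy)):
    print(name, glcm_entropy(patch), compression_ratio(patch))
```

Under these assumptions the flat patch has zero GLCM entropy and compresses very well, while the detailed patch scores high on entropy and barely compresses, which matches the directions of the arrows in the Figure 2 caption (higher GLCM Score, lower Compression Ratio).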