UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Tian Ye, Song Fei, Lei Zhu

TL;DR

UltraFlux addresses the challenges of native 4K text-to-image generation across diverse aspect ratios through a data–model co-design that tightly couples (i) a large, AR-diverse 4K dataset (MultiAspect-4K-1M) with rich metadata, (ii) a Flux-based DiT backbone enhanced with Resonance 2D RoPE and YaRN for AR-aware positional encoding, (iii) a post-trained VAE decoder (F16) for high-frequency fidelity, and (iv) an SNR-Aware Huber Wavelet objective with Stage-wise Aesthetic Curriculum Learning. Together, these components yield stable, high-fidelity 4K synthesis that generalizes across wide, square, and tall ARs, achieving state-of-the-art fidelity, aesthetics, and alignment on standard 4K benchmarks and performance competitive with proprietary methods when aided by a prompt refiner. The work demonstrates that jointly optimizing data, positional encoding, reconstruction, and loss design in the 4K regime yields non-additive gains that surpass isolated improvements, and it provides practical training details and analyses to enable community replication. This regime-level framework advances open-source capabilities for high-quality 4K content and lays the groundwork for broader data–model co-design in high-resolution generative modeling.

Abstract

Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval@4096 benchmark and in multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
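To give a rough sense of scale for the 4K regime described above, the sketch below computes the latent grid a VAE produces at a few resolutions. It rests on assumptions not stated in this summary: that "F16" denotes a 16x spatial downsampling factor, and that the example resolutions are representative. The helper name `latent_grid` and the resolution list are hypothetical illustrations, not values or code from the paper, and any further patchification inside the DiT is ignored.

```python
def latent_grid(width: int, height: int, factor: int = 16) -> tuple[int, int, int]:
    """Return (latent_width, latent_height, positions) for a VAE that
    downsamples each spatial side by `factor` (assumed 16 for an F16 VAE)."""
    if width % factor or height % factor:
        raise ValueError("image sides must be divisible by the VAE factor")
    lw, lh = width // factor, height // factor
    return lw, lh, lw * lh


# Hypothetical resolutions chosen for illustration only.
EXAMPLES = {
    "1K square": (1024, 1024),
    "4K square": (4096, 4096),
    "4K wide (16:9)": (5120, 2880),
    "4K tall (9:16)": (2880, 5120),
}

for name, (w, h) in EXAMPLES.items():
    lw, lh, n = latent_grid(w, h)
    print(f"{name}: {w}x{h} px -> {lw}x{lh} latent grid, {n} positions")
```

Under these assumptions, a 4096x4096 image maps to a 256x256 latent grid, 16 times as many positions as a 1024x1024 image at the same factor. Wide and tall grids also differ sharply in their per-axis extent (320x180 versus 180x320), which is the kind of variation an AR-aware positional encoding has to handle.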

Paper Structure

This paper contains 28 sections, 21 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Left: UltraFlux generates photorealistic 4K images across diverse aspect ratios and topics while maintaining high aesthetic quality and faithful content depiction with a single unified text-to-image model. Right: Our MultiAspect-4K-1M is a large-scale high-quality dataset for 4K image synthesis.
  • Figure 2: Data Pipeline overview.
  • Figure 3: Dataset example.
  • Figure 4: Dataset aspect and resolution analysis. All datasets use 10k samples. MultiAspect-4K-1M has a broader aspect ratio distribution.
  • Figure 5: Gemini-2.5-Flash preference comparison.
  • ...and 12 more figures