LUVE: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts

Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, Ying Tai

TL;DR

LUVE tackles ultra-high-resolution (UHR) video generation by decomposing the process into three stages: low-resolution motion generation to establish motion priors, video latent upsampling to raise resolution efficiently in the latent space, and high-resolution content refinement driven by dual frequency experts to improve both semantic coherence and fine textures. The low-frequency expert focuses on global semantic fidelity, while the high-frequency expert sharpens details; each is trained with targeted data curation and parameter-efficient LoRA adapters. A lightweight video latent upsampler (VLUer) enables continuous, scalable upsampling within the latent domain, supervised by a combination of latent and pixel-level losses to ensure temporal coherence. Experiments on UltraVideo-derived benchmarks show LUVE achieving state-of-the-art realism, detail, and alignment for UHR video generation; ablations confirm that each component is necessary and complementary, and that the cascade is more efficient than end-to-end approaches. By reducing memory overhead while improving semantic and textural quality, this work makes UHR video synthesis more practical, with potential impact on digital humans, AR/VR, and cinematic content production.

Abstract

Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture: low-resolution motion generation for motion-consistent latent synthesis; video latent upsampling, which performs resolution upsampling directly in the latent space to mitigate memory and computational overhead; and high-resolution content refinement, which integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project page is available at https://unicornanrocinu.github.io/LUVE_web/.
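
To make the low-frequency versus high-frequency distinction concrete, the following is a minimal sketch of how a single frame can be separated into a coarse (low-frequency) component and a detail (high-frequency) residual using a radial mask in the Fourier domain. It is an illustration only: the abstract does not say how LUVE's experts define or separate frequency bands, and the function name `split_frequencies` and the `cutoff` parameter are hypothetical.

```python
import numpy as np


def split_frequencies(frame: np.ndarray, cutoff: float = 0.1):
    """Split a 2D grayscale frame into low- and high-frequency parts.

    `cutoff` is a hypothetical radius in cycles per pixel; it is not a
    value taken from the paper.
    """
    h, w = frame.shape
    # Frequency coordinates for each FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Radial low-pass mask: keep bins whose frequency radius is small.
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    spectrum = np.fft.fft2(frame)
    # The mask is symmetric, so the inverse transform is real up to
    # floating-point error; drop the negligible imaginary part.
    low = np.fft.ifft2(spectrum * low_mask).real
    # Everything the low-pass filter removed is the detail residual.
    high = frame - low
    return low, high


rng = np.random.default_rng(0)
frame = rng.random((64, 64))
low, high = split_frequencies(frame)
```

By construction the two parts sum back to the original frame, and the mean intensity lives entirely in the low-frequency part. This loosely mirrors the division of labour described above, where one expert handles global structure and the other handles fine texture.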

Paper Structure

This paper contains 25 sections, 5 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: "Base" denotes the pretrained T2V model used in the first stage of our framework [wan2025wan]. Compared with existing VSR methods, our model not only produces videos that are noticeably sharper and richer in fine details but, more importantly, also significantly improves semantic consistency and plausibility. This demonstrates that UHR generation goes beyond enhancing visual sharpness: it also improves semantic coherence and content fidelity. (Zoom in for best view.)
  • Figure 2: Scaling T2V models to UHR scenarios introduces several challenges. In motion modeling, models tend to produce static outputs, failing to capture coherent temporal dynamics. In semantic planning, both global and local repetitions emerge, reflecting insufficient semantic understanding. Finally, in detail synthesis, the generated frames often suffer from motion blur and texture degradation.
  • Figure 3: Overview of the LUVE framework. (a) and (b) illustrate the core distinction between existing cascaded high-resolution video generation architectures and our proposed paradigm: previous methods focus on high-resolution detail refinement, whereas our approach prioritizes high-resolution content and semantic fidelity. (c) LUVE consists of three collaborative stages: low-resolution motion generation (LMG), video latent upsampling (VLU), and high-resolution content refinement (HCR).
  • Figure 4: Framework comparison. (a) Existing latent interpolation framework. (b) Existing RGB interpolation framework. (c) Our framework, based on our video latent upsampler (VLUer).
  • Figure 5: Visual analysis of the key components in VLUer. These results demonstrate that our decoder effectively alleviates blurriness, while our $\mathcal{L}_{\text{pixel}}$ successfully mitigates blocky artifacts.
  • ...and 6 more figures
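
The TL;DR states that the VLUer is supervised by a combination of latent and pixel-level losses, and the Figure 5 caption credits $\mathcal{L}_{\text{pixel}}$ with reducing blocky artifacts. A minimal sketch of such a combined objective is given below. The distance function (mean absolute error), the weighting `pixel_weight`, and the stand-in decoder `toy_decode` are all assumptions made for illustration; the source text does not specify any of them.

```python
import numpy as np


def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error (an assumed choice of distance)."""
    return float(np.mean(np.abs(a - b)))


def combined_loss(pred_latent, target_latent, decode, target_pixels,
                  pixel_weight: float = 1.0) -> float:
    """Latent-space term plus a weighted pixel-space term.

    `decode` maps latents to pixels; `pixel_weight` is hypothetical.
    """
    latent_term = l1(pred_latent, target_latent)
    # The pixel term compares decoded predictions against target frames,
    # so errors that are only visible after decoding are penalized too.
    pixel_term = l1(decode(pred_latent), target_pixels)
    return latent_term + pixel_weight * pixel_term


def toy_decode(latent: np.ndarray) -> np.ndarray:
    """Stand-in decoder: nearest-neighbour 2x upsampling of (T, H, W)."""
    return latent.repeat(2, axis=-2).repeat(2, axis=-1)


# Toy data: 4 frames of 8x8 latents, decoded to 16x16 "pixels".
target_latent = np.zeros((4, 8, 8))
pred_latent = np.ones((4, 8, 8))
target_pixels = toy_decode(target_latent)

loss = combined_loss(pred_latent, target_latent, toy_decode, target_pixels)
half_weight_loss = combined_loss(pred_latent, target_latent, toy_decode,
                                 target_pixels, pixel_weight=0.5)
```

In a real training setup the decoder would be the video VAE decoder and both terms would be differentiable tensors in a deep learning framework; NumPy is used here only to keep the sketch self-contained.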