LUVE: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, Ying Tai
TL;DR
LUVE tackles ultra-high-resolution (UHR) video generation by decomposing it into three stages: low-resolution motion generation, which establishes robust motion priors; video latent upsampling, which scales resolution efficiently in the latent space; and high-resolution content refinement, in which dual frequency experts enhance both semantic coherence and fine textures. The low-frequency expert focuses on global semantic fidelity and the high-frequency expert sharpens details; each is trained with targeted data curation and parameter-efficient LoRA adapters. A lightweight video latent upsampler (VLUer) provides continuous, scalable upsampling within the latent domain and is supervised by a combination of latent and pixel-level losses to ensure temporal coherence. Experiments on UltraVideo-derived benchmarks show LUVE achieving state-of-the-art realism, detail, and alignment for UHR video generation, with ablations confirming that each component is necessary and complementary and that the cascade is more efficient than end-to-end approaches. By reducing memory overhead and improving semantic and textural quality, this work makes UHR video synthesis more practical, with potential impact on digital humans, AR/VR, and cinematic content production.
Abstract
Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address this challenge, we propose \textbf{LUVE}, a \textbf{L}atent-cascaded \textbf{U}HR \textbf{V}ideo generation framework built upon dual frequency \textbf{E}xperts. LUVE employs a three-stage architecture: low-resolution motion generation for motion-consistent latent synthesis; video latent upsampling, which increases resolution directly in the latent space to reduce memory and computational overhead; and high-resolution content refinement, which integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at \href{https://unicornanrocinu.github.io/LUVE_web/}{https://unicornanrocinu.github.io/LUVE_web/}.
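
The three-stage cascade described above can be illustrated with a shape-only sketch. This is not the authors' implementation: every function name, the 8x spatial compression factor, the 16 latent channels, the nearest-neighbour upsampler, and the sequential application of the two experts are assumptions made purely for illustration, since the abstract does not specify them. The sketch only shows how tensor shapes flow through the stages and why upsampling latents involves fewer elements than upsampling pixels.

```python
import numpy as np

# Hypothetical values; the abstract does not state the VAE configuration.
SPATIAL_FACTOR = 8     # assumed spatial compression of the video VAE
LATENT_CHANNELS = 16   # assumed number of latent channels


def low_res_motion_latent(frames, height, width, seed=0):
    """Stage 1 stand-in: a random latent of shape (frames, channels, h, w)."""
    rng = np.random.default_rng(seed)
    shape = (frames, LATENT_CHANNELS,
             height // SPATIAL_FACTOR, width // SPATIAL_FACTOR)
    return rng.standard_normal(shape).astype(np.float32)


def upsample_latent(latent, scale):
    """Stage 2 stand-in: nearest-neighbour spatial upsampling of the latent."""
    return latent.repeat(scale, axis=2).repeat(scale, axis=3)


def refine(latent, experts):
    """Stage 3 stand-in: apply each expert callable in turn (an assumption)."""
    for expert in experts:
        latent = expert(latent)
    return latent


def identity_expert(latent):
    """Placeholder for a low- or high-frequency expert; returns its input."""
    return latent


frames, scale = 4, 4
low = low_res_motion_latent(frames, height=512, width=960)
high = upsample_latent(low, scale)
refined = refine(high, [identity_expert, identity_expert])

# Element count of the equivalent RGB video at the upsampled resolution.
pixel_height = high.shape[2] * SPATIAL_FACTOR
pixel_width = high.shape[3] * SPATIAL_FACTOR
pixel_elements = frames * 3 * pixel_height * pixel_width
ratio = pixel_elements / high.size

print(low.shape, high.shape, (pixel_height, pixel_width), ratio)
```

Under these assumed factors the upsampled latent holds 12 times fewer elements than the corresponding RGB frames, which gives a rough sense of the memory argument for operating in the latent space; the actual savings depend on the VAE the framework uses.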
