Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, Lei Zhu

TL;DR

This work addresses the challenge of generating high-quality 2K videos with diffusion models, which is hampered by the quadratic complexity of attention and the large token counts at this resolution. Turbo2K combines (i) heterogeneous model distillation from a large teacher (e.g., 13B) to a smaller student operating in a higher-compression VAE latent space, (ii) a training objective that pairs a distillation loss $\mathcal{L}_{\mathrm{dis}}$ with a diffusion loss $\mathcal{L}_{\mathrm{diff}}$, jointly optimized to balance semantic fidelity and denoising accuracy, and (iii) a hierarchical two-stage synthesis in which low-resolution (LR) semantic guidance extracted from multi-level DiT features guides high-resolution (HR) generation, avoiding costly LR decoding. The key contributions are: 1) a distillation framework that transfers generative capabilities across heterogeneous latent spaces, 2) a two-stage LR-to-HR synthesis that uses feature-based guidance for structural coherence and fine detail, and 3) extensive experiments showing that Turbo2K achieves 5-second, 24fps, 2K video generation with up to ~40× speedups over state-of-the-art methods while surpassing larger models on VBench. These findings demonstrate a practical pathway to scalable, high-quality 2K video synthesis suitable for real-world applications, with efficient training and high-fidelity HR outputs.
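As a rough illustration of how the two loss terms could be optimized jointly, one common form is a weighted sum. The weight $\lambda$ and the additive form are assumptions made here for illustration; the summary above does not state how the terms are actually combined.

$$
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{diff}} + \lambda \, \mathcal{L}_{\mathrm{dis}}, \qquad \lambda \ge 0
$$

Under this assumed form, $\lambda = 0$ recovers plain diffusion training of the student, while larger $\lambda$ puts more weight on matching the teacher.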

Abstract

Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster for inference, making high-resolution video generation more scalable and practical for real-world applications.
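To make the quadratic-cost argument concrete, the following minimal sketch counts DiT tokens for a 5-second, 24fps clip under two latent configurations and compares their full self-attention cost. Every number in it is an assumption chosen for illustration, not taken from the paper: the 2560x1440 frame size (one common reading of "2K"), the compression factors, and the patch sizes, as well as the names `LatentConfig`, `token_count`, and `attention_cost_ratio`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentConfig:
    """Hypothetical VAE/patchify setup; values are illustrative, not the paper's."""
    spatial_down: int   # spatial compression factor per side
    temporal_down: int  # temporal compression factor
    patch: int = 1      # spatial patch size used when tokenizing latents


def token_count(width: int, height: int, frames: int, cfg: LatentConfig) -> int:
    """Number of DiT tokens for one clip under the given compression."""
    w = width // (cfg.spatial_down * cfg.patch)
    h = height // (cfg.spatial_down * cfg.patch)
    t = max(1, frames // cfg.temporal_down)
    return w * h * t


def attention_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Ratio of pairwise interactions in full self-attention (quadratic in tokens)."""
    return (tokens_a / tokens_b) ** 2


# Assumed clip: 5 s at 24 fps, 2560x1440.
WIDTH, HEIGHT, FRAMES = 2560, 1440, 5 * 24

# Two assumed setups: moderate compression vs. high compression.
moderate = LatentConfig(spatial_down=8, temporal_down=4, patch=2)
high = LatentConfig(spatial_down=32, temporal_down=8, patch=1)

n_moderate = token_count(WIDTH, HEIGHT, FRAMES, moderate)
n_high = token_count(WIDTH, HEIGHT, FRAMES, high)

print(f"moderate compression: {n_moderate} tokens")
print(f"high compression:     {n_high} tokens")
print(f"attention cost ratio: {attention_cost_ratio(n_moderate, n_high):.0f}x")
```

With these assumed settings, an 8-fold reduction in tokens yields a 64-fold reduction in attention cost, which is why moving to a more compressed latent space matters so much at 2K.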

Paper Structure

This paper contains 15 sections, 6 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Turbo2K generates high-quality, detail-rich, and aesthetically pleasing videos while achieving significant speed advantages over existing methods. Please refer to our supplementary file for more videos.
  • Figure 2: Visualization of internal feature structures across timesteps for different video diffusion models. The features exhibit similar underlying semantic structures.
  • Figure 3: Generated results at different resolutions. The distilled base model performs well at 540p and 720p but degrades at higher resolutions, while our method maintains rich details and structural coherence at 2K resolution.
  • Figure 4: Turbo2K framework overview. Left: Heterogeneous model distillation aligns the student model’s internal representation with a larger teacher model to enhance semantic understanding and detail richness. Right: Two-stage synthesis first generates a low-resolution (LR) video, extracting semantic features to guide high-resolution (HR) generation.
  • Figure 5: Visual comparison with video super-resolution methods. Our approach produces high-resolution results with finer details and stronger semantic coherence. Unlike existing VSR methods that heavily depend on the LR input, our method maintains semantic alignment while refining structural details, ensuring improved fidelity and consistency in HR synthesis.
  • ...and 6 more figures