Helios: Real Real-Time Long Video Generation Model

Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, Li Yuan

TL;DR

Helios is the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. It also introduces infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption.

Abstract

We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to -- or lower than -- those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
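
To put the memory claim in perspective, the following back-of-the-envelope sketch computes the weight footprint of 14B-parameter models. It is not the authors' method: the 2-bytes-per-parameter (bf16) storage format and the helper name `weights_gb` are assumptions made here for illustration, and activations, gradients, and optimizer state are ignored.

```python
# Rough weight-memory arithmetic for the "four 14B models within 80 GB" claim.
# Assumptions (not from the paper): bf16 storage at 2 bytes per parameter,
# decimal gigabytes (1 GB = 1e9 bytes), weights only.

def weights_gb(num_params: float, bytes_per_param: float, num_models: int = 1) -> float:
    """Memory needed for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param * num_models / 1e9

PARAMS = 14e9      # 14B parameters, as stated in the abstract
BUDGET_GB = 80.0   # GPU memory budget, as stated in the abstract

one_bf16 = weights_gb(PARAMS, 2)                  # one model, bf16 weights
four_bf16 = weights_gb(PARAMS, 2, num_models=4)   # four models, bf16 weights

# Average bytes per parameter that could stay resident if all four models
# had to share the 80 GB budget at the same time.
budget_bytes_per_param = BUDGET_GB * 1e9 / (4 * PARAMS)

print(f"one model (bf16):   {one_bf16:.1f} GB")
print(f"four models (bf16): {four_bf16:.1f} GB")
print(f"budget per param:   {budget_bytes_per_param:.2f} bytes")
```

Under these assumptions, four naive bf16 copies would need more than 80 GB for weights alone, which suggests why the abstract highlights infrastructure-level memory optimizations. The abstract itself does not specify which techniques are used.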

Paper Structure

This paper contains 49 sections, 22 equations, 23 figures, and 7 tables.

Figures (23)

  • Figure 1: End-to-end throughput (FPS) of various video generation models on a single H100. The results are obtained at the same resolution with all official acceleration techniques enabled, including FlashAttention, torch.compile, and KV-cache. Helios is substantially faster than models at the same scale and matches the speed of smaller distilled ones.
  • Figure 2: Benchmark performance of Helios and its counterparts. For both short- and long-video generation, Helios consistently outperforms existing distilled models while achieving performance comparable to that of base models.
  • Figure 3: Showcases of infinite videos generated by Helios. Despite overhead comparable to that of the 1.3B models (Wan, LongLive, Rolling Forcing, Reward Forcing, Causal Forcing), Helios still excels in visual quality, text alignment, and motion dynamics.
  • Figure 4: Architecture of Helios. Helios is an autoregressive video diffusion transformer built with Guidance Attention blocks. It reduces overhead by compressing historical and noisy context through Multi-Term Memory Patchification and Pyramid Unified Predictor Corrector, while unifying T2V, I2V, and V2V tasks via Representation Control.
  • Figure 5: Visualization of three representative drifting patterns in long-video generation.
  • ...and 18 more figures