LoViC: Efficient Long Video Generation with Context Compression

Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, Wangmeng Zuo

TL;DR

LoViC tackles long-form video generation with diffusion transformers by introducing a context-compression pipeline that reduces quadratic self-attention costs. It couples a flexible FlexFormer autoencoder with a single learnable query token and Interpolated-RoPE to compress arbitrary-length video-text context, enabling segment-wise generation for prediction, interpolation, retrodiction, and multi-shot tasks. The approach is trained on a million-scale open-domain video corpus and demonstrates improved temporal coherence and scalability compared to strong baselines, while maintaining competitive non-reference quality with significantly fewer parameters. This work advances practical long-range video synthesis by enabling flexible context conditioning and unified handling of multiple generation tasks within a single DiT-based framework.

Abstract

Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
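
To make the quadratic-cost argument concrete, the following back-of-the-envelope sketch counts query-key pairs in full self-attention with and without compressed context. The token counts, the compression ratio, and the function names are illustrative assumptions, not values or APIs from the paper. It also assumes the compressed context tokens are simply concatenated with the current segment's tokens before self-attention.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs evaluated by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def segment_cost(segment_tokens: int, context_tokens: int, compression_ratio: int) -> int:
    """Query-key pairs when a segment attends over itself plus compressed context.

    compression_ratio is a hypothetical knob: how many context tokens are
    summarized by one compressed token.
    """
    compressed = -(-context_tokens // compression_ratio)  # ceiling division
    return attention_pairs(segment_tokens + compressed)


if __name__ == "__main__":
    segment, context, ratio = 1000, 4000, 8  # illustrative sizes only
    full = attention_pairs(segment + context)
    compact = segment_cost(segment, context, ratio)
    print(f"uncompressed pairs: {full}")
    print(f"compressed pairs:   {compact}")
    print(f"reduction factor:   {full / compact:.2f}x")
```

Because the cost grows with the square of the sequence length, shrinking the context by a constant factor reduces the pair count by more than that factor whenever the context dominates the sequence.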

Paper Structure

This paper contains 26 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Videos generated by our model. Our model can flexibly continue a video in any direction and generate multi-shot videos efficiently. It retains ID consistency over a large temporal range and generates videos with large, smooth motion.
  • Figure 2: Memory and time usage of single timestep DiT inference. With context compression, our method reduces memory usage and runtime, allowing more frames to be generated within the same resource constraints.
  • Figure 3: Model architecture. The left part of the figure features an autoencoder consisting of a FlexFormer encoder and a FlexFormer decoder. The encoder compresses multiple segments of video and text tokens separately. The number of query tokens is derived from the video token sequence length. The query token sequence is formed by copying the single learnable token multiple times. The decoder decodes the context tokens into video and text features in a similar way. Each context video-text pair is compressed into a set of context tokens. The chunks of context tokens are concatenated and then fed into the DiT by further concatenating them with the input tokens of the self-attention layer.
  • Figure 4: Illustration of our positional encoding. Each block represents the positional index $(t, h, w)$ of the corresponding token. The illustrated compression strategy is uniform compression. To adapt to multi-shot generation, the blue and purple blocks are separated slightly.
  • Figure 5: Illustration of different compression strategies. Blue and purple dots represent the position of video tokens and query tokens respectively. Text tokens are omitted.
  • ...and 4 more figures
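
The query-token construction described in the Figure 3 caption and the uniform compression strategy of Figures 4 and 5 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. It works on a single temporal axis only and assumes the query count is the video length divided by a compression ratio, and that uniform compression spaces query positions evenly across the video positions. All function names and the `compression_ratio` parameter are hypothetical.

```python
from typing import List


def num_query_tokens(num_video_tokens: int, compression_ratio: int) -> int:
    """Derive the query count from the video token length (assumed linear rule)."""
    return max(1, -(-num_video_tokens // compression_ratio))  # ceiling division


def build_query_sequence(learnable_token: List[float], count: int) -> List[List[float]]:
    """Form the query sequence by copying the single learnable token count times."""
    return [list(learnable_token) for _ in range(count)]


def uniform_query_positions(num_video_positions: int, count: int) -> List[float]:
    """Spread count query positions evenly over [0, num_video_positions - 1].

    Positions may be fractional, which is why an interpolated positional
    encoding is needed to represent them (assumption about the mechanism).
    """
    if count == 1:
        return [(num_video_positions - 1) / 2]
    step = (num_video_positions - 1) / (count - 1)
    return [i * step for i in range(count)]


if __name__ == "__main__":
    video_len, ratio = 16, 4  # illustrative sizes only
    count = num_query_tokens(video_len, ratio)
    queries = build_query_sequence([0.1, 0.2], count)
    positions = uniform_query_positions(video_len, count)
    print(count, len(queries), positions)
```

Since every query starts as the same learnable vector, the copies are distinguished only by their assigned positions, which is what lets one token serve inputs of any length.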