LoViC: Efficient Long Video Generation with Context Compression

Jiaxiu Jiang; Wenbo Li; Jingjing Ren; Yuping Qiu; Yong Guo; Xiaogang Xu; Han Wu; Wangmeng Zuo

TL;DR

LoViC tackles long-form video generation with diffusion transformers by introducing a context-compression pipeline that reduces the quadratic cost of self-attention. It couples the FlexFormer autoencoder with a single learnable query token and Interpolated-RoPE to compress arbitrary-length video-text context, enabling segment-wise generation for prediction, interpolation, retrodiction, and multi-shot tasks. The approach is trained on a million-scale open-domain video corpus and demonstrates improved temporal coherence and scalability compared to strong baselines, while maintaining competitive non-reference quality with significantly fewer parameters. This work advances practical long-range video synthesis by enabling flexible context conditioning and unified handling of multiple generation tasks within a single DiT-based framework.

Abstract

Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
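
To make the scaling argument concrete, the following is a minimal sketch of how a linearly adjustable compression rate could shrink the attention cost of a long context. It is an illustration under stated assumptions, not the paper's implementation: the function names, the rule "one query per `compression_rate` context tokens", and the evenly spaced query positions (a stand-in for the idea behind Interpolated-RoPE) are all hypothetical.

```python
import math
from typing import List


def num_query_tokens(context_len: int, compression_rate: float) -> int:
    """Assumed rule: one compressed token per `compression_rate` context tokens."""
    return max(1, math.ceil(context_len / compression_rate))


def interpolated_positions(context_len: int, num_queries: int) -> List[float]:
    """Spread query positions evenly over the context span [0, context_len - 1].

    Hypothetical stand-in for position interpolation: the same learnable query,
    replicated `num_queries` times, would be told apart only by these positions.
    """
    if num_queries == 1:
        return [(context_len - 1) / 2]
    step = (context_len - 1) / (num_queries - 1)
    return [i * step for i in range(num_queries)]


def attention_pairs(num_tokens: int) -> int:
    """Pairwise interactions in full self-attention (quadratic in length)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    context_len, segment_len, rate = 1024, 256, 4  # illustrative sizes only
    n_q = num_query_tokens(context_len, rate)
    full = attention_pairs(context_len + segment_len)
    compressed = attention_pairs(n_q + segment_len)
    print(f"compressed context tokens: {n_q}")
    print(f"attention pairs, full context: {full}")
    print(f"attention pairs, compressed context: {compressed}")
    print(f"first query positions: {interpolated_positions(context_len, n_q)[:3]}")
```

With these illustrative sizes, attending over the compressed context plus the new segment involves 512 tokens rather than 1280, so the pairwise attention count falls by a factor of 6.25. Doubling `rate` roughly halves the number of compressed tokens, which is the sense in which the compression rate is linearly adjustable in this sketch.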
