Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang

TL;DR

Presto introduces Segmented Cross-Attention (SCA), which lets a video diffusion model condition on multiple progressive text prompts and generate coherent, richly detailed 15-second videos. Presto splits the temporal latent states into segments and cross-attends each segment to its corresponding sub-caption; Overlap Segmented Cross-Attention (OSCA) is the variant adopted. SCA extends the diffusion transformer (DiT) architecture without adding parameters. The model is trained on the curated LongTake-HD dataset (261k pre-training clips; 47k fine-tuning clips). Presto reaches a VBench Semantic Score of 78.5% and a Dynamic Degree of 100%, which the authors report as outperforming existing state-of-the-art video generation methods, and a user study also favors it on scenario diversity and coherence.

Abstract

We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension and lets each segment cross-attend to a corresponding sub-caption. SCA requires no additional parameters, so it can be incorporated directly into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, each annotated with an overall video caption and five progressive sub-captions. Experiments show that Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. These results indicate that Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are available on our project page: https://presto-video.github.io/.
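
As a rough illustration of the SCA idea described in the abstract, the following dependency-free Python sketch splits a sequence of per-frame vectors into equal temporal segments and lets each segment cross-attend only to its own sub-caption. It is a simplification under stated assumptions, not the authors' implementation: each frame is reduced to a single vector, attention is single-head, the query/key/value projections of the DiT block are omitted, and the names `cross_attention` and `segmented_cross_attention` are hypothetical.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


def segmented_cross_attention(frames, sub_captions):
    """Split `frames` along time and attend each segment to one sub-caption.

    frames:       list of per-frame vectors, ordered by time (assumption:
                  one vector per frame instead of many spatial tokens).
    sub_captions: list of token-embedding lists, one per temporal segment.
    """
    n_frames, n_segments = len(frames), len(sub_captions)
    output = []
    for i, caption in enumerate(sub_captions):
        start = i * n_frames // n_segments
        end = (i + 1) * n_frames // n_segments
        if start < end:
            # Keys and values are both the caption tokens (no projections here).
            output.extend(cross_attention(frames[start:end], caption, caption))
    return output  # segment outputs concatenated back in temporal order


# Five sub-captions, each a single 2-d token whose first entry marks its index.
captions = [[[float(i), 0.0]] for i in range(5)]
frames = [[1.0, 1.0] for _ in range(10)]
attended = segmented_cross_attention(frames, captions)
```

With five sub-captions, matching the five progressive sub-captions per video in LongTake-HD, the ten frames fall into five two-frame segments, and each output row depends only on the sub-caption of its own segment. The sketch introduces no learned weights of its own, in line with the abstract's statement that SCA requires no additional parameters.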

Paper Structure

This paper contains 22 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Presto can generate long videos with rich content and long-range coherence.
  • Figure 2: (a) The overall architecture of Presto, which integrates multiple text inputs concurrently. (b) The Segmented Cross-Attention strategy has three variants: 1) Isolated Segmented Cross-Attention (ISCA) directly splits the hidden states along the temporal dimension; the output is the concatenation of the segment outputs. 2) Sequential Segmented Cross-Attention (SSCA), in which each segment sees all the previous text conditions; the overlapped regions are averaged and concatenated with the other regions. 3) Overlap Segmented Cross-Attention (OSCA), the variant adopted in our method, in which only frames at a segment boundary cross-attend to multiple text conditions.
  • Figure 3: Qualitative comparison with the baselines in our user study. Presto captures intricate text details and generates long videos with long-range coherence and rich content. In the first case, ours is the only method that captures the text detail "People hurry along the sidewalk", while the other methods fail to generate walking people. In the second case, our generated video shows the largest camera motion and the best scenario coherence.
  • Figure 4: Discarded and selected data samples at different filtering steps in LongTake-HD. We discard cases with similar keyframes and poor content diversity and filter out similar and negative captions. The selected cases have rich video content, coherent scenario motion, and progressive captions. The samples shown are from the LongTake-HD Pre-training set; more rigorous filtering is applied to build the LongTake-HD Fine-tuning set.
  • Figure 5: The progressive sub-captions and coherent video frames of our LongTake-HD dataset. Our captions are more detailed in camera motion, as highlighted in the red text.
  • ...and 4 more figures
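
The difference between the isolated and overlap variants named in the Figure 2 caption can be sketched as a mapping from each frame to the sub-captions it cross-attends to. This is only an illustrative reading of the caption: the helper name `captions_per_frame` and the `overlap` width (how many frames on each side of a boundary count as boundary frames) are assumptions, and how the outputs for multiple captions are combined is left out.

```python
def captions_per_frame(n_frames, n_segments, overlap=0):
    """Return, for every frame index, the sub-caption indices it attends to.

    overlap=0 gives an isolated split (one caption per frame); overlap>0
    widens each segment's window by `overlap` frames on both sides, so only
    frames near a segment boundary see two captions (hypothetical parameter).
    """
    bounds = [i * n_frames // n_segments for i in range(n_segments + 1)]
    mapping = [[] for _ in range(n_frames)]
    for i in range(n_segments):
        start = max(0, bounds[i] - overlap)
        end = min(n_frames, bounds[i + 1] + overlap)
        for t in range(start, end):
            mapping[t].append(i)
    return mapping


isolated = captions_per_frame(12, 3)
overlapped = captions_per_frame(12, 3, overlap=1)
```

Here `isolated` assigns frames 0-3, 4-7, and 8-11 to captions 0, 1, and 2 respectively, while `overlapped` additionally lets frames 3-4 see captions 0 and 1, and frames 7-8 see captions 1 and 2.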