Consistency-Preserving Diverse Video Generation

Xinshuang Liu, Runfa Blark Li, Truong Nguyen

TL;DR

The paper tackles the dual challenge of achieving high cross-video diversity while preserving intra-video temporal coherence in text-to-video generation under limited compute. It introduces a consistency-preserving joint sampling framework for flow-matching video generators that combines a diversity gradient with a consistency-aware regulator, computed entirely in lightweight latent space to avoid decoder backpropagation. Latent-space embedding and interpolation models are trained to approximate video- and frame-level guidance without decoding, enabling efficient optimization of diversity and temporal coherence. Empirical results on a state-of-the-art flow-matching model show diversity comparable to strong baselines with substantially improved temporal consistency and color naturalness, suggesting practical gains for constrained video generation.

Abstract

Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
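The abstract's update rule ("applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective") can be read as a gradient projection. The sketch below illustrates that reading on plain Python lists and is not the authors' released code. The function name `remove_conflicting_component`, the flat-vector representation of latents, and the specific projection formula are assumptions made for illustration; the paper's exact rule is not given in this excerpt.

```python
def remove_conflicting_component(diversity_update, consistency_grad, eps=1e-12):
    """Drop the part of a diversity update that would lower consistency.

    Assumption (not from the paper): both arguments are flat lists of floats,
    and `consistency_grad` is the gradient of a temporal-consistency objective
    that we want to avoid decreasing.
    """
    # First-order change in the consistency objective caused by the update.
    dot = sum(d * g for d, g in zip(diversity_update, consistency_grad))
    if dot >= 0.0:
        # The update does not reduce consistency to first order: keep it.
        return list(diversity_update)

    norm_sq = sum(g * g for g in consistency_grad)
    if norm_sq < eps:
        # Degenerate gradient: nothing to project against.
        return list(diversity_update)

    # Subtract the projection onto the consistency gradient, which is the
    # only component responsible for the first-order decrease.
    scale = dot / norm_sq
    return [d - scale * g for d, g in zip(diversity_update, consistency_grad)]


# The second coordinate of the update opposes the consistency gradient,
# so only that component is removed.
print(remove_conflicting_component([1.0, -1.0], [0.0, 1.0]))  # [1.0, 0.0]
```

Under this reading, an update that already agrees with the consistency gradient passes through unchanged, so diversity is reduced only where it conflicts with temporal consistency.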

Paper Structure (13 sections, 18 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Joint video generation with enhanced cross-video diversity and preserved intra-video temporal consistency based on latent-space embedding and interpolation.
  • Figure 2: Illustration of training procedure for latent-space embedding and interpolation models.
  • Figure 3: Model behavior across flow-matching steps: (a) Extrapolation of $\hat{x}_1 = x_t + (1-t)v_\theta(x_t,t)$ from intermediate latent states and decoding to video frames. (b)–(e) Embedding alignment losses for our latent embedding models, compared to using a latent-mean baseline. (f) Frame interpolation loss for $M_c$, compared to simple baselines (previous frame, next frame, and mean of both).
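
The extrapolation in Figure 3(a), $\hat{x}_1 = x_t + (1-t)v_\theta(x_t,t)$, is a single linear step from the current latent to an estimate of the final one. The following minimal sketch applies that formula to flat Python lists. The function name `extrapolate_to_final` is hypothetical, and the check uses a straight-line path with a constant velocity as a stand-in for the learned $v_\theta$, which is an assumption made only so the example is self-contained.

```python
def extrapolate_to_final(x_t, velocity, t):
    """One-step estimate x_hat_1 = x_t + (1 - t) * v, following the caption.

    Assumption: t runs from 0 (noise) to 1 (final latent), as the formula
    implies; `velocity` stands in for the model output v_theta(x_t, t).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [x + (1.0 - t) * v for x, v in zip(x_t, velocity)]


# Toy check on a straight-line path x_t = (1 - t) * x0 + t * x1, whose
# velocity is the constant x1 - x0. The estimate then recovers x1 exactly.
x0, x1, t = [0.0, 2.0], [4.0, -2.0], 0.25
x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
v = [b - a for a, b in zip(x0, x1)]
print(extrapolate_to_final(x_t, v, t))  # [4.0, -2.0]
```

With a learned velocity field the estimate is generally inexact at small $t$ and becomes more accurate as $t$ approaches 1, where the remaining step $(1-t)$ shrinks to zero.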