Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation

Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang

TL;DR

This work tackles the vulnerability of latent video diffusion models to imperfect multimodal conditioning by introducing CAT-LVDM, a corruption-aware training framework. It proposes two structured perturbations: Batch-Centered Noise Injection (BCNI), which constrains noise to low-rank semantic directions, and Spectrum-Aware Contextual Noise (SACN), which constrains it to dominant spectral modes. The authors provide theoretical bounds showing that such rank-constrained perturbations tighten entropy, shrink Wasserstein distances, and accelerate mixing, and they report empirical gains across caption-rich and action-focused video datasets. The approach yields state-of-the-art or near-state-of-the-art performance on multiple benchmarks and offers a principled, scalable path to robust text-to-video generation under realistic, noisy conditioning.

Abstract

Latent Video Diffusion Models (LVDMs) achieve high-quality generation but are sensitive to imperfect conditioning, which causes semantic drift and temporal incoherence on noisy, web-scale video-text datasets. We introduce CAT-LVDM, the first corruption-aware training framework for LVDMs that improves robustness through structured, data-aligned noise injection. Our method includes Batch-Centered Noise Injection (BCNI), which perturbs embeddings along intra-batch semantic directions to preserve temporal consistency. BCNI is especially effective on caption-rich datasets like WebVid-2M, MSR-VTT, and MSVD. We also propose Spectrum-Aware Contextual Noise (SACN), which injects noise along dominant spectral directions to improve low-frequency smoothness, showing strong results on UCF-101. On average, BCNI reduces FVD by 31.9% across WebVid-2M, MSR-VTT, and MSVD, while SACN yields a 12.3% improvement on UCF-101. Ablation studies confirm the benefit of low-rank, data-aligned noise. Our theoretical analysis further explains how such perturbations tighten entropy, Wasserstein, score-drift, mixing-time, and generalization bounds. CAT-LVDM establishes a principled, scalable training approach for robust video diffusion under multimodal noise. Code and models: https://github.com/chikap421/catlvdm
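The two perturbations described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea of low-rank, data-aligned noise. The function names `bcni_like` and `sacn_like`, the corruption strength `rho`, the rank `k`, the uniform and Gaussian sampling choices, and the use of a batch-level SVD are all assumptions made for illustration.

```python
import numpy as np

def bcni_like(z, rho, rng):
    """Assumed BCNI-style noise: move each embedding along its own
    deviation from the batch mean (an intra-batch direction)."""
    centered = z - z.mean(axis=0, keepdims=True)           # (B, D) deviations from batch mean
    scale = rng.uniform(-1.0, 1.0, size=(z.shape[0], 1))   # one random scalar per sample
    return z + rho * scale * centered

def sacn_like(z, rho, k, rng):
    """Assumed SACN-style noise: confine noise to the top-k right
    singular directions of the embedding matrix."""
    _, s, vt = np.linalg.svd(z, full_matrices=False)       # singular values sorted descending
    k = min(k, s.shape[0])
    # Gaussian coefficients per sample, weighted by the spectrum (illustrative choice).
    coeffs = rng.standard_normal((z.shape[0], k)) * np.sqrt(s[:k])
    return z + rho * coeffs @ vt[:k]                       # noise has rank at most k

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 32))                           # toy batch: 8 embeddings of dimension 32
z_bcni = bcni_like(z, rho=0.1, rng=rng)
z_sacn = sacn_like(z, rho=0.1, k=4, rng=rng)
```

In both cases the perturbation lives in a subspace of dimension much smaller than the embedding dimension, in contrast to isotropic Gaussian or uniform noise, which spreads over all 32 dimensions of this toy example.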

Paper Structure

This paper contains 69 sections, 34 theorems, 218 equations, 13 figures, 8 tables, 1 algorithm.

Key Result

Proposition A.2

Let $\sigma_{z}^{2}=\lambda_{\min}\!\bigl(\operatorname{Cov}[Z]\bigr)\!>\!0$ and assume $\rho>0$. For BCNI or SACN corruption of rank $d$, whereas isotropic CEP attains the same bound with $D$ in place of $d$.

Figures (13)

  • Figure 1: Overview Image. We introduce corruption (BCNI, Gaussian, Uniform) and compare to the Clean baseline. We show visual generations in (a) and summarize quantitative scores across 13 metrics in (b). The full metric results are available in Appendix Table \ref{tab:full-results}. Generated videos are provided in the Supplementary Material.
  • Figure 2: Model Benchmark. Benchmark comparison of video generation quality on the MSR-VTT and UCF-101 datasets using FVD ($\downarrow$). Baseline results are adapted from prior works \cite{zhou2023magicvideoefficientvideogeneration, Zhang2025, wang2023modelscopetexttovideotechnicalreport, wang2023videocomposer, Xing_2024_CVPR, luo2023videofusiondecomposeddiffusionmodels, qiu2024freenoise, 10657833, NEURIPS2024_81f19c0e, hong2023cogvideo, 10203078, li2023videogenreferenceguidedlatentdiffusion, Wang2024, 10.1007/978-3-031-73033-7_12, ma2025latte, yu2024efficient, yu2022generating}.
  • Figure 3: Qualitative comparison of corruption types. Each video is generated with 16 frames. We sample and visualize 10 representative examples under various settings. Full videos are in the supplementary.
  • Figure 4: Ablation Study: Guidance Scale and DDIM Steps. Each row shows metric variations with corruption ratio for different generation parameters.
  • Figure 5: Visual Representation of Video Captions. The extracted frames depict the scene described by the original captions before corruption. The video illustrates a sales manager handing over car keys to a man seated in the driver’s seat. This serves as a reference to understand how different noise levels impact text descriptions of the same visual content.
  • ...and 8 more figures

Theorems & Definitions (63)

  • Definition A.1: Entropy Increment
  • Proposition A.2: Subspace Entropy Lower Bound
  • proof
  • Lemma A.3: Matrix Determinant Lemma
  • Theorem A.4: Directional Cost Reduction
  • proof
  • Lemma A.5: Local Score Drift
  • proof
  • Lemma A.6: Exact Recursion
  • proof
  • ...and 53 more