How Confident are Video Models? Empowering Video Models to Express their Uncertainty

Zhiting Mei, Ola Shorinwa, Anirudha Majumdar

TL;DR

The paper tackles the safety risk of hallucinations in text-conditioned video generation by introducing the first uncertainty quantification framework for generative video models. It proposes S-QUBED, a black-box approach that decomposes total predictive uncertainty into epistemic and aleatoric components via latent-space conditioning, and provides a calibration metric based on rank correlations. A von Mises-Fisher (vMF) latent modeling strategy, LLM-assisted latent prompt refinement, and a dedicated UQ dataset enable robust estimation and benchmarking. Empirical results on large video datasets demonstrate that S-QUBED yields calibrated uncertainty estimates that negatively correlate with task accuracy and successfully separates the two uncertainty types. This work advances trustworthy video generation by enabling models to express and quantify their uncertainty, with practical impact on safety-critical applications.

Abstract

Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.

Paper Structure

This paper contains 16 sections, 2 theorems, 14 equations, 5 figures.

Key Result

Proposition 1

Define the total predictive uncertainty in the output video as the differential entropy $h(V\mid\ell)$ of the distribution $f_{\theta}(V \mid \ell)$. Then, this quantity can be decomposed as:

$$h(V \mid \ell) = h(V \mid Z) + h(Z \mid \ell),$$

where $h(V \mid Z)$ represents the epistemic uncertainty in $v$, and $h(Z \mid \ell)$ the aleatoric uncertainty.
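
To make the aggregation concrete, the following is a minimal sketch that assumes the decomposition is additive, with the total equal to $h(V \mid Z) + h(Z \mid \ell)$, and approximates $h(V \mid Z)$ by averaging the entropy of the videos generated for each sampled latent prompt. The function names, the scalar toy "embeddings", and the 1-D Gaussian entropy estimator are illustrative assumptions, not the paper's estimator (the summary above mentions vMF-based latent modeling instead).

```python
import math
from statistics import pvariance


def gaussian_entropy_1d(samples, min_var=1e-12):
    """Differential entropy (in nats) of a 1-D Gaussian fitted to `samples`.

    Assumption: a Gaussian stand-in replaces the vMF model named in the
    source; `min_var` is a hypothetical floor that avoids log(0).
    """
    var = max(pvariance(samples), min_var)
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def total_uncertainty(latent_samples, video_samples_per_latent):
    """Return (aleatoric, epistemic, total) under the additive assumption.

    latent_samples: n scalar stand-ins for latent prompts sampled given l.
    video_samples_per_latent: n lists, each holding m scalar stand-ins for
        the videos generated from one latent prompt.
    """
    if len(latent_samples) != len(video_samples_per_latent):
        raise ValueError("need one group of video samples per latent prompt")

    # Aleatoric term: entropy of the distribution over latent prompts.
    aleatoric = gaussian_entropy_1d(latent_samples)

    # Epistemic term: conditional entropy, estimated as the mean of the
    # per-latent-prompt entropies over the generated videos.
    per_latent = [gaussian_entropy_1d(v) for v in video_samples_per_latent]
    epistemic = sum(per_latent) / len(per_latent)

    return aleatoric, epistemic, aleatoric + epistemic


if __name__ == "__main__":
    latents = [-1.0, 1.0]
    videos = [[-1.0, 1.0], [-2.0, 2.0]]
    print(total_uncertainty(latents, videos))
```

In a realistic setting the scalars would be replaced by embedding vectors and the entropy estimator by one suited to the chosen latent distribution; only the averaging and summation steps carry over unchanged.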

Figures (5)

  • Figure 1: Video models are unable to express their uncertainty, posing a critical limitation especially in tasks where they lack requisite knowledge. Here, the video model generates an inaccurate video (showing Albert Einstein), when prompted to generate a video of Jeff Einstein. To this end, we introduce a metric for evaluating the calibration of video models, a calibrated uncertainty quantification method (S-QUBED) which uses latent modeling to disentangle aleatoric and epistemic uncertainty, and a UQ dataset for benchmarking calibration.
  • Figure 2: S-QUBED architecture. Given a text prompt $\ell$, our goal is to quantify the uncertainty of the video generation model. We first generate $n$ latent prompts consistent with $\ell$ in line with the prompt refinement used by video models, modeling the aleatoric uncertainty as the entropy of the distribution over latent prompts. Then, for each latent prompt, we generate $m$ videos, modeling the epistemic uncertainty as the conditional entropy of the distribution over generated videos. Finally, aggregating the two types of uncertainties yields the total predictive uncertainty.
  • Figure 3: Calibration Metrics for Video Models. Top: We examine the statistical significance of the Kendall rank correlation between uncertainty and widely-used perceptual metrics. We find that the CLIP cosine similarity score provides the most significant correlation. Bottom: With the CLIP accuracy metric, we observe that low human-annotated uncertainty corresponds to smaller variance in the generated videos and greater accuracy with respect to the ground-truth video. As uncertainty increases, video prediction accuracy decreases.
  • Figure 4: Total Predictive Uncertainty for Video Models. We assess the calibration of the total predictive uncertainty computed by S-QUBED. Top: correlation between video prediction accuracy and total uncertainty for Panda-70M and VidGen-1M. We observe a statistically significant correlation between accuracy and uncertainty for both datasets, signified by the small $p$-values. Bottom: visualization of two samples from Panda-70M.
  • Figure 5: Disentangling Aleatoric and Epistemic Uncertainty for Video Models. We demonstrate the calibration of the aleatoric uncertainty estimates of S-QUBED in tasks with no epistemic uncertainty, showing statistically significant negative correlation. We do the same for epistemic uncertainty.
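
The calibration checks described in Figures 3 to 5 rest on the Kendall rank correlation between an uncertainty estimate and an accuracy score (such as CLIP cosine similarity). The sketch below computes Kendall's tau-b in plain Python together with a two-sided $p$-value from the usual normal approximation. The function name and the toy numbers are hypothetical, and the $p$-value formula assumes no ties and a reasonably large sample, so it is only indicative here; it is not necessarily the significance test the authors used.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b and an approximate two-sided p-value.

    The p-value uses the normal approximation for the no-ties case,
    Var(C - D) = n(n - 1)(2n + 5) / 18, and is an assumption of this
    sketch rather than an exact test.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")

    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both counts
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1

    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    tau = (concordant - discordant) / denom

    # Normal approximation to the null distribution of C - D (no ties).
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = (concordant - discordant) / math.sqrt(var_s)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return tau, p_value


if __name__ == "__main__":
    # Toy data: accuracy falls as uncertainty rises.
    uncertainty = [0.1, 0.2, 0.3, 0.4, 0.5]
    accuracy = [0.9, 0.8, 0.7, 0.6, 0.5]
    print(kendall_tau_b(uncertainty, accuracy))
```

A well-calibrated uncertainty estimate should give a negative tau with a small $p$-value, matching the negative correlation with accuracy reported in the figure captions.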

Theorems & Definitions (3)

  • Proposition 1: Uncertainty Decomposition
  • Proposition 1: Uncertainty Decomposition
  • proof