VMDT: Decoding the Trustworthiness of Video Foundation Models

Yujin Potter, Zhun Wang, Nicholas Crispino, Kyle Montgomery, Alexander Xiong, Ethan Y. Chang, Francesco Pinto, Yuqi Chen, Rahul Gupta, Morteza Ziyadi, Christos Christodoulopoulos, Bo Li, Chenguang Wang, Dawn Song

TL;DR

VMDT introduces the first unified benchmark for the trustworthiness of video foundation models (VFMs), covering safety, hallucination, fairness, privacy, and adversarial robustness for both text-to-video (T2V) and video-to-text (V2T) models. Evaluations of 7 T2V and 19 V2T models reveal persistent safety gaps, substantial hallucination and bias, privacy risks that scale with model size, and vulnerability to adversarial inputs, with notable differences between open- and closed-source models. The framework provides datasets, evaluation protocols, and cross-perspective analyses, enabling systematic tracking of progress and targeted improvements in VFMs. Overall, the findings underscore the urgent need for robust safety alignment, fairness-aware training, privacy protections, and adversarial defenses.

Abstract

As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve -- though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at https://sunblaze-ucb.github.io/VMDT-page/.

Paper Structure

This paper contains 181 sections, 12 equations, 28 figures, 57 tables.

Figures (28)

  • Figure 1: Examples of untrustworthy model responses for each perspective
  • Figure 2: Average harmful content generation rate (HGR) for evaluating the safety of V2T models. Different model families are represented by distinct colors.
  • Figure 3: Average accuracy of V2T models over all hallucination scenarios as a function of model size. Different model families are represented by distinct colors. Within model families, performance tends to increase as model size increases.
  • Figure 4: Age stereotype score of V2T models by size. Scores range from $-1$ to $1$: positive values indicate stereotypes associating older people with higher socioeconomic status, negative values indicate stereotypes associating younger people with higher status, and $0$ represents perfect fairness. Larger models show stronger stereotypical associations between older age and higher socioeconomic status. Model families are distinguished by color.
  • Figure 5: Scatter plot of location inference accuracy versus model size. The trend suggests that larger models generally identify specific locations more accurately, indicating elevated privacy risks.
  • ...and 23 more figures