VMDT: Decoding the Trustworthiness of Video Foundation Models

Yujin Potter; Zhun Wang; Nicholas Crispino; Kyle Montgomery; Alexander Xiong; Ethan Y. Chang; Francesco Pinto; Yuqi Chen; Rahul Gupta; Morteza Ziyadi; Christos Christodoulopoulos; Bo Li; Chenguang Wang; Dawn Song

TL;DR

VMDT introduces the first unified benchmark for the trustworthiness of video foundation models (VFMs), covering safety, hallucination, fairness, privacy, and adversarial robustness for both text-to-video (T2V) and video-to-text (V2T) models. An evaluation of 7 T2V and 19 V2T models reveals persistent safety gaps, substantial hallucination and bias, privacy risks that grow with model size, and vulnerability to adversarial inputs, with notable differences between open- and closed-source models. The framework provides datasets, evaluation protocols, and cross-perspective analyses, enabling systematic tracking of progress and targeted improvements in VFMs. Overall, the findings underscore the need for stronger safety alignment, fairness-aware training, privacy protections, and adversarial defenses.

Abstract

As foundation models become more sophisticated, ensuring their trustworthiness becomes increasingly critical; yet, unlike text and image, the video modality still lacks comprehensive trustworthiness benchmarks. We introduce VMDT (Video-Modal DecodingTrust), the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models across five key trustworthiness dimensions: safety, hallucination, fairness, privacy, and adversarial robustness. Through our extensive evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights. For instance, all open-source T2V models evaluated fail to recognize harmful queries and often generate harmful videos, while exhibiting higher levels of unfairness compared to image modality models. In V2T models, unfairness and privacy risks rise with scale, whereas hallucination and adversarial robustness improve -- though overall performance remains low. Uniquely, safety shows no correlation with model size, implying that factors other than scale govern current safety levels. Our findings highlight the urgent need for developing more robust and trustworthy video foundation models, and VMDT provides a systematic framework for measuring and tracking progress toward this goal. The code is available at https://sunblaze-ucb.github.io/VMDT-page/.
