Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding

Youze Wang; Zijun Chen; Ruoyu Chen; Shishen Gu; Wenbo Hu; Jiayang Liu; Yinpeng Dong; Hang Su; Jun Zhu; Meng Wang; Richang Hong

TL;DR

Trust-videoLLMs presents a comprehensive benchmark and toolkit to evaluate trustworthiness in videoLLMs across truthfulness, robustness, safety, fairness, and privacy using 30 tasks and a large, diverse video dataset. The study analyzes 23 models (open- and closed-source) and finds that proprietary models usually excel overall but still struggle with temporal reasoning, safety under multimodal perturbations, and bias mitigation, while certain open-source models show competitive performance in specific tasks due to architectural innovations. A public, extensible toolbox enables standardized, scalable evaluation to close the gap between accuracy-centric benchmarks and safety/robustness demands. The findings highlight the need for diverse training data, robust multimodal alignment, and targeted safety and fairness mechanisms to deploy videoLLMs reliably in real-world settings.

Abstract

Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency, and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. Although open-source models occasionally outperform proprietary ones, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.
