VirtueBench: Evaluating Trustworthiness under Uncertainty in Long Video Understanding

Xueqing Yu, Bohan Li, Yan Li, Zhenheng Yang

TL;DR

This paper introduces VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. Evaluations reveal distinct refusal behaviors across model families, with refusal accuracy ranging from over 70% in the best models to nearly 0% in the worst.

Abstract

Recent Vision-Language Models (VLMs) have made remarkable progress in multimodal understanding tasks, yet their evaluation on long video understanding remains unreliable. Because models receive only a limited number of input frames, the key frames needed to answer a question may be missing from the model's input. Models that truthfully refuse to answer under such uncertainty are marked as incorrect, while those that guess may coincidentally produce the correct answer and thus obtain deceptively higher accuracy. The result is misleading evaluation that encourages models to guess rather than respond honestly. To address this issue, we introduce VirtueBench, a benchmark explicitly designed to assess model trustworthiness under uncertainty. VirtueBench constructs multiple frame-sampling levels for each video and provides ground truths that distinguish between answerable and unanswerable cases. Evaluations on 25 open-source and commercial VLMs reveal distinct refusal behaviors across different model families, with refusal accuracy ranging from over 70% in the best models to nearly 0% in the worst. Moreover, most models refuse substantially less often when the prompt does not explicitly ask them to do so. These findings highlight the need for developing trustworthy VLMs for multimodal understanding, guided by benchmarks and leaderboards that emphasize reliability and trustworthiness.
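
The scoring problem described in the abstract can be made concrete with a small sketch. It scores answerable and unanswerable cases separately, so an honest refusal counts as correct when the ground truth says the clip lacks the needed evidence. The metric definition used here (refusal accuracy as the fraction of unanswerable cases on which the model refuses) is an assumption, since this page does not state the paper's exact formula. The names `Item`, `is_refusal`, and `score` are hypothetical, and exact string matching stands in for whatever answer-matching procedure the benchmark actually uses.

```python
from dataclasses import dataclass

# Refusal label quoted from the Figure 2 caption.
REFUSAL = "The video does not provide enough information."


@dataclass
class Item:
    ground_truth: str  # reference answer, or REFUSAL when key frames are missing
    prediction: str    # model output (assumed to be normalized already)


def is_refusal(text: str) -> bool:
    # Exact match is a simplification; a real judge would be more lenient.
    return text.strip().lower() == REFUSAL.lower()


def score(items):
    """Return refusal accuracy and answer accuracy as separate numbers."""
    unanswerable = [it for it in items if is_refusal(it.ground_truth)]
    answerable = [it for it in items if not is_refusal(it.ground_truth)]

    # Unanswerable cases: a refusal is the correct response.
    refusal_hits = sum(is_refusal(it.prediction) for it in unanswerable)
    # Answerable cases: the prediction must match the reference answer.
    answer_hits = sum(
        it.prediction.strip().lower() == it.ground_truth.strip().lower()
        for it in answerable
    )

    return {
        "refusal_accuracy": refusal_hits / len(unanswerable) if unanswerable else None,
        "answer_accuracy": answer_hits / len(answerable) if answerable else None,
    }
```

Under this split, a model that always guesses can still do well on answerable cases but scores zero on refusal accuracy, which is the behavior a single pooled accuracy number hides.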

Paper Structure (21 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Evaluation example on VideoEval-Pro [ma2025videoeval], which consists of open-ended questions collected from major long video benchmarks. The example illustrates a common case where the key frame required to answer the question is missing due to limited frame sampling. As a result, Qwen2.5-VL-72B truthfully refuses to answer and is marked incorrect, whereas LLaVA-Video-72B guesses the correct answer without seeing the necessary evidence, leading to a deceptively higher accuracy. This demonstrates that current long video benchmarks may unintentionally penalize models that honestly refuse under uncertainty, making their evaluation results unreliable.
  • Figure 2: The overall visualization of VirtueBench. The original video is sampled at different frame levels, and corresponding answers are annotated for each sampled clip. In some cases, the key frames necessary to answer the question are not included in the sampled clips, and the answer is thus labeled as “The video does not provide enough information.”
  • Figure 3: Question type distribution (top level and fine-grained type).
  • Figure 4: Frame-level and instance-level distributions of VirtueBench. For the instance-level, each instance corresponds to five clips sampled from the source video (64 to 1024 frames). A value of 5 indicates that all five clips are unanswerable, while 0 indicates that all five are answerable.
  • Figure 5: Comparison of model behaviors under different frame levels.
  • ...and 1 more figures
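
The instance-level statistic described in the Figure 4 caption can be sketched as a simple count. Each instance has five clips sampled from the same source video, and the value is the number of those clips labeled unanswerable. The caption gives only the range (64 to 1024 frames) and the number of clips, so the doubling schedule in `FRAME_LEVELS` is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Assumed sampling levels: five levels doubling from 64 to 1024 frames.
FRAME_LEVELS = (64, 128, 256, 512, 1024)


def unanswerable_count(answerable_by_level: dict[int, bool]) -> int:
    """Count the clips of one instance that are labeled unanswerable.

    `answerable_by_level` maps a frame level to True when the sampled
    clip contains the key frames needed to answer the question.
    """
    return sum(not answerable_by_level[level] for level in FRAME_LEVELS)


def instance_distribution(instances) -> Counter:
    """Histogram over instances of the 0..5 unanswerable-clip count."""
    return Counter(unanswerable_count(labels) for labels in instances)
```

A count of 5 means the question cannot be answered at any sampling level, while 0 means every clip contains the necessary evidence.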