VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs

Yiming Yang, Yangyang Guo, Hui Lu, Yan Wang

TL;DR

This work addresses language bias in video-involved LVLMs by introducing VidLBEval, a benchmark with Ambiguous Video Contrast (AVC) and Interrogative Question Probing (IQP) tasks to quantify grounded visual reasoning failures. It proposes Multi-branch Contrastive Decoding (MCD), a decoding-time mitigation using a weak expert and a video-enhanced strong expert to counteract language priors without retraining. Experiments show widespread bias across open-source and proprietary LVLMs, with MCD consistently reducing bias while preserving general capabilities on auxiliary benchmarks. The approach offers a practical, deployment-friendly path to more reliable video-grounded multimodal reasoning in LVLMs.

Abstract

Recently, Large Vision-Language Models (LVLMs) have made significant strides across diverse multimodal tasks and benchmarks. This paper reveals a largely under-explored problem in existing video-involved LVLMs: language bias, where models tend to prioritize language over video and thus produce incorrect responses. To address this research gap, we first collect a Video Language Bias Evaluation Benchmark, which is specifically designed to assess language bias in video-involved LVLMs through two key tasks: ambiguous video contrast and interrogative question probing. Accordingly, we design accompanying evaluation metrics that penalize LVLMs for being biased by language. In addition, we propose Multi-branch Contrastive Decoding (MCD), which introduces two expert branches to simultaneously counteract the language bias potentially generated by the amateur text-only branch. Our experiments demonstrate that i) existing video-involved LVLMs, both proprietary and open-source, are largely limited by the language bias problem; ii) our MCD can effectively mitigate this issue and maintain general-purpose capabilities in various video-involved LVLMs without any additional retraining or alteration to model architectures.
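
The abstract does not spell out the evaluation metrics, so the following is only an illustrative sketch of how a metric could penalize language bias in the ambiguous video contrast setting: a pair counts as correct only when both the original and the complementary video are answered correctly. The names `ContrastPair`, `per_video_accuracy`, and `paired_accuracy` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ContrastPair:
    """One query asked about an original video and its complementary video."""
    pred_original: str
    gold_original: str
    pred_complementary: str
    gold_complementary: str


def per_video_accuracy(pairs):
    """Plain accuracy: every (video, question) item is scored independently."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        (p.pred_original == p.gold_original)
        + (p.pred_complementary == p.gold_complementary)
        for p in pairs
    )
    return correct / (2 * len(pairs))


def paired_accuracy(pairs):
    """Stricter score (an assumption, not the paper's metric):
    a pair is correct only if both videos are answered correctly."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    both = sum(
        p.pred_original == p.gold_original
        and p.pred_complementary == p.gold_complementary
        for p in pairs
    )
    return both / len(pairs)


# Toy data: two of the three pairs repeat the same answer for both videos.
pairs = [
    ContrastPair("A", "A", "A", "B"),  # same answer twice, right only once
    ContrastPair("A", "A", "B", "B"),  # answer follows the video both times
    ContrastPair("C", "C", "C", "D"),  # same answer twice, right only once
]
print(per_video_accuracy(pairs), paired_accuracy(pairs))
```

On this toy data, a model that repeats the same answer regardless of the video still gets half of each pair right under plain accuracy, while the paired score gives it no credit for that pair.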

Paper Structure

This paper contains 16 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Examples of the two evaluation tasks in VidLBEval. (Left) Ambiguous Video Contrast: We collect a complementary video that is semantically similar to the original video but has a different answer. In the example, the LVLM gives the same answer to the same query for both videos. (Right) Interrogative Question Probing: The follow-up question requires a joint understanding of the video and text. The model tends to ignore the video context and reason from its LLM parametric knowledge, e.g., linking hoop with rubber float.
  • Figure 2: VidLBEval quality control pipeline. i) We first filter out questions that can be answered correctly without referring to the associated video by utilizing several LLMs such as Qwen2. ii) External tools, i.e., Perspective API and GPT-4o/4V, are then employed for further safety checks. iii) Finally, we conduct human verification to review the results, leading to 1,695 high-quality samples for our VidLBEval dataset.
  • Figure 3: Interplay between predictions on the original questions and predictions on the follow-up questions.
  • Figure 4: Architecture of our MCD. Two expert branches are introduced to simultaneously mitigate language bias from the amateur text-only branch: the weak expert, which retains the original model process, and the strong expert, which places more attention on video features.
  • Figure 5: Results on SEEDBench and MVBench when applying our proposed method to VideoLLaVA.
  • ...and 1 more figures
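
The branch layout in the Figure 4 caption can be illustrated with a generic contrastive-decoding sketch. This page does not reproduce the paper's equations, so the combination rule below is an assumption: it averages the log-probabilities of the two experts and subtracts a scaled copy of the amateur (text-only) log-probabilities. The function names and the parameters `alpha` and `beta` are hypothetical, and the logits are toy numbers, not model outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mcd_scores(weak_logits, strong_logits, amateur_logits, alpha=1.0, beta=0.5):
    """Assumed multi-branch contrast (not the paper's exact formula).

    beta  mixes the weak expert (beta) and the strong expert (1 - beta).
    alpha controls how strongly the text-only amateur is subtracted.
    """
    weak = log_softmax(weak_logits)
    strong = log_softmax(strong_logits)
    amateur = log_softmax(amateur_logits)
    return [
        (1 + alpha) * (beta * w + (1 - beta) * s) - alpha * a
        for w, s, a in zip(weak, strong, amateur)
    ]


def pick_token(scores):
    """Greedy choice: index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy two-token vocabulary: token 0 is favored by the language prior,
# token 1 is the answer supported by the video.
weak_logits = [2.0, 1.5]      # original model: slightly prefers the prior
strong_logits = [1.0, 2.0]    # video-enhanced branch: prefers token 1
amateur_logits = [3.0, 0.5]   # text-only branch: strongly prefers the prior

print(pick_token(weak_logits))
print(pick_token(mcd_scores(weak_logits, strong_logits, amateur_logits)))
```

In this toy setting the weak expert alone picks the prior-favored token, while the contrasted score picks the video-supported one. Setting `alpha=0` and `beta=1` recovers the weak expert's own ranking, which is one way to check that the contrast is the only thing changing the decision.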