Is Your Video Language Model a Reliable Judge?
Ming Liu, Wensheng Zhang
TL;DR
The paper examines the reliability of video language models (VLMs) as judges of VLM outputs, comparing single-model judgments, reference-guided LLM debates, and collective thought aggregation. Using the CVRR-ES dataset with $2{,}400$ QA pairs across $11$ visual dimensions, it quantifies agreement via weighted Cohen's kappa $\kappa$ and shows that weaker VLMs tend to overrate candidates, while a strong model like GPT-4o aligns more closely with reference-guided debates. Fine-tuning weaker judges yields only modest gains, indicating that reliability depends on deeper content understanding and evaluative reasoning rather than mere comprehension. The study highlights the limitations of indiscriminate collective aggregation and advocates reliability-aware evaluation methods for more robust automatic evaluation of multimodal models on real-world video content.
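To make the agreement metric concrete, the following is a minimal sketch of weighted Cohen's kappa between two raters. It assumes an ordinal 1 to 5 rating scale and offers linear or quadratic disagreement weights; neither the scale nor the weighting scheme is stated in this summary, and the function name `weighted_kappa` and the sample scores are hypothetical, not taken from the paper.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    `categories` lists the ordered rating levels (assumed here, e.g. 1..5).
    Returns NaN when chance disagreement is zero (kappa is undefined).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint distribution of (rater A, rater B) ratings.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal distributions of each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d * d

    observed_disagreement = sum(
        weight(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    expected_disagreement = sum(
        weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    if expected_disagreement == 0:
        return float("nan")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical scores: a reference rating and a judge that overrates.
scale = [1, 2, 3, 4, 5]
reference = [1, 2, 3, 4, 5, 2, 3]
lenient_judge = [3, 4, 5, 5, 5, 4, 4]
print(weighted_kappa(reference, lenient_judge, scale))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a judge that systematically inflates scores ends up well below 1 even when its ranking of candidates is roughly right.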
Abstract
As video language models (VLMs) are deployed in a growing range of applications, robust and scalable evaluation of their performance becomes increasingly critical. Traditional evaluation by human experts is limited in consistency and scalability, which has sparked interest in automatic methods such as employing VLMs to evaluate VLMs. However, the reliability of VLMs as judges remains underexplored. Existing methods often rely on a single VLM as the evaluator, an approach that can be unreliable or biased: such a model may be unable to fully understand the content and may carry inherent biases, ultimately compromising evaluation reliability. A remedy is to apply the principle of collective thoughts, aggregating evaluations from multiple VLMs to enhance reliability. This study investigates the efficacy of such approaches, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation. The inclusion of less reliable judges can introduce noise, undermining the overall reliability of the outcomes. To explore the factors that impact evaluation reliability, we fine-tune an underperforming VLM judge, Video-LLaVA, and observe that improved understanding ability alone is insufficient to make VLM judges more reliable. These findings underscore the limitations of collective thought approaches and highlight the need for more advanced methods that can account for the reliability of individual models. Our study promotes the development of more reliable evaluation methods for VLMs.
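The effect of a mixed judge pool can be illustrated with a small sketch. It assumes each judge assigns an integer score per item and that scores are combined by a simple mean or median; the abstract does not specify the aggregation rule, so this is an illustrative stand-in, and the judge names and scores are hypothetical.

```python
from statistics import mean, median


def aggregate_judgments(scores_by_judge, method="mean"):
    """Combine per-item scores from several judges into one score per item.

    `scores_by_judge` maps a judge name to its list of scores, one per item.
    `method` is "mean" or "median" (both are assumed aggregation rules).
    """
    judges = list(scores_by_judge)
    if not judges:
        raise ValueError("at least one judge is required")
    n_items = len(scores_by_judge[judges[0]])
    if any(len(scores_by_judge[j]) != n_items for j in judges):
        raise ValueError("all judges must score the same number of items")
    reducers = {"mean": mean, "median": median}
    if method not in reducers:
        raise ValueError("method must be 'mean' or 'median'")
    reduce = reducers[method]
    # Aggregate item by item across all judges.
    return [
        reduce([scores_by_judge[j][i] for j in judges]) for i in range(n_items)
    ]


# Hypothetical pool: two judges that track answer quality and one that
# gives nearly every answer the top score.
pool = {
    "judge_a": [1, 2, 5],
    "judge_b": [1, 3, 5],
    "judge_overrater": [5, 5, 5],
}
print(aggregate_judgments(pool, method="mean"))
print(aggregate_judgments(pool, method="median"))
```

In this toy pool the overrating judge pulls the mean score for the weakest answer well above what the other two judges assigned, which is the kind of noise the abstract attributes to less reliable judges. A median is less sensitive here, but it only helps while unreliable judges remain a minority.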
