VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis
Shubhashis Roy Dipta, Tz-Ying Wu, Subarna Tripathi
TL;DR
VC-Inspector is a fact-grounded, reference-free framework for video caption evaluation. It fine-tunes a lightweight, open-source large multimodal model (LMM) on data from a large synthetic pipeline that generates captions with controllable factual errors and accompanying explanations. The model outputs both a quality score and a natural-language explanation, which makes its judgments interpretable and usable for caption refinement. Across synthetic, VATEX-Eval, Flickr8K, and YouCook2 datasets, VC-Inspector achieves high correlation with human judgments, often surpassing reference-based and other reference-free metrics while remaining efficient. The work provides a scalable, reproducible approach to evaluating captions of real-world videos and highlights the value of explanation-enabled supervision for caption quality and refinement.
Abstract
We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible, fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic approach for generating captions with controllable errors, paired with graded quality scores and explanatory annotations. Experiments show that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement.
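The idea of generating captions with controllable errors, paired with graded quality scores and explanations, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `perturb_caption`, the hand-written substitution table, the 1 to 5 score scale, and the linear mapping from the fraction of unchanged facts to a score. The actual pipeline may select and replace facts quite differently.

```python
import random
from dataclasses import dataclass


@dataclass
class PerturbedCaption:
    caption: str       # caption after error injection
    score: int         # graded quality on an assumed 1..5 scale
    explanation: str   # natural-language description of the injected errors


def perturb_caption(caption, substitutes, num_errors, rng):
    """Inject `num_errors` factual errors into `caption`.

    `substitutes` is a hypothetical table mapping each fact word (object or
    action) in the caption to a list of incorrect alternatives.
    """
    facts = sorted(substitutes)  # sorted for reproducible sampling
    if not facts:
        raise ValueError("at least one fact word is required")
    if not 0 <= num_errors <= len(facts):
        raise ValueError("num_errors must be between 0 and the number of facts")

    # Pick which facts to corrupt and what to replace each one with.
    replacements = {
        word: rng.choice(substitutes[word])
        for word in rng.sample(facts, num_errors)
    }

    # Replace whole tokens only, so substrings of other words stay intact.
    tokens = [replacements.get(tok, tok) for tok in caption.split()]

    # Assumed scoring rule: linear in the fraction of facts left correct.
    fraction_correct = 1 - num_errors / len(facts)
    score = 1 + round(4 * fraction_correct)

    if replacements:
        explanation = "; ".join(
            f"'{old}' was replaced with '{new}'"
            for old, new in replacements.items()
        )
    else:
        explanation = "No factual errors."
    return PerturbedCaption(" ".join(tokens), score, explanation)


if __name__ == "__main__":
    table = {
        "man": ["woman"],
        "ball": ["frisbee"],
        "dog": ["cat"],
        "park": ["kitchen"],
    }
    sample = perturb_caption(
        "a man throws a ball to a dog in a park", table, 2, random.Random(0)
    )
    print(sample.caption, sample.score, sample.explanation, sep="\n")
```

In this sketch, the number of injected errors is the control knob: more replacements yield a lower score and a longer explanation, which is the kind of (caption, score, explanation) triple an evaluator model could be trained on.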
