VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analy

Shubhashis Roy Dipta, Tz-Ying Wu, Subarna Tripathi

TL;DR

VC-Inspector is a fact-grounded, reference-free framework for video caption evaluation. It fine-tunes a lightweight open-source LMM on data from a large synthetic pipeline that generates captions with controllable factual errors and accompanying explanations. The model outputs both a quality score and a natural-language explanation, which supports interpretability and caption refinement. Across synthetic, VATEX-Eval, Flickr8K, and YouCook2 datasets, VC-Inspector correlates highly with human judgments, often surpassing many reference-based and other reference-free metrics, while remaining efficient. The work provides a scalable, reproducible approach to evaluating captions of real-world videos and highlights the value of explanation-enabled supervision for caption quality and refinement.

Abstract

We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible, fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic approach for generating captions with controllable errors, paired with graded quality scores and explanatory annotations. Experiments show that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement.
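"Correlation with human judgments" is usually reported as a rank correlation between metric scores and human ratings over the same captions. The sketch below computes Kendall's tau-b in plain Python to illustrate the idea. The choice of tau-b and the example scores are assumptions for illustration; this section does not state which correlation coefficient the paper reports.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(human_scores, metric_scores):
    """Kendall's tau-b rank correlation between two equal-length score lists.

    Illustrative helper (not from the paper): counts concordant and
    discordant caption pairs and corrects for ties in either list.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")

    concordant = discordant = ties_only_x = ties_only_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(human_scores, metric_scores), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            ties_only_x += 1
        elif dy == 0:
            ties_only_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1

    untied = concordant + discordant
    denom = sqrt((untied + ties_only_x) * (untied + ties_only_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical ratings for five captions (made-up numbers).
human = [1, 2, 3, 4, 5]
metric = [2, 1, 4, 3, 5]
tau = kendall_tau_b(human, metric)  # 8 concordant, 2 discordant pairs
```

A value near 1 means the metric ranks captions in almost the same order as human raters; a value near 0 means the two rankings are unrelated.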

Paper Structure

This paper contains 36 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Existing reference-free metrics like EMScore (Shi et al., 2022) often fail to detect factual inaccuracies and lack a consistent scoring scale. VC-Inspector addresses these limitations by providing factually grounded, interpretable evaluations.
  • Figure 2: (left) We present a data generation pipeline designed to systematically create synthetic video captions with diverse quality scores, along with explanations for the assigned scores. (right) This dataset was subsequently used for instruction tuning the VC-Inspector.
  • Figure 3: Data generation pipeline to create a synthetic dataset for training VC-Inspector. While both "talking" and "holding" were identified as actions, only "holding" was sampled for replacement in the synthetic dataset.
  • Figure 4: Visual example on VATEX-Eval. VC-Inspector produces quality assessments consistent with ground truth scores, and factual error explanations (highlighted in red). More examples are in the appendix.
  • Figure 5: Visual examples from ActivityNet-FG-Eval (top) and VATEX-Eval (others). VC-Inspector produces quality assessments consistent with ground truth scores, and explanatory insights into factual errors (highlighted in red).
  • ...and 1 more figure
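The data generation idea in Figures 2 and 3 (identify facts such as objects and actions in a caption, sample some for replacement, and record a score with an explanation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function `perturb_caption`, the substitute table, and the linear 0-to-1 scoring rule are all hypothetical, and this section does not give the paper's actual scoring formula.

```python
import random


def perturb_caption(caption, facts, substitutes, num_errors, rng):
    """Replace `num_errors` sampled facts in `caption` with incorrect ones.

    Hypothetical sketch: `facts` lists the words identified as objects or
    actions, and `substitutes` maps each fact to candidate wrong replacements.
    Returns the altered caption, a quality score, and an explanation.
    """
    if not facts:
        raise ValueError("at least one fact is required")

    sampled = rng.sample(facts, num_errors)  # only some facts are altered
    words = caption.split()
    changes = []
    for fact in sampled:
        wrong = rng.choice(substitutes[fact])
        words = [wrong if word == fact else word for word in words]
        changes.append((fact, wrong))

    # Assumed scoring rule: the score drops linearly with the share of
    # facts that were made incorrect (1.0 means fully correct).
    score = 1.0 - num_errors / len(facts)
    if changes:
        explanation = "Incorrect: " + "; ".join(
            f"'{wrong}' should be '{fact}'" for fact, wrong in changes
        )
    else:
        explanation = "No factual errors."
    return " ".join(words), score, explanation


# Mirrors the Figure 3 scenario: two actions identified, one sampled.
original = "a man is talking while holding a guitar"
table = {"talking": ["singing"], "holding": ["throwing"]}  # made-up substitutes
new_caption, score, explanation = perturb_caption(
    original, ["talking", "holding"], table, 1, random.Random(0)
)
```

Varying `num_errors` yields captions of graded quality for the same video, and the recorded explanation gives the kind of supervision the captions above describe for instruction tuning.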