FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning

Shiyu Hu, Xuchen Li, Xuzhao Li, Jing Zhang, Yipei Wang, Xin Zhao, Kang Hao Cheong

TL;DR

FIOVA introduces a cognitively aligned benchmark for long-video captioning by collecting multi-annotator descriptions (Five-In-One Video Annotations) and synthesizing a unified ground truth via GPT. It adds FIOVA-DQ, a cognitively weighted event-level metric, and a three-tier evaluation framework (lexical, event-based AutoDQ, and cognitive FIOVA-DQ) to diagnose LVLM alignment with human perception. The study benchmarks nine LVLMs, analyzes inter-annotator variability with the coefficient of variation (CV), and examines performance on a challenging FIOVA_hard subset, revealing persistent coverage gaps and narrative coherence issues. Collectively, FIOVA provides a diagnostic tool and evaluation standard to guide the development of more human-aligned, temporally coherent long-video understanding models.

Abstract

Despite rapid progress in large vision-language models (LVLMs), existing video caption benchmarks remain limited in evaluating their alignment with human understanding. Most rely on a single annotation per video and lexical similarity-based metrics, failing to capture the variability in human perception and the cognitive importance of events. These limitations hinder accurate diagnosis of model capabilities in producing coherent, complete, and human-aligned descriptions. To address this, we introduce FIOVA (Five-In-One Video Annotations), a human-centric benchmark tailored for evaluation. It comprises 3,002 real-world videos (about 33.6s each), each annotated independently by five annotators. This design enables modeling of semantic diversity and inter-subjective agreement, offering a richer foundation for measuring human-machine alignment. We further propose FIOVA-DQ, an event-level evaluation metric that incorporates cognitive weights derived from annotator consensus, providing fine-grained assessment of event relevance and semantic coverage. Leveraging FIOVA, we conduct a comprehensive evaluation of nine representative LVLMs and introduce a complexity-aware analysis framework based on inter-annotator variation (CV). This reveals consistency gaps across difficulty levels and identifies structural issues such as event under-description and template convergence. Our results highlight FIOVA's diagnostic value for understanding LVLM behavior under varying complexity, setting a new standard for cognitively aligned evaluation in long-video captioning. The benchmark, annotations, metric, and model outputs are publicly released to support future evaluation-driven research in video understanding. More detailed information can be found at https://huuuuusy.github.io/fiova/.
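The complexity-aware analysis above relies on inter-annotator variation measured with the coefficient of variation (CV), which is the standard deviation divided by the mean. The following is a minimal sketch of that statistic, not the authors' implementation: the function name, the video names, the score values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Return std / mean for one video's annotator scores.

    Uses the population standard deviation (an assumption; the paper's
    exact choice is not stated in this section).
    """
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(scores) / m


# Hypothetical scores from five annotators for two videos.
videos = {
    "video_a": [8, 8, 8, 8, 8],    # full agreement
    "video_b": [2, 4, 6, 8, 10],   # strong disagreement
}

# CV per video, then videos ordered from most to least annotator disagreement.
cv_by_video = {name: coefficient_of_variation(s) for name, s in videos.items()}
ranked = sorted(cv_by_video, key=cv_by_video.get, reverse=True)
```

Under these assumptions, a video with a higher CV is one where the five annotators disagree more, which is the kind of signal a difficulty-based split could be built on.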

Paper Structure

This paper contains 69 sections, 23 figures, 24 tables, and 3 algorithms.

Figures (23)

  • Figure 1: Overview of FIOVA. The workflow comprises three steps: (i) dataset construction (Sec. \ref{step1}), (ii) LVLM response collection (Sec. \ref{step2}), and (iii) fine-grained evaluation and analysis (Sec. \ref{step3}). Together, these steps form FIOVA for systematically comparing human and LVLM video understanding.
  • Figure 2: Statistical overview of FIOVA: (a) Average frame count per video across thematic categories (see Tab. \ref{tab:fiova-theme}); (b) Distribution of annotation lengths among five annotators; (c) Correlation between annotation length and video duration; (d) Word cloud based on GPT-synthesized groundtruth captions.
  • Figure 3: Distribution of human annotation scores across five dimensions (a-e) and variation among annotators (f), measured using the coefficient of variation (CV).
  • Figure 4: An example of FIOVA and the calculation process of FIOVA-DQ (see Fig. \ref{fig:acc15}).
  • Figure 5: The distribution of response length.
  • ...and 18 more figures