FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning

Shiyu Hu; Xuchen Li; Xuzhao Li; Jing Zhang; Yipei Wang; Xin Zhao; Kang Hao Cheong

TL;DR

FIOVA introduces a cognitively aligned benchmark for long-video captioning by collecting multi-annotator descriptions (Five-In-One Video Annotations) and synthesizing a unified ground truth via GPT. It adds FIOVA-DQ, a cognitively weighted event-level metric, and a three-tier evaluation framework (lexical, event-based AutoDQ, and cognitive FIOVA-DQ) to diagnose LVLM alignment with human perception. The study benchmarks nine LVLMs, analyzes inter-annotator variability with CV, and examines performance on a challenging FIOVA_hard subset, revealing persistent coverage gaps and narrative coherence issues. Collectively, FIOVA provides a diagnostic tool and evaluation standard to guide the development of more human-aligned, temporally coherent long-video understanding models.

Abstract

Despite rapid progress in large vision-language models (LVLMs), existing video caption benchmarks remain limited in evaluating their alignment with human understanding. Most rely on a single annotation per video and lexical similarity-based metrics, failing to capture the variability in human perception and the cognitive importance of events. These limitations hinder accurate diagnosis of model capabilities in producing coherent, complete, and human-aligned descriptions. To address this, we introduce FIOVA (Five-In-One Video Annotations), a human-centric benchmark tailored for evaluation. It comprises 3,002 real-world videos (about 33.6s each), each annotated independently by five annotators. This design enables modeling of semantic diversity and inter-subjective agreement, offering a richer foundation for measuring human-machine alignment. We further propose FIOVA-DQ, an event-level evaluation metric that incorporates cognitive weights derived from annotator consensus, providing fine-grained assessment of event relevance and semantic coverage. Leveraging FIOVA, we conduct a comprehensive evaluation of nine representative LVLMs and introduce a complexity-aware analysis framework based on inter-annotator variation (CV). This reveals consistency gaps across difficulty levels and identifies structural issues such as event under-description and template convergence. Our results highlight FIOVA's diagnostic value for understanding LVLM behavior under varying complexity, setting a new standard for cognitively aligned evaluation in long-video captioning. The benchmark, annotations, metric, and model outputs are publicly released to support future evaluation-driven research in video understanding. More detailed information can be found at https://huuuuusy.github.io/fiova/.
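
To make the inter-annotator variation (CV) idea concrete, the following is a minimal sketch that assumes CV denotes the coefficient of variation (standard deviation divided by mean) computed over one score per annotator. The abstract does not state which per-annotator quantity is used, so caption word count serves here as a purely hypothetical stand-in, and the function name and sample values are illustrative only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Return population std / mean for a list of per-annotator scores."""
    mu = mean(scores)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(scores) / mu


# Hypothetical word counts of the five annotators' captions for one video
# (illustrative values, not taken from the FIOVA dataset).
caption_lengths = [40, 50, 60, 50, 50]
cv = coefficient_of_variation(caption_lengths)
print(f"CV = {cv:.4f}")
```

Under this reading, a higher CV would indicate that the five annotators disagree more on a video, which is one plausible way to group videos by difficulty.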
