Hallucination Localization in Video Captioning
Shota Nakada, Kazuhiro Saito, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu, Masayoshi Kondo
TL;DR
This work tackles hallucinations in video captioning by introducing hallucination localization at the span level and presenting the HLVC-Dataset with 1,167 annotated video-caption pairs. It proposes an instruction-tuning baseline for VideoLLMs to output hallucinated spans, leveraging a three-stage data generation pipeline that injects synthetic errors. Experiments show that instruction-tuned models substantially outperform zero-shot baselines in both token- and span-level metrics, with span-level accuracy reaching $30.1\%$ on the HLVC test set and synthetic data significantly boosting performance in data-efficient regimes. The study provides a practical, fine-grained benchmarking framework and a scalable approach to improve honesty and interpretability in video captioning systems.
Abstract
We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e., individual words or phrases). This allows for a more detailed analysis of hallucinations than the existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video-caption pairs whose captions were generated by VideoLLMs. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.
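To make the span-level formulation concrete, the following is a minimal sketch and not the paper's evaluation code. It assumes that hallucinated spans are given as half-open character ranges over the caption, that tokens are whitespace-delimited words, and that a token counts as hallucinated when it overlaps any span. The function names (`find_span`, `token_labels`, `token_f1`, `span_exact_match`) and the example caption are hypothetical, and the exact metric definitions used for HLVC-Dataset may differ.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character range [start, end) in the caption


def find_span(caption: str, phrase: str) -> Span:
    """Return the character range of the first occurrence of `phrase`."""
    start = caption.index(phrase)
    return (start, start + len(phrase))


def token_labels(caption: str, spans: List[Span]) -> List[bool]:
    """Label each whitespace-delimited token True if it overlaps any span."""
    return [
        any(m.start() < end and start < m.end() for start, end in spans)
        for m in re.finditer(r"\S+", caption)
    ]


def token_f1(caption: str, gold: List[Span], pred: List[Span]) -> float:
    """Token-level F1 between gold and predicted hallucination labels."""
    g = token_labels(caption, gold)
    p = token_labels(caption, pred)
    tp = sum(a and b for a, b in zip(g, p))
    if tp == 0:
        return 0.0
    precision = tp / sum(p)
    recall = tp / sum(g)
    return 2 * precision * recall / (precision + recall)


def span_exact_match(gold: List[Span], pred: List[Span]) -> float:
    """Fraction of gold spans that the prediction reproduces exactly."""
    if not gold:
        return 1.0 if not pred else 0.0
    return len(set(gold) & set(pred)) / len(set(gold))


# Hypothetical example: the annotator marks "red shirt" and "guitar" as
# hallucinated, while the model marks only "red" and "guitar".
caption = "A man in a red shirt plays guitar on the beach."
gold = [find_span(caption, "red shirt"), find_span(caption, "guitar")]
pred = [find_span(caption, "red"), find_span(caption, "guitar")]

print(round(token_f1(caption, gold, pred), 2))  # 0.8
print(span_exact_match(gold, pred))             # 0.5
```

In this example the prediction recovers two of the three hallucinated tokens but only one of the two gold spans exactly, which illustrates why a span-level score can be much stricter than a token-level one.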
