Hallucination Localization in Video Captioning

Shota Nakada, Kazuhiro Saito, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu, Masayoshi Kondo

TL;DR

This work tackles hallucinations in video captioning by introducing hallucination localization at the span level and presenting the HLVC-Dataset with 1,167 annotated video-caption pairs. It proposes an instruction-tuning baseline for VideoLLMs to output hallucinated spans, leveraging a three-stage data generation pipeline that injects synthetic errors. Experiments show that instruction-tuned models substantially outperform zero-shot baselines in both token- and span-level metrics, with span-level accuracy reaching $30.1\%$ on the HLVC test set and synthetic data significantly boosting performance in data-efficient regimes. The study provides a practical, fine-grained benchmarking framework and a scalable approach to improve honesty and interpretability in video captioning systems.

Abstract

We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e., individual words or phrases). This allows for a more detailed analysis of hallucinations than the existing sentence-level hallucination detection task. To establish a benchmark for hallucination localization, we construct HLVC-Dataset, a carefully curated dataset created by manually annotating 1,167 video-caption pairs from VideoLLM-generated captions. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.

Paper Structure

This paper contains 9 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Comparison between hallucination detection and hallucination localization. Given a video and its caption, hallucination detection classifies whether the caption contains hallucinated content. In contrast, our proposed hallucination localization identifies the text span responsible for the hallucination.
  • Figure 2: Overview of the instruction-tuning framework. The procedure is as follows: Step 1 generates seed captions using an existing VideoLLM. Step 2 automatically inserts errors into these seed captions using an LLM (LLaMA3.3). Step 3 formats the error-inserted captions as instruction data for VideoLLMs. Step 4 performs instruction tuning on VideoLLMs specifically for hallucination localization, enabling the tuned model to output the hallucinated spans in the input video captions.
  • Figure 3: Qualitative evaluation of hallucination localization. The first column shows the input video, the second the input caption, the third the model output in the zero-shot setting, and the fourth the model output produced with our instruction-tuned method. Spans highlighted in red within the input caption are hallucinated.
  • Figure 4: Comparison of human-annotated and synthetic instruction data. The horizontal axis shows the number of instruction-data samples, and the vertical axis shows the token-level $F_{0.5}$.
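
The token-level $F_{0.5}$ reported in Figure 4 can be illustrated with a minimal sketch. $F_{\beta}$ with $\beta = 0.5$ weights precision more heavily than recall. The sketch assumes that gold and predicted hallucinated spans have already been mapped to sets of token indices in the caption; the tokenization, the aggregation over the test set, and the function names `f_beta` and `token_level_scores` are illustrative assumptions, not details taken from the paper.

```python
from typing import Set, Tuple


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def token_level_scores(
    gold: Set[int], pred: Set[int], beta: float = 0.5
) -> Tuple[float, float, float]:
    """Precision, recall, and F-beta over hallucinated token indices.

    gold / pred hold the indices of caption tokens marked as hallucinated
    (assumed representation; the paper's exact protocol may differ).
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_beta(precision, recall, beta)


# Hypothetical caption: "a man in a red shirt plays guitar on stage"
# Token indices:         0 1   2  3 4   5     6     7      8  9
gold_tokens = {4, 5}     # annotated hallucinated span: "red shirt"
pred_tokens = {4, 5, 7}  # model output: "red shirt" and "guitar"

precision, recall, f_half = token_level_scores(gold_tokens, pred_tokens)
print(f"P={precision:.3f} R={recall:.3f} F0.5={f_half:.3f}")
```

In this hypothetical example the extra predicted token lowers precision to 2/3 while recall stays at 1, so $F_{0.5}$ comes out at 5/7; the same error would cost less under $F_1$, which is why $F_{0.5}$ penalizes over-flagging more strongly.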