Can Hallucination Correction Improve Video-Language Alignment?

Lingjun Zhao, Mingyang Xie, Paola Cascante-Bonilla, Hal Daumé, Kwonjoon Lee

TL;DR

This work tackles the problem of grounding video-language models by reframing hallucinations as a training signal. It introduces HACA, a self-training framework in which a Video-LLM learns to judge whether a video entails a caption and, when it does not, to generate a corrected caption; a masking-correction augmentation complements this objective. Empirical results on VELOCITI and SSv2 show that HACA yields consistent gains in text-to-video retrieval and video-caption binding while preserving zero-shot QA capabilities. The approach relies solely on ground-truth video descriptions, offering a practical path to improving spatio-temporal understanding in Video-LLMs without external annotation pipelines.

Abstract

Large Vision-Language Models often generate hallucinated content that is not grounded in their visual inputs. While prior work focuses on mitigating hallucinations, we instead explore leveraging hallucination correction as a training objective to improve video-language alignment. We introduce HACA, a self-training framework that learns to correct hallucinations in descriptions that do not align with the video content. By identifying and correcting inconsistencies, HACA enhances the model's ability to align video and textual representations for spatio-temporal reasoning. Our experimental results show consistent gains in video-caption binding and text-to-video retrieval tasks, demonstrating that hallucination correction-inspired tasks serve as an effective strategy for improving vision and language alignment.

Paper Structure

This paper contains 36 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Models are tasked with determining whether a given video entails a caption, where the contrast caption closely resembles the correct one. HACA effectively differentiates between the correct caption (top) and the incorrect one (bottom), and corrects the hallucination in the latter. In contrast, Video-LLaVA fails to distinguish between the two captions or to correct the hallucination.
  • Figure 2: Examples of the different fine-tuning objectives. The first column shows an example of the baseline entailment task. The second column shows an example of our proposed HACA task, where we fine-tune the model to output a hallucination correction that justifies its response. The third column shows an example of the masking-correction task, where we input a masked version of the video description and fine-tune the model to predict the corrected one.
  • Figure 3: Mean Average Precision (mAP) scores for pretrained Video-LLaVA and models fine-tuned using various methods on zero-shot text-to-video retrieval tasks.
  • Figure 4: Success on binding and correction: HACA effectively assigns higher entailment probability $P_{yes}$ to the correct caption (top) than the incorrect one (bottom), unlike the entailment-finetuned model. HACA also accurately corrects the incorrect caption in its output.
  • Figure 5: Success and failure cases of HACA and the other models on the VELOCITI dataset. Red text indicates an incorrect description.
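
The binding comparison described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes that $P_{yes}$ is obtained by renormalizing the model's logits for the answers "Yes" and "No", and that a binding example counts as a success when the correct caption receives a strictly higher $P_{yes}$ than the contrast caption. Both are assumptions made for illustration; the function names `p_yes` and `binding_accuracy` are hypothetical and do not come from the paper.

```python
import math


def p_yes(yes_logit: float, no_logit: float) -> float:
    """Entailment probability, assuming a softmax over the 'Yes'/'No' logits only."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def binding_accuracy(pairs):
    """Fraction of pairs where the correct caption gets a higher P_yes.

    Each element of `pairs` is (correct, contrast), where each side is a
    (yes_logit, no_logit) tuple scored against the same video.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(p_yes(*correct) > p_yes(*contrast) for correct, contrast in pairs)
    return hits / len(pairs)


# Hypothetical logits: the first pair is bound correctly, the second is not.
example_pairs = [
    ((2.0, 0.0), (0.0, 1.0)),
    ((0.0, 1.0), (1.0, 0.0)),
]
accuracy = binding_accuracy(example_pairs)
```

Comparing the two captions for the same video, rather than thresholding each $P_{yes}$ on its own, keeps the measure independent of how well calibrated the model's absolute probabilities are.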