Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

TL;DR

This work tackles object and action hallucinations in multimodal video captioning by introducing SANTA, a Self-Augmented Contrastive Alignment framework. SANTA combines hallucinative self-augmentation to generate targeted negatives with a tracklet-phrase contrastive mechanism that grounds regional objects and relation-guided actions to visual and temporal phrases. The approach jointly optimizes video-level grounding and region/relationship alignment, significantly reducing hallucinations across multiple benchmarks (MiraData-9k, FactVC, VidHal) and improving Dream1k captioning and video QA. Extensive ablations demonstrate the robustness and component contributions, establishing SANTA as a strong solution for faithful video-language grounding in MLLMs.

Abstract

Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that enables object and action faithfulness by removing spurious correlations and enforcing an emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions into contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on the hallucination examination benchmarks.
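The contrastive idea in the abstract, pulling a video toward its faithful caption and pushing it away from a hallucinated negative, can be illustrated with a generic InfoNCE-style objective. This is only a minimal sketch: the exact loss used by SANTA is not given in this section, and the function names, the temperature value, and the toy 2-D embeddings below are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(video, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's formula).

    The loss is small when the video embedding is closer to the faithful
    caption than to every hallucinated negative.
    """
    pos = math.exp(cosine(video, positive) / temperature)
    neg = sum(math.exp(cosine(video, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings (hypothetical values chosen for illustration only).
video = [1.0, 0.0]
faithful = [0.9, 0.1]       # caption consistent with the video
hallucinated = [0.1, 0.9]   # caption with a swapped object or action

aligned = contrastive_loss(video, faithful, [hallucinated])
misaligned = contrastive_loss(video, hallucinated, [faithful])
print(aligned < misaligned)
```

The same form would apply at a finer granularity by replacing the video embedding with an object or action tracklet embedding and the captions with visual or temporal phrases, though how SANTA actually builds those pairs is not specified here.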

Paper Structure

This paper contains 43 sections, 12 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Compared to existing MLLMs suffering from object and action hallucinations, our SANTA enhances faithfulness in describing both visual objects and temporal actions.
  • Figure 2: Overview of SANTA. We (a) mitigate video-level hallucination by applying Hallucinative Self-Augmentation to identify the tokens in MLLM $\theta_M$ with high hallucination potential that deviate from ground-truth words (e.g., synonyms or hypernyms) and then performing video-caption contrastive alignment. SANTA then (b) mitigates object- and action-level hallucinations via Tracklet-Phrase Contrastive Alignment, which aligns object and action tracklets with visual and temporal phrases while contrasting hallucinative negatives.
  • Figure 3: t-SNE visualization of the latent features of (a) video and caption, (b) object tracklets and phrases, and (c) action tracklets and phrases. For the w/o SANTA setting, we directly visualize features from LLaVA-Video. After training with SANTA, LLaVA-Video improves the alignment between the visual and language modalities while moving away from the hallucinative captions.
  • Figure 4: Qualitative comparison of video captions predicted by HACL and SANTA. Words highlighted in green indicate action faithfulness, while those in red indicate action hallucination. Similarly, words in blue represent object faithfulness, whereas those in orange denote object hallucination. Examples (a) and (b) are sampled from MiraData-9k and FactVC, respectively.
  • Figure 5: Ablation study of the hallucinative self-augmentation scheme. We ablate this scheme by replacing it with negatives generated directly by a text-only LLM (i.e., GPT-4), following the prompting setup of HACL. We report $\text{F1}_{\text{Obj}}$ and $\text{F1}_{\text{Act}}$ of HalFscore to evaluate the effectiveness of mitigating object and action hallucinations, respectively.
  • ...and 3 more figures