PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

Cong Chen, Mingyu Liu, Chenchen Jing, Yizhou Zhou, Fengyun Rao, Hao Chen, Bo Zhang, Chunhua Shen

TL;DR

The paper tackles multimodal hallucinations in dense image captioning by introducing HalFscore, a graph-based, concept-level metric that captures accuracy and completeness of captions. It also identifies language priors as a root cause and proposes PerturboLLaVA, a training framework that inserts adversarial perturbations into text inputs to force models to ground predictions in visual content, avoiding extra inference cost. Experimental results on LLaVA1.5 show HalFscore improvements and better performance across CHAIR, HallusionBench, and general multimodal benchmarks, with analyses supporting robustness to perturbation variations. The work offers a scalable, cost-efficient approach that can complement decoding strategies and potentially establish a new standard for evaluating and mitigating multimodal hallucinations.

Abstract

This paper addresses the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we first identify the lack of a metric that finely measures caption quality at the concept level. We therefore introduce HalFscore, a novel metric built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
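
The perturbative training idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_perturbed_sample`, the dictionary layout, and the `<image>` placeholder token are assumptions made for illustration only. The only point it shows is that misleading text is added to the text input while the training target stays the image-grounded answer.

```python
from typing import Dict


def build_perturbed_sample(question: str, answer: str, perturbation: str) -> Dict[str, str]:
    """Build one training sample whose text input carries a misleading passage.

    Hypothetical layout: the perturbation text is placed before the question,
    and the supervision target is left untouched, so the model can only fit
    the target by relying on the image rather than on the (misleading) text.
    """
    # "<image>" is an assumed placeholder for where visual tokens are inserted.
    prompt = f"<image>\n{perturbation}\n{question}"
    return {"prompt": prompt, "target": answer}


sample = build_perturbed_sample(
    question="Describe the image in detail.",
    answer="A dog lies on the grass next to a red ball.",
    perturbation="Dogs in parks are usually seen catching frisbees.",
)
```

Because the perturbation is applied only to training inputs in this sketch, inference uses the ordinary prompt, which is consistent with the abstract's claim of no additional computational overhead at inference time.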

Paper Structure

This paper contains 47 sections, 7 equations, 17 figures, 12 tables.

Figures (17)

  • Figure 1: The multimodal model is prone to hallucinating text due to its inherent language bias. Here, the hallucinated text is induced by the preceding concepts generated by the language model.
  • Figure 2: Comparison against state-of-the-art methods. Hallucinations are highlighted in red, whereas detailed image descriptions are shown in blue. The proposed PerturboLLaVA describes rich image details more accurately.
  • Figure 3: The diagram of computing HalFscore. We construct the language graph to model both the concepts and their relationships for captions. We can then compare the graphs and identify the hallucinations, omissions, and matches between the two sets of concepts.
  • Figure 4: Graph construction. We extract triplets from the caption and build the graph accordingly.
  • Figure 5: The generation of perturbation text and perturbative visual training of PerturboLLaVA.
  • ...and 12 more figures
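
The captions of Figures 3 and 4 outline a pipeline: extract triplets from each caption, build a graph, and compare the two graphs to find hallucinations, omissions, and matches. The sketch below is a rough illustration under stated assumptions, not the paper's definition of HalFscore: it assumes triplets are already extracted, matches graph elements by exact string equality, and combines precision and recall with a standard harmonic-mean F-score. The names `graph_items` and `compare_graphs` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def graph_items(triplets: Iterable[Triplet]) -> Set[object]:
    """Return the nodes (concepts) and edges (triplets) of a caption graph."""
    items: Set[object] = set()
    for subj, rel, obj in triplets:
        subj, rel, obj = subj.lower(), rel.lower(), obj.lower()
        items.add(subj)              # concept node
        items.add(obj)               # concept node
        items.add((subj, rel, obj))  # relation edge
    return items


def compare_graphs(generated: Iterable[Triplet], reference: Iterable[Triplet]) -> Dict[str, object]:
    """Compare generated and reference graphs by exact match (an assumption)."""
    gen, ref = graph_items(generated), graph_items(reference)
    matched = gen & ref
    hallucinated = gen - ref  # present in the generated caption only
    omitted = ref - gen       # present in the reference caption only
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {
        "matched": matched,
        "hallucinated": hallucinated,
        "omitted": omitted,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


result = compare_graphs(
    generated=[("dog", "on", "grass"), ("dog", "holding", "frisbee")],
    reference=[("dog", "on", "grass"), ("tree", "behind", "dog")],
)
```

In this toy example, `frisbee` counts as a hallucination and `tree` as an omission. Exact string matching is the simplest possible choice here; a practical metric would need a more tolerant way of deciding that two concepts are the same.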