Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

Xiaoyu Liang, Jiayuan Yu, Lianrui Mu, Jiedong Zhuang, Jiaqi Hu, Yuchen Yang, Jiangnan Ye, Lu Lu, Jian Chen, Haoji Hu

TL;DR

The proposed Re-Balancing Contrastive Decoding (RBD) method employs textual and visual branches to recalibrate the attention distribution in VLMs, diminishing textual bias while enhancing visual information.

Abstract

Although Visual-Language Models (VLMs) have shown impressive capabilities in tasks like visual question answering and image captioning, they still struggle with hallucinations. Analysis of the attention distribution in these models shows that VLMs tend to process textual tokens rather than visual tokens. This imbalance causes VLMs to favor textual knowledge when multimodal knowledge conflicts arise, producing outputs that deviate from the image information. In this paper, we propose the Re-Balancing Contrastive Decoding (RBD) method, which employs textual and visual branches to recalibrate the attention distribution in VLMs. Specifically, the textual branch injects image noise to stimulate the model's dependency on text, thereby reducing textual bias. Concurrently, the visual branch focuses on the selection of significant tokens, refining the attention mechanism to highlight the primary subject. This dual-branch strategy enables RBD to diminish textual bias while enhancing visual information. Experimental results demonstrate that RBD outperforms existing methods on the CHAIR and POPE metrics, mitigating hallucinations without reducing the model's general capabilities.

Paper Structure (26 sections, 10 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Imbalance in Multimodal Knowledge Processing. LLaVA tends to process textual rather than visual information. LLaVAv1.5 assumes the presence of apples in a fruit shop, even though there is no apple in the image. This assumption is influenced by the inherent textual knowledge stored in the LLM backbone, thereby creating hallucinations. Words marked in red and green show incorrect and correct information, respectively.
  • Figure 2: Overview of our RBD, which is designed to calibrate the model's preference for textual and visual knowledge in order to mitigate hallucinations. On the left side, logits obtained from the textual and visual branches are integrated to refine the distribution of the original logits produced by the VLM. This process amplifies the predictions from the visual branch while diminishing the untruthful predictions from the textual branch, resulting in the final rebalanced logits depicted on the right side.
  • Figure 3: Results when using different hyperparameters on LLaVAv1.5-7B. Figures show the Accuracy metric in POPE. Higher values indicate fewer hallucinations.
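
The logit integration described in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's equation, which does not appear in this section: the linear combination below, the function name `rebalance_logits`, the weights `alpha` and `beta`, and the toy three-word vocabulary are all assumptions chosen only to show the idea of adding the visual branch and subtracting the textual branch.

```python
import numpy as np


def rebalance_logits(original, textual, visual, alpha=1.0, beta=1.0):
    """Hypothetical re-balancing rule (an assumption, not the paper's formula).

    Adds the visual-branch logits (weight alpha) and subtracts the
    textual-branch logits (weight beta) from the original logits.
    """
    original = np.asarray(original, dtype=float)
    textual = np.asarray(textual, dtype=float)
    visual = np.asarray(visual, dtype=float)
    return original + alpha * visual - beta * textual


# Toy next-token logits, invented for illustration only.
vocab = ["apple", "banana", "orange"]
original = [2.0, 1.5, 0.5]  # text prior makes "apple" the top token
textual = [3.0, 0.5, 0.2]   # noised-image branch: text bias is even stronger
visual = [1.0, 2.5, 0.5]    # visual branch: the image supports "banana"

rebalanced = rebalance_logits(original, textual, visual)
print(vocab[int(np.argmax(original))])    # top token before re-balancing
print(vocab[int(np.argmax(rebalanced))])  # top token after re-balancing
```

In this toy case the top token moves from "apple" to "banana", mirroring the fruit-shop example in Figure 1. The weights here presumably correspond in spirit to the hyperparameters varied in Figure 3, although that mapping is also an assumption.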