CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Zhehan Kan, Ce Zhang, Zihan Liao, Yapeng Tian, Wenming Yang, Junyuan Xiao, Xu Li, Dongmei Jiang, Yaowei Wang, Qingmin Liao

TL;DR

CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios, and generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLMs in various challenging applications.

Abstract

Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: a visual defect arising from vision-language misalignment, which creates a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on the Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLMs in various challenging applications.

Paper Structure

This paper contains 9 sections, 8 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Analysis of how varying decoupling levels affect ground-truth token probability. We utilize SAM to segment the original visual input into seven levels. The horizontal axis represents the number of segmented objects selected with SAM, and the vertical axis represents the token probability. As visual features unrelated to the target are removed, the probability of the hallucinated token "phone" decreases, while the probability of the ground-truth token "sandwich" increases.
  • Figure 2: LVLMs may generate responses that include hallucinations (e.g., "One person on the left side is holding a phone", where "sandwich" is hallucinated as "phone"). First, the CVD method leverages SAM to decouple the original input image $v$ into the dual image $z_d$ and the residual image $z_r$, and introduces a non-visual input $z_n$. These four inputs are then passed into the LVLM to generate their corresponding output distributions: $logits_o$, $logits_d$, $logits_r$, and $logits_n$. The Jensen-Shannon Divergence (JSD) is computed between each visual output distribution and the non-visual one to obtain $JSD_{on}$, $JSD_{mn}$, and $JSD_{cn}$. The NVS method compares $JSD_{mn}$ and $JSD_{cn}$, and the input with the greater distance is selected as the decoupled image (e.g., $z_r$). Next, ATCD selects the decoding strategy by comparing $JSD_{cn}$ and $JSD_{on}$. If $JSD_{cn}$ is greater, the decoupled image output distribution is employed to contrastively subtract the original distribution. Conversely, if $JSD_{on}$ is greater, the output distribution from the decoupled image is leveraged to contrastively enhance the weighted original distribution. This effectively corrects the hallucinated token (e.g., "phone" is successfully corrected to "sandwich"). Notably, this process is performed dynamically at each token generation step.
  • Figure 3: We randomly selected 1,000 instances from the MSCOCO dataset (Lin et al., 2014) and masked key visual features to create a masked image and an exposed image. We then calculated $JSD_{en}$ between the exposed image and the non-visual input, and $JSD_{mn}$ between the masked image and the non-visual input. The figure covers both a single-sample experiment and an extensive statistical analysis.
  • Figure 4: Results on the MME Hallucination and CHAIR benchmarks.
  • Figure 5: An analysis of cumulative hallucinations, comprising a single-sample experiment and an extensive statistical analysis.
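
The per-token procedure described in the Figure 2 caption (JSD against the non-visual input, NVS selection, then ATCD) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `jsd` and `catch_step`, the weights `alpha` and `beta`, and both fusion formulas are assumptions chosen for illustration, since the caption states only which distribution is subtracted or enhanced and not the exact weighting. The toy logits over a three-word vocabulary are made up.

```python
import numpy as np

def softmax(logits):
    """Convert a logit vector into a probability vector."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

def catch_step(logits_o, logits_d, logits_r, logits_n, alpha=1.0, beta=1.0):
    """One decoding step: returns (token index, chosen view, decoding mode).

    alpha, beta and both fusion formulas are hypothetical placeholders.
    """
    logits_o, logits_d, logits_r, logits_n = (
        np.asarray(x, dtype=float) for x in (logits_o, logits_d, logits_r, logits_n)
    )
    p_n = softmax(logits_n)
    jsd_o = jsd(softmax(logits_o), p_n)  # original image vs. non-visual input
    jsd_d = jsd(softmax(logits_d), p_n)  # dual image vs. non-visual input
    jsd_r = jsd(softmax(logits_r), p_n)  # residual image vs. non-visual input

    # NVS: keep the decoupled view that lies farther from the non-visual input.
    if jsd_d >= jsd_r:
        chosen, logits_c, jsd_c = "dual", logits_d, jsd_d
    else:
        chosen, logits_c, jsd_c = "residual", logits_r, jsd_r

    # ATCD: pick the decoding strategy by comparing the selected view
    # with the original image.
    if jsd_c > jsd_o:
        # Assumed contrastive form: amplify the decoupled view and
        # subtract the original distribution.
        mode = "subtract"
        fused = (1.0 + alpha) * logits_c - alpha * logits_o
    else:
        # Assumed enhancement form: add the decoupled view to the
        # original distribution.
        mode = "enhance"
        fused = logits_o + beta * logits_c
    return int(np.argmax(fused)), chosen, mode

# Toy vocabulary and logits (made up for illustration).
vocab = ["phone", "sandwich", "cup"]
token, chosen, mode = catch_step(
    logits_o=[1.5, 1.2, 0.0],  # original image slightly prefers "phone"
    logits_d=[1.8, 0.6, 0.0],  # dual image stays close to the language prior
    logits_r=[0.0, 3.0, 0.0],  # residual image strongly prefers "sandwich"
    logits_n=[2.0, 0.5, 0.0],  # non-visual input (language prior)
)
print(vocab[token], chosen, mode)
```

On these toy values the residual view is selected and the subtract branch is taken, so the step emits "sandwich" even though the original logits alone favor "phone". In an actual decoder the same step would run once per generated token, with the four logit vectors coming from four forward passes of the LVLM.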