VACoDe: Visual Augmented Contrastive Decoding

Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, Se-Young Yun

TL;DR

This paper tackles LVLM hallucinations by extending contrastive decoding to multiple visual augmentations and selecting the most contrastive option per query. VACoDe computes a distance-based score to identify the augmentation that maximizes contrast between original and augmented outputs, then applies CD with that augmentation during decoding, without requiring training or external data. Empirical results on MME, MMBench, and VQAv2 across multiple LVLM backbones demonstrate consistent performance gains over single-augmentation CD, highlighting the method's robustness and practicality. The work also provides insights into how visual perturbations interact with language generation and offers a path toward task-aware, augmentation-driven decoding for diverse vision-language tasks.

Abstract

Despite the astonishing performance of recent Large Vision-Language Models (LVLMs), these models often generate inaccurate responses. To address this issue, previous studies have focused on mitigating hallucinations by employing contrastive decoding (CD) with augmented images, which amplifies the contrast with the original image. However, these methods have limitations, including reliance on a single augmentation, which is restrictive for certain tasks, as well as the high cost of using external knowledge. In this study, we address these limitations by exploring how to utilize multiple image augmentations. Through extensive experiments, we observed that different augmentations produce varying levels of contrast depending on the task. Based on this observation, we introduce a novel method called VACoDe, Visual Augmented Contrastive Decoding. This method adaptively selects the augmentation with the highest contrast for each task using the proposed softmax distance metric. Our empirical tests show that VACoDe outperforms previous methods and improves output quality in various vision-language tasks. Additionally, VACoDe can be universally applied across different model types and sizes without additional training or the use of external models and data.
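
The selection step described above can be illustrated with a small sketch. It is not the authors' implementation: the function names (`select_augmentation`, `contrastive_decode`), the use of an L2 distance between softmax outputs, the contrast weight `alpha`, and the toy logits are all assumptions made for illustration. The contrastive step uses the common form `(1 + alpha) * original - alpha * augmented` on logits.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def select_augmentation(orig_logits, aug_logits):
    """Pick the augmentation whose softmax output is farthest from the original's.

    aug_logits maps a (hypothetical) augmentation name to the logits the model
    produces for the augmented image. The L2 distance is an assumption.
    """
    p = softmax(orig_logits)
    distances = {
        name: float(np.linalg.norm(p - softmax(logits)))
        for name, logits in aug_logits.items()
    }
    best = max(distances, key=distances.get)
    return best, distances

def contrastive_decode(orig_logits, aug_logits, alpha=1.0):
    """Amplify what the original image supports and the augmented image does not."""
    orig = np.asarray(orig_logits, dtype=float)
    aug = np.asarray(aug_logits, dtype=float)
    return softmax((1.0 + alpha) * orig - alpha * aug)

# Toy example (made-up numbers) for the question "Where is the cat?".
vocab = ["left", "right", "red"]
orig = np.array([1.0, 1.5, 0.0])           # model slightly prefers "right"
augmented = {
    "flip": np.array([1.5, 1.0, 0.0]),     # flipping moves mass to "left"
    "color": np.array([1.0, 1.4, 0.1]),    # color change barely matters here
}

best, distances = select_augmentation(orig, augmented)
p_orig = softmax(orig)
p_cd = contrastive_decode(orig, augmented[best], alpha=1.0)
print(best, distances)
print(dict(zip(vocab, p_orig.round(3))), dict(zip(vocab, p_cd.round(3))))
```

In this toy setting the flip augmentation is selected because it shifts the output distribution more than the color augmentation does, and contrasting against it raises the probability of "right" relative to plain decoding. In a real LVLM the logits would come from forward passes on the original and augmented images.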

Paper Structure

This paper contains 18 sections, 5 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the problem we focus on: when applying CD to LVLMs, selecting the appropriate augmentation for each query is crucial to decoding performance. For example, if the question is "Where is the cat?" and the correct answer is "right", flipping the input image yields the contrastive answer "left". This contrastive information helps increase the correct answer's probability under CD. Conversely, color augmentation is unsuitable for this question because it does not produce a contrastive output distribution. The main challenge is therefore how to adaptively select the most effective augmentation to improve CD performance in LVLMs.
  • Figure 2: Examples of the visual augmentations used in this paper.
  • Figure 3: A detailed analysis of augmentation-question pairs reveals that (a) for a color-type query, color augmentation produces a contrastive distribution, whereas flipping does not. Similarly, (b) shows that an existence-type query is influenced by random cropping.
  • Figure 4: For each question type in the MME dataset, (a) the MME score drop on augmented images and (b) the softmax output gain after CD, measured for different augmentations.
  • Figure 5: The softmax output of the ground truth increases after CD. Top1, the augmentation with the largest distance, yields the largest increase, exceeding the results of any single augmentation.
  • ...and 3 more figures