VACoDe: Visual Augmented Contrastive Decoding
Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, Se-Young Yun
TL;DR
This paper tackles LVLM hallucinations by extending contrastive decoding (CD) to multiple visual augmentations and selecting the most contrastive option per query. VACoDe computes a distance-based score to identify the augmentation that maximizes contrast between original and augmented outputs, then applies CD with that augmentation during decoding, without requiring training or external data. Empirical results on MME, MMBench, and VQAv2 across multiple LVLM backbones demonstrate consistent performance gains over single-augmentation CD, highlighting the method's robustness and practicality. The work also provides insights into how visual perturbations interact with language generation and offers a path toward task-aware, augmentation-driven decoding for diverse vision-language tasks.
Abstract
Despite the astonishing performance of recent Large Vision-Language Models (LVLMs), these models often generate inaccurate responses. To address this issue, previous studies have focused on mitigating hallucinations by employing contrastive decoding (CD) with augmented images, which amplifies the contrast with the original image. However, these methods have limitations, including reliance on a single augmentation, which is restrictive for certain tasks, as well as the high cost of using external knowledge. In this study, we address these limitations by exploring how to utilize multiple image augmentations. Through extensive experiments, we observed that different augmentations produce varying levels of contrast depending on the task. Based on this observation, we introduce a novel method called VACoDe, Visual Augmented Contrastive Decoding. This method adaptively selects the augmentation with the highest contrast for each task using the proposed softmax distance metric. Our empirical tests show that VACoDe outperforms previous methods and improves output quality in various vision-language tasks. Additionally, VACoDe can be universally applied across different model types and sizes without additional training or the use of external models and data.
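The selection-then-contrast idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything specific here is an assumption: the softmax distance is taken to be the L2 distance between the two softmax distributions, the contrastive step uses the commonly seen form (1 + alpha) * original - alpha * augmented, and the function names, the augmentation names, and the toy logits are all hypothetical. In practice the logits would come from running an LVLM on the original and augmented images.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_distance(logits_a, logits_b):
    """Assumed metric: L2 distance between the two softmax distributions."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def select_augmentation(orig_logits, aug_logits_by_name):
    """Return the name of the augmentation whose output differs most from the original."""
    return max(
        aug_logits_by_name,
        key=lambda name: softmax_distance(orig_logits, aug_logits_by_name[name]),
    )


def contrastive_logits(orig_logits, aug_logits, alpha=1.0):
    """Assumed CD form: boost the original logits and subtract the augmented ones."""
    return [(1 + alpha) * o - alpha * a for o, a in zip(orig_logits, aug_logits)]


# Toy next-token logits over a 3-word vocabulary (illustrative values only).
orig = [2.0, 1.0, 0.0]
augmented = {
    "noise": [2.0, 1.0, 0.1],  # barely changes the prediction
    "flip": [0.0, 1.0, 2.0],   # changes the prediction strongly
}

best = select_augmentation(orig, augmented)
adjusted = contrastive_logits(orig, augmented[best], alpha=1.0)
```

In this toy case the "flip" logits move the distribution furthest from the original, so that augmentation is chosen and its logits are the ones subtracted in the contrastive step. No training or external model is involved, which matches the training-free setting the abstract describes.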
