Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang

TL;DR

This work tackles hallucination in large vision-language models by revealing that hallucination causes are hybrid and step-specific. It introduces Octopus, an adaptive framework that uses a learnable decision token to dynamically select among multiple Contrastive Decoding strategies (VCD, M3ID, AVISC) at each generation step, forming a flexible decoding workflow without retraining LVLM weights. Optimized via Direct Preference Optimization, Octopus achieves state-of-the-art performance on generative and discriminative benchmarks while remaining deployable and extensible. The approach offers practical impact by reducing fabricated content in multi-modal systems and enabling easy integration of additional CD strategies as tentacles of the Octopus.

Abstract

Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach for all input conditions. In this paper, we revisit this process through extensive experiments. Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Code is available at https://github.com/LijunZhang01/Octopus.
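
As a rough illustration of what "introducing disturbed inputs" means in Contrastive Decoding, the sketch below contrasts next-token logits computed from the original input with logits computed from a distorted input. The weighting `(1 + alpha) * original - alpha * distorted` is one commonly used form and is assumed here rather than taken from the text above; the toy vocabulary, the logit values, and the function names are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(original, distorted, alpha=1.0):
    """Assumed CD form: amplify what the original input supports and
    subtract what the distorted input still predicts."""
    return [(1 + alpha) * o - alpha * d for o, d in zip(original, distorted)]

# Hypothetical 3-token vocabulary and logits for one generation step.
vocab = ["dog", "cat", "frisbee"]
original = [2.0, 1.0, 1.9]   # model conditioned on the real image
distorted = [0.5, 1.0, 1.9]  # model conditioned on a distorted image

# "frisbee" stays likely even when the image is distorted, which suggests
# it is driven by priors rather than visual evidence; CD penalizes it.
adjusted = contrastive_logits(original, distorted, alpha=1.0)
probs = softmax(adjusted)
best = vocab[max(range(len(vocab)), key=adjusted.__getitem__)]
```

In this toy case the gap between "dog" and "frisbee" widens after the contrast, because only "dog" loses support when the image is distorted.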

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Paradigm comparison of different hallucination alleviation methods. (a) Retraining method. Constructing high-quality data to retrain these LVLMs. (b) Contrastive Decoding. Comparing the output distributions from the original and distorted inputs. (c) Octopus. Our method focuses on dynamically selecting suitable strategies to reduce hallucinations caused by various factors.
  • Figure 2: The proportion of effective samples using different CD methods for (a) Generative Task and (b) Discriminative Task. We observe that each CD strategy can only address part of the samples.
  • Figure 3: Token-level hallucination quantitative evaluation. We enumerate different CD strategies at each time step. The results show that using multiple CD strategies obtains better performance.
  • Figure 4: Token-level hallucination qualitative analysis. For simplicity, we only present the attention map across the top 5 visual tokens and corresponding keywords. The results show that hallucination causes are hybrid in a sample.
  • Figure 5: Overview of our method. Our Octopus framework consists of two key components: the decision token $eye$ and its tentacles. Specifically, we first use the "$eye$" to identify the type of hallucination, and the "tentacles" are then applied to address the specific hallucination issue at each generative step. Finally, the model is optimized with DPO or other reinforcement learning methods.
  • ...and 1 more figure
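
The per-step selection described in the TL;DR and in Figure 5 can be sketched as a loop that picks one strategy for each generation step and applies it. This is a minimal sketch under stated assumptions, not the authors' implementation: the selector scores stand in for whatever the learnable decision token outputs, each CD strategy is reduced to a precomputed set of distorted logits, the `"none"` option (skip contrast) is an added assumption, and every name and number is hypothetical. DPO training of the selector is not shown.

```python
def contrast(original, distorted, alpha=1.0):
    """Assumed contrastive form, shared by all strategies in this sketch."""
    return [(1 + alpha) * o - alpha * d for o, d in zip(original, distorted)]

def decode_step(original, distorted_by_strategy, selector_scores):
    """Pick the highest-scoring strategy, apply it, return (strategy, token id)."""
    choice = max(selector_scores, key=selector_scores.get)
    if choice == "none":
        logits = list(original)  # assumption: plain decoding is also an option
    else:
        logits = contrast(original, distorted_by_strategy[choice])
    token_id = max(range(len(logits)), key=logits.__getitem__)
    return choice, token_id

# Hypothetical inputs for three generation steps over a 3-token vocabulary.
steps = [
    {"original": [1.0, 2.0, 0.5],
     "distorted": {"vcd": [0.0, 2.5, 0.5], "m3id": [1.0, 2.0, 0.5], "avisc": [1.0, 2.0, 0.5]},
     "scores": {"none": 0.1, "vcd": 0.7, "m3id": 0.1, "avisc": 0.1}},
    {"original": [0.2, 0.1, 3.0],
     "distorted": {"vcd": [0.2, 0.1, 3.0], "m3id": [0.2, 0.1, 3.0], "avisc": [0.2, 0.1, 3.0]},
     "scores": {"none": 0.8, "vcd": 0.1, "m3id": 0.05, "avisc": 0.05}},
    {"original": [1.5, 1.4, 0.0],
     "distorted": {"vcd": [1.5, 1.4, 0.0], "m3id": [1.5, 1.4, 0.0], "avisc": [1.5, 0.4, 0.0]},
     "scores": {"none": 0.1, "vcd": 0.1, "m3id": 0.2, "avisc": 0.6}},
]

# The resulting sequence of (strategy, token id) pairs is the "workflow".
workflow = [decode_step(s["original"], s["distorted"], s["scores"]) for s in steps]
```

Because the strategies sit behind a dictionary lookup, adding another "tentacle" in this sketch only means adding one more key to the distorted-logit and score dictionaries.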