NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

Kai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, Chengjie Wang

TL;DR

NoiseBoost attributes hallucinations in multimodal large language models (MLLMs) to over-reliance on language priors and proposes a simple noise perturbation of visual tokens as a regularizer. By injecting Gaussian noise with scale $0.5$ into the projected visual features, NoiseBoost redistributes attention between visual and linguistic tokens, reducing anchor-based hallucinations. It shows consistent gains across supervised fine-tuning, reinforcement learning, and semi-supervised learning (SSL), including an $8.1\%$ improvement in dense-caption accuracy as judged by humans, and its SSL variant achieves comparable results with 50% of the data by mining unlabeled data. The method adds negligible overhead and integrates with existing training pipelines, offering a practical path to more faithful MLLMs.
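The perturbation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the noise is additive, zero-mean, and applied only during training, and the function name `perturb_visual_tokens`, the token count (576), and the hidden size (64) are hypothetical choices made for the example. Only the scale of 0.5 comes from the summary.

```python
import numpy as np


def perturb_visual_tokens(visual_tokens, noise_scale=0.5, rng=None, training=True):
    """Return visual tokens with Gaussian noise added (hypothetical helper).

    visual_tokens: array of shape (num_tokens, hidden_dim), i.e. the visual
        features after projection into the language model's embedding space.
    noise_scale: multiplier on standard normal noise; 0.5 is the scale
        quoted in the TL;DR.
    training: assumption of this sketch; with training=False the input is
        returned unchanged.
    """
    if not training:
        return visual_tokens
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: additive noise, features + scale * N(0, 1).
    noise = rng.standard_normal(visual_tokens.shape)
    return visual_tokens + noise_scale * noise


# Toy usage: 576 visual tokens with a hidden size of 64 (both made up).
rng = np.random.default_rng(0)
visual = rng.standard_normal((576, 64))
noisy = perturb_visual_tokens(visual, noise_scale=0.5, rng=rng)
clean = perturb_visual_tokens(visual, training=False)
```

Because the perturbation is a single elementwise addition on the visual features, its cost is small relative to a forward pass, which is consistent with the negligible overhead claimed above.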

Abstract

Multimodal large language models (MLLMs) provide a powerful mechanism for understanding visual information by building on large language models. However, MLLMs are notorious for hallucinating, especially when generating lengthy, detailed descriptions of images. Our analysis reveals that hallucinations stem from the inherent summarization mechanism of large language models, which leads to excessive dependence on linguistic tokens while neglecting visual information. In this paper, we propose NoiseBoost, a simple and broadly applicable method for alleviating hallucinations in MLLMs through noise feature perturbation. The noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights among visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Further, NoiseBoost pioneers semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% under human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at https://kaiwu5.github.io/noiseboost.

Paper Structure (20 sections, 4 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: MLLMs suffer from hallucinations due to over-reliance on language priors. In (a), the hallucinated tokens are overly dependent (0.27) on previous language tokens, and all later tokens are hallucinations. In (b), noise perturbation in NoiseBoost helps the MLLM distribute attention weights evenly among visual and language tokens, leading to honest results.
  • Figure 2: Framework of NoiseBoost. We add noise perturbation to visual tokens to mitigate over-reliance on language tokens and thus reduce hallucinations. For SFT, we inject noise directly into the visual features. For reinforcement learning, we inject the perturbation only for the preferred response, since this makes learning harder for the MLLM and yields better results. For semi-supervised learning, we use a frozen MLLM as a teacher to generate pseudo labels and NoiseBoost as the student for consistency regularization.
  • Figure 3: Human evaluation of dense captions for 1k images with the prompt "Describe the image and its style in detail". The "correct" category denotes fully accurate captions; the other categories are errors identified by human annotators. NoiseBoost consistently reduces errors in nearly all categories.
  • Figure 4: Analysis of the cause of hallucination: over-reliance on language tokens (circled in red), which NoiseBoost does not exhibit.
  • Figure 5: Qualitative evaluation shows that NoiseBoost can generate long detailed captions without hallucinations.
  • ...and 4 more figures
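
The semi-supervised setup in the Figure 2 caption (a frozen teacher producing pseudo labels, a noise-perturbed student trained for consistency) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: small linear classifiers stand in for the MLLMs, the loss is a plain cross-entropy against the teacher's hard pseudo labels, and the names `softmax` and `consistency_loss` are hypothetical. It shows the data flow only, not the paper's actual objective.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def consistency_loss(teacher_w, student_w, visual_feats, noise_scale=0.5, rng=None):
    """Cross-entropy of a noisy student against a frozen teacher's pseudo labels.

    teacher_w, student_w: (hidden_dim, num_classes) weights of toy linear
        heads that stand in for the teacher and student models.
    visual_feats: (batch, hidden_dim) unlabeled visual features.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Frozen teacher: pseudo labels come from the clean features.
    pseudo_labels = softmax(visual_feats @ teacher_w).argmax(axis=-1)
    # Student: sees the same features with additive Gaussian noise (assumed form).
    noisy = visual_feats + noise_scale * rng.standard_normal(visual_feats.shape)
    student_probs = softmax(noisy @ student_w)
    picked = student_probs[np.arange(len(pseudo_labels)), pseudo_labels]
    loss = float(-np.log(picked + 1e-12).mean())
    return loss, pseudo_labels


# Toy usage with made-up sizes: 8 unlabeled samples, 16-dim features, 4 classes.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
teacher_w = rng.standard_normal((16, 4))
student_w = rng.standard_normal((16, 4))
loss, pseudo_labels = consistency_loss(teacher_w, student_w, feats, rng=rng)
```

In a real training loop only the student's parameters would receive gradients from this loss; the teacher stays fixed, as the caption describes.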