Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Shuo Li, Jiajun Sun, Guodong Zheng, Xiaoran Fan, Yujiong Shen, Yi Lu, Zhiheng Xi, Yuming Yang, Wenming Tan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work identifies object hallucinations in multimodal LLMs as a consequence of over-sensitivity to frequency-domain image features. It introduces Multi-Frequency Perturbations (MFP), a pluggable pipeline that extracts high- and low-frequency image features, fuses them with the original visual tokens through cross-attention, and attenuates redundant frequency information at inference time. The method yields architecture-agnostic improvements on the CHAIR, POPE, MME, and MMBench benchmarks, and improves results further when combined with existing state-of-the-art approaches such as PAI. Overall, MFP is a practical training-time strategy for improving the reliability of MLLMs in object grounding and description tasks, and it applies across visual encoders and model scales.

Abstract

Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
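The fusion step described above can be illustrated with a small NumPy sketch. The summary does not give the exact fusion or attenuation rule, so everything here is an assumption made for illustration: the function names (`cross_attention`, `perturb_visual_tokens`), the residual form `visual + gamma * attention`, the use of a single attention head without learned projections, and the choice of applying $\gamma$ as a plain scalar. It only shows how frequency features could perturb visual tokens and how a scalar could attenuate that perturbation at inference time.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, to keep the
    sketch short. queries: (n_q, d), context: (n_c, d).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (n_q, n_c)
    return softmax(scores, axis=-1) @ context   # (n_q, d)


def perturb_visual_tokens(
    visual: np.ndarray,
    low_feats: np.ndarray,
    high_feats: np.ndarray,
    gamma: float = 1.0,
) -> np.ndarray:
    """Hypothetical fusion: visual tokens attend to frequency features.

    gamma scales the frequency-derived perturbation. In this sketch
    gamma = 1.0 stands in for training and gamma < 1.0 for
    inference-time attenuation (an assumed rule, not the paper's).
    """
    freq = np.concatenate([low_feats, high_feats], axis=0)
    return visual + gamma * cross_attention(visual, freq)
```

With `gamma=0.0` the function returns the original visual tokens unchanged, which makes the role of the attenuation factor easy to check in isolation.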

Paper Structure

This paper contains 35 sections, 9 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: An example from GPT-4o. Unlike humans, the model is over-susceptible to a limited set of high- and low-frequency image features when detecting objects, leading to incorrect object detection and an erroneous image caption.
  • Figure 2: Instance-level hallucination rate when using only low- or high-frequency features. The x-axis represents the cutoff frequency. Features with frequencies higher than the cutoff are retained as high-frequency features, while those below the cutoff are selected as low-frequency features.
  • Figure 3: The model architecture of our proposed method, where $\gamma$ is employed only at inference time.
  • Figure 4: Results of sensitivity analysis on CHAIR benchmark for the parameter $\gamma$. The experiments are conducted on the LLaVA-1.5-7B model.
  • Figure 5: Comparison between our proposed MFP method and the original output in some cases. The hallucinating responses are highlighted in red.
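
The cutoff-based split described in the Figure 2 caption can be sketched with a 2D FFT. This is a minimal illustration, not the paper's filter: the function name `split_frequencies`, the circular (ideal) mask, the grayscale input, and the cutoff measured in frequency bins from the spectrum centre are all assumptions.

```python
import numpy as np


def split_frequencies(
    image: np.ndarray, cutoff: float
) -> tuple[np.ndarray, np.ndarray]:
    """Split a 2D grayscale image into low- and high-frequency parts.

    `cutoff` is a radius in the centred spectrum, in frequency bins
    (a hypothetical parameterisation; the paper's units may differ).
    Bins at or below the cutoff go to the low-frequency image, the
    rest go to the high-frequency image.
    """
    h, w = image.shape
    # Move the zero-frequency (DC) bin to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every frequency bin from the DC bin.
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    low_mask = radius <= cutoff

    # Invert each masked spectrum; the masks are complementary, so the
    # two outputs sum back to the input image.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high
```

Because the two masks partition the spectrum, `low + high` reconstructs the input up to floating-point error, and a cutoff of `0.0` leaves only the mean intensity in the low-frequency image.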