EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models

Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai

TL;DR

EFUF introduces an efficient fine-grained unlearning framework to mitigate object hallucinations in multimodal LLMs without requiring paired data. It leverages CLIP-based text-image congruence to construct positive and negative subsentence samples and applies three losses (positive, negative, sentence) to unlearn hallucinated object alignments while preserving fluent, coherent long-form text. Across multiple MLLMs and a COCO-derived evaluation setup, EFUF achieves substantial reductions in hallucination rates and improvements in generation quality with markedly lower training costs and annotation burdens than RLHF- or DPO-based methods. The approach demonstrates strong compatibility with existing hallucination mitigation techniques and offers a scalable, data-efficient path toward more reliable multimodal generation systems.

Abstract

Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. However, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. Our code and datasets will be publicly available.

Paper Structure

This paper contains 38 sections, 13 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: An example of hallucination in MLLM.
  • Figure 2: Comparison of hallucinated and non-hallucinated objects generated by MiniGPT4 (a) and LLaVA (b) on image-relevance scores.
  • Figure 3: An overview of EFUF. EFUF is divided into two stages: dataset formation and unlearning process. Initially, we extract objects from generated captions and calculate their image relevance utilizing CLIP, followed by the construction of three datasets. Subsequently, three corresponding losses are tailored to finetune the model.
  • Figure 4: Training time comparison of EFUF with other finetuning-based methods (A100 GPU hours).
  • Figure 5: Responses of MiniGPT4 with different methods.
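
The two stages named in the Figure 3 caption (dataset formation, then unlearning) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: image-relevance scores are assumed to be precomputed (for example with CLIP), and the names `Subsentence`, `partition`, `combined_loss`, the threshold values, and the loss weights are all hypothetical. Treating the negative loss as a sign-flipped likelihood term is a common unlearning formulation and is assumed here rather than taken from the text above. The sentence-level loss is passed in as a number because its dataset construction is not described in this section.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subsentence:
    text: str         # caption fragment that mentions one extracted object
    relevance: float  # image-relevance score (assumed precomputed, e.g. by CLIP)


# Hypothetical thresholds; real values would have to be tuned on data.
POS_THRESHOLD = 0.30
NEG_THRESHOLD = 0.20


def partition(
    samples: List[Subsentence],
    pos_threshold: float = POS_THRESHOLD,
    neg_threshold: float = NEG_THRESHOLD,
) -> Tuple[List[Subsentence], List[Subsentence]]:
    """Split fragments into likely-grounded and likely-hallucinated sets.

    Fragments scoring between the two thresholds are treated as ambiguous
    and dropped (an assumption of this sketch).
    """
    positives = [s for s in samples if s.relevance >= pos_threshold]
    negatives = [s for s in samples if s.relevance <= neg_threshold]
    return positives, negatives


def combined_loss(
    pos_nll: float,
    neg_nll: float,
    sent_nll: float,
    neg_weight: float = 0.3,   # hypothetical weight
    sent_weight: float = 0.2,  # hypothetical weight
) -> float:
    """Combine three per-batch negative log-likelihoods into one objective.

    Minimizing the result lowers the loss on positive fragments and whole
    sentences while raising it on negative fragments, which is the
    gradient-ascent style of unlearning assumed in this sketch.
    """
    return pos_nll - neg_weight * neg_nll + sent_weight * sent_nll


if __name__ == "__main__":
    fragments = [
        Subsentence("a dog on the grass", 0.34),
        Subsentence("a frisbee in the air", 0.15),
        Subsentence("a bench nearby", 0.25),
    ]
    pos, neg = partition(fragments)
    print([s.text for s in pos], [s.text for s in neg])
    print(combined_loss(pos_nll=2.0, neg_nll=1.0, sent_nll=3.0))
```

In a real training loop the three loss values would be tensors computed by the model on batches drawn from the three datasets; plain floats are used here only to keep the sketch runnable without a deep learning framework.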