Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, Lijuan Wang

TL;DR

The paper tackles hallucination in large multi-modal models by introducing LRV-Instruction, a dataset of 400k GPT4-generated visual instructions with positive and negative samples across 16 vision-and-language (VL) tasks, and GAVIE, an evaluation framework that needs no human-annotated ground truth. It shows that finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction reduces hallucination and improves performance on public VL benchmarks, with a balanced mix of positive and negative data yielding the most robust model. The work also analyzes the challenges posed by different negative instruction types and validates scalability via pseudo dense captions. Data and code are publicly released.

Abstract

Despite promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation, and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach that evaluates visual instruction tuning as human experts would. GAVIE does not require human-annotated ground-truth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.
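
As a rough illustration of the balanced positive/negative mix mentioned in the abstract, the following Python sketch tags each instruction with its type and subsamples the larger group so both are equally represented. The `Instruction` record, the `kind` labels, the `balanced_mix` helper, and the 1:1 reading of "balanced" are assumptions made for this example; they are not taken from the released code.

```python
import random
from dataclasses import dataclass

# Hypothetical labels for the three negative instruction levels named in the abstract.
NEGATIVE_KINDS = ("nonexistent_object", "existent_object", "knowledge")


@dataclass(frozen=True)
class Instruction:
    """Hypothetical record for one visual instruction sample."""
    image_id: str
    question: str
    answer: str
    kind: str = "positive"  # "positive" or one of NEGATIVE_KINDS

    @property
    def is_negative(self) -> bool:
        return self.kind in NEGATIVE_KINDS


def balanced_mix(samples, seed=0):
    """Subsample the larger group so positives and negatives are equal in number.

    Assumption: "balanced" is read here as a 1:1 ratio.
    """
    rng = random.Random(seed)
    positives = [s for s in samples if not s.is_negative]
    negatives = [s for s in samples if s.is_negative]
    n = min(len(positives), len(negatives))
    mixed = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(mixed)
    return mixed
```

Calling `balanced_mix(samples)` on a list with six positive and two negative samples would return four samples, two of each type, in shuffled order.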

Paper Structure (27 sections, 38 figures, 15 tables)

Figures (38)

  • Figure 1: Given an image and a human instruction as input, we introduce GPT4-Assisted Visual Instruction Evaluation (GAVIE) to assess the output of current LMMs, such as MiniGPT4 and mPLUG-Owl. BLUE marks cases where LMMs cannot accurately follow human instructions, while RED marks cases where they hallucinate. After finetuning current LMMs on our proposed LRV-Instruction dataset, they generate more robust answers.
  • Figure 2: Examples of positive and negative instances in our LRV-Instruction dataset. RED marks inconsistent elements in the negative instructions. More examples are in the Appendix.
  • Figure 3: An example of the prompt we use to generate visual instruction data with GPT4. We use bounding box coordinates and dense captions to represent image content.
  • Figure 4: Comprehensive statistics of LRV-Instruction. In (d), BLUE means existent object manipulation, PINK means nonexistent object manipulation, and GREEN means knowledge manipulation.
  • Figure 5: An example prompt for text-only GPT4 that we use to generate instructions and answers for chart images. The sentence in BLUE is the caption of the chart.
  • ...and 33 more figures
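
The caption of Figure 3 says that image content is represented to GPT4 through dense captions and bounding box coordinates. A minimal sketch of that idea, assuming a hypothetical `(caption, (x, y, w, h))` region format and a hypothetical line layout (the exact prompt wording used by the authors is not reproduced here):

```python
def build_image_context(regions):
    """Render regions as text lines so a text-only model can "see" the image.

    `regions` is a list of (caption, (x, y, w, h)) tuples. Both this input
    format and the output layout below are assumptions for illustration.
    """
    lines = []
    for caption, (x, y, w, h) in regions:
        lines.append(f"{caption} [x={x}, y={y}, w={w}, h={h}]")
    return "\n".join(lines)


def build_generation_prompt(regions, task_instruction):
    """Combine the textual image description with a task instruction."""
    return f"Image content:\n{build_image_context(regions)}\n\n{task_instruction}"
```

For example, `build_image_context([("a dog on the grass", (10, 20, 100, 80))])` yields the single line `a dog on the grass [x=10, y=20, w=100, h=80]`, which can then be placed ahead of an instruction asking the model to write questions and answers about the image.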