Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Lu, Feng Zheng

TL;DR

This work addresses hallucinations in large vision-language models by introducing reflective instruction tuning, which integrates positive and negative rationale learning into visual instruction tuning. Central to the approach is REVERIE, a large-scale dataset providing richly annotated rationales and hard negatives to guide fine-grained reasoning. Across experiments on two LVLMs and six benchmarks, reflective tuning yields notable performance gains and reduced hallucinations, with negative rationales offering additional improvements. By decoupling rationale generation from response prediction and applying consistency-based data filtering, the method advances reliable multimodal reasoning with practical significance for real-world LVLM deployment.

Abstract

Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. However, they remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. While various mitigation strategies have been proposed, they often neglect a key contributor to hallucinations: lack of fine-grained reasoning supervision during training. Without intermediate reasoning steps, models may establish superficial shortcuts between instructions and responses, failing to internalize the inherent reasoning logic. To address this challenge, we propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning. Unlike previous methods that learn from responses only, our approach entails the model predicting rationales justifying why responses are correct or incorrect. This fosters a deeper engagement with the fine-grained reasoning underlying each response, thus enhancing the model's reasoning proficiency. To facilitate this approach, we propose REVERIE, the first large-scale instruction-tuning dataset with ReflEctiVE RatIonalE annotations. REVERIE comprises 115k machine-generated reasoning instructions, each meticulously annotated with a corresponding pair of correct and confusing responses, alongside comprehensive rationales elucidating the justification behind the correctness or erroneousness of each response. Experimental results on multiple LVLM benchmarks reveal that reflective instruction tuning with the REVERIE dataset yields a noticeable performance gain over the baseline model, demonstrating the effectiveness of reflecting on the rationales. The project page is at https://zjr2000.github.io/projects/reverie.
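
To make the training setup described in the abstract more concrete, the following is a minimal sketch of how one annotated sample might be expanded into separate supervision targets: one for the response, one for the positive rationale, and one for the negative rationale. The record layout, field names, prompt wording, and the example question are all assumptions made for illustration; they are not taken from the REVERIE release or the authors' code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReflectiveSample:
    # Hypothetical record layout; these field names are assumptions,
    # not the actual REVERIE schema.
    instruction: str
    correct_response: str
    confusing_response: str
    positive_rationale: str
    negative_rationale: str


def build_training_turns(sample: ReflectiveSample) -> List[Tuple[str, str]]:
    """Expand one sample into three (prompt, target) pairs.

    Response prediction and rationale prediction are kept as separate
    turns, so the rationale is not part of the answer target.
    The prompt templates below are illustrative placeholders.
    """
    response_turn = (sample.instruction, sample.correct_response)
    positive_turn = (
        f"{sample.instruction}\nAnswer: {sample.correct_response}\n"
        "Explain why this answer is correct.",
        sample.positive_rationale,
    )
    negative_turn = (
        f"{sample.instruction}\nAnswer: {sample.confusing_response}\n"
        "Explain why this answer is incorrect.",
        sample.negative_rationale,
    )
    return [response_turn, positive_turn, negative_turn]


# Toy, made-up example (not drawn from the dataset).
example = ReflectiveSample(
    instruction="What is the man holding?",
    correct_response="An umbrella.",
    confusing_response="A cane.",
    positive_rationale="The object has an open canopy above the man's head.",
    negative_rationale="A cane has no canopy, but the object clearly has one.",
)
turns = build_training_turns(example)
```

Keeping the three targets as separate turns mirrors the idea of supervising rationales alongside, rather than inside, the response; how the actual method mixes or weights these turns is not stated in this summary.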

Paper Structure (13 sections, 11 figures, 11 tables)

This paper contains 13 sections, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Difference between vanilla instruction tuning and the proposed reflective instruction tuning. Vanilla instruction tuning trains LVLMs solely for response generation, without supervising the learning of fine-grained reasoning details. Reflective instruction tuning additionally trains the model to reflect on the rationale underlying the response, which provides more fine-grained supervision (e.g., the key visual evidence and facts needed to reach the response, highlighted in red) and helps the model learn to capture more critical information.
  • Figure 2: Overview of the REVERIE dataset's data collection pipeline. We first employ Gemini-Vision-Pro to annotate the instructions, responses, and rationales for each image. Gemini-Pro is then used to check the consistency between positive and negative rationales. Inconsistent samples are filtered to maintain dataset quality.
  • Figure 3: Statistics of the REVERIE dataset.
  • Figure 4: Visualization of the generation of positive rationales.
  • Figure 5: Visualization of the generation of negative rationales.
  • ...and 6 more figures
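
The filtering step in the Figure 2 caption (check consistency between positive and negative rationales, drop inconsistent samples) can be sketched as a filter over a pluggable judge. This is only an illustration: the dictionary keys, the `filter_consistent` helper, and `toy_judge` are hypothetical names, and the toy judge is a trivial stand-in for the model-based check the caption describes, not a reproduction of it.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]
# A judge takes (positive_rationale, negative_rationale) and returns
# True when the two rationales are consistent with each other.
Judge = Callable[[str, str], bool]


def filter_consistent(samples: List[Sample], judge: Judge) -> List[Sample]:
    """Keep only samples whose rationale pair the judge accepts."""
    return [
        s for s in samples
        if judge(s["positive_rationale"], s["negative_rationale"])
    ]


def toy_judge(positive: str, negative: str) -> bool:
    # Toy placeholder for a model-based consistency check (assumption):
    # it only rejects pairs where either rationale is missing.
    return bool(positive.strip()) and bool(negative.strip())


# Made-up records for demonstration only.
raw_samples = [
    {"positive_rationale": "The sign shows the word STOP.",
     "negative_rationale": "The sign is red, so it cannot be a yield sign."},
    {"positive_rationale": "The cat is on the sofa.",
     "negative_rationale": ""},
]
kept = filter_consistent(raw_samples, toy_judge)
```

Passing the judge as a function keeps the filter independent of whichever model performs the check; in practice the judge would wrap a call to that model.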