Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

Filippo Morbiato, Luca Romano, Alessandro Persona

TL;DR

This work targets visual hallucination in Multimodal Large Language Models (MLLMs) by introducing Grounded Visual Factualization (GVF) Finetuning, a method that embeds explicit factual signals into training through three components: Factual Anchor Data Augmentation, Fact-Aware Instruction Tuning, and a Factual Consistency Loss. By linking observed image facts to structured anchors and counter-factual prompts, GVF reinforces objective visual grounding and penalizes factual inconsistencies during learning. Evaluated on LLaVA-1.5-13B, GVF significantly improves VHTest performance on both OEQ and YNQ formats while maintaining or slightly improving scores on general multimodal benchmarks such as MME and POPE, indicating that it mitigates visual hallucinations without harming broader reasoning abilities. The ablation and sensitivity analyses identify the Factual Consistency Loss as the key contributor, with an optimal loss weight around $\lambda=1.0$, supporting the value of explicit factual penalties in stabilizing factual outputs in multimodal settings.

Abstract

Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended Question (OEQ) and Yes/No Question (YNQ) formats. Crucially, GVF maintains or even slightly improves performance on general multimodal benchmarks like MME and POPE, demonstrating effective mitigation of visual hallucinations without compromising general understanding and reasoning abilities.
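
To make the role of the loss weight $\lambda$ concrete, the sketch below shows one way a factual penalty could be combined with a standard language-modeling loss. This page does not give the exact form of the Factual Consistency Loss, so everything here is an assumption for illustration: the additive combination `lm_loss + lam * fc_loss`, the hypothetical function names, and the choice of a log-likelihood penalty over probabilities that the model affirms true anchors and counter-factual statements.

```python
import math

EPS = 1e-12  # guards against log(0)


def factual_consistency_penalty(p_anchor_true, p_counterfactual_true):
    """Hypothetical penalty, NOT the paper's definition.

    p_anchor_true: probabilities the model assigns to affirming real factual anchors.
    p_counterfactual_true: probabilities the model assigns to affirming
        counter-factual statements (these should be rejected).
    Returns the mean negative log-likelihood of the desired answers.
    """
    terms = [-math.log(max(p, EPS)) for p in p_anchor_true]
    terms += [-math.log(max(1.0 - p, EPS)) for p in p_counterfactual_true]
    return sum(terms) / len(terms)


def total_loss(lm_loss, fc_loss, lam=1.0):
    """Assumed combination: language-modeling loss plus a lambda-weighted factual penalty."""
    return lm_loss + lam * fc_loss


# Example: the model is fairly sure about one real anchor but also
# half-believes a counter-factual statement, which raises the penalty.
fc = factual_consistency_penalty([0.9], [0.5])
loss = total_loss(lm_loss=2.0, fc_loss=fc, lam=1.0)
```

Under this assumed form, setting `lam=0.0` recovers plain fine-tuning, while larger values weight factual agreement more heavily relative to the language-modeling objective.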

Paper Structure

This paper contains 28 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The core problem of Visual Hallucination in MLLMs and how GVF Finetuning leads to Factual Grounding in image understanding.
  • Figure 2: Sensitivity Analysis of Factual Consistency Loss Weight ($\lambda$)
  • Figure 3: Qualitative Comparison and Error Patterns