
EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models

Andrés Villa, Juan León Alcázar, Motasem Alfarra, Vladimir Araujo, Alvaro Soto, Bernard Ghanem

TL;DR

EAGLE addresses hallucinations in instructional multimodal models by directly improving visual grounding through a post-pretraining refinement of the Vision Transformer. It introduces a language-aligned, patch-level grounding scheme built on masked pooling and a dual-loss objective that jointly aligns visual features with object-language prompts while maintaining global representations. The method relies on OpenImages V7 for fine-grained supervision and uses a parameter-efficient GaLore-based fine-tuning strategy to avoid disturbing pre-trained distributions. Across three hallucination benchmarks (POPE, MMVP, MERLIM) and six IT-VLMs, EAGLE yields consistent reductions in false positives and hidden hallucinations while preserving zero-shot and linear probing capabilities, demonstrating strong practical impact and broad compatibility with instructional tuning. The work highlights a scalable, data-efficient path to safer multimodal models without architectural overhauls or extensive data curation beyond grounding supervision.

Abstract

Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM or fusion module and works as a post-pretraining approach that improves the grounding and language alignment of the visual encoder. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training. As a result, EAGLE achieves a significant reduction in hallucinations across multiple challenging benchmarks and tasks.

Paper Structure (29 sections, 3 equations, 4 figures, 8 tables)

This paper contains 29 sections, 3 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: EAGLE visual encoders reduce hallucinations in IT-VLMs. We present three example scenarios, each featuring a question about an image input to a specific IT-VLM with its original visual encoder (left box in pink) and the corresponding EAGLE-tuned visual encoder (right box in orange). “IB7”, “IB13”, and “IBT5” refer to InstructBLIP with Vicuna7B, Vicuna13B, and FlanT5xl, respectively. EAGLE substantially reduces hallucinations, providing more visually grounded and reliable descriptions. i) In the first example (left), IB13 with the EAGLE-tuned visual encoder can clearly identify fine-grained elements such as a fence, house, and trees. ii) In the second example (center), EAGLE helps IB7 to accurately recognize a dog, even in an unusual viewpoint and context. iii) In the third example (right), EAGLE enhances object localization, allowing the model to precisely identify the laptop’s position. These examples illustrate EAGLE’s effectiveness in improving visual grounding across complex, multi-object scenes.
  • Figure 2: Overview of the EAGLE method. EAGLE reduces hallucinations in IT-VLMs by improving the grounding of the image encoder. In the post-pretraining phase (left), EAGLE enhances fine-grained visual representations by employing masked average pooling (red dashed lines). This operation selects the embeddings within the feature sequence that correspond to a specific object and computes their average. EAGLE then enforces local alignment between this averaged embedding and the language representation of the object (green dashed lines). The resulting image encoder integrates with any IT-VLM (right) at inference time, effectively reducing hallucinations without requiring any further tuning.
  • Figure 3: Visual Examples Demonstrating EAGLE's Effectiveness in Reducing Hallucinations in IT-VLMs. We present three additional illustrative scenarios, each featuring a question about an image processed by a specific IT-VLM using its original visual encoder (left, pink box) and the corresponding EAGLE-tuned visual encoder (right, orange box). The models evaluated include “LLA” (LLaVA-1.5), “LLA*” (LLaVA-1.5*), and “BT5” (BLIP-2 with FlanT5xl). EAGLE demonstrates a significant reduction in hallucinations, providing more visually grounded and reliable descriptions. Correct and incorrect predicted nouns are marked by green and red, respectively.
  • Figure 4: Visual Examples of EAGLE Enhancing Visual Grounding of VLMs. We assess the ability of two VLMs, EVA-01-CLIP-g-14 and OpenAI CLIP-L-14-336, and their corresponding EAGLE-tuned versions (blue boxes) to embed fine-grained visual details in the sequence features, using the MMVP-VLM benchmark. Through two visual examples per model, we show that EAGLE effectively captures subtle visual information in images, enabling it to correctly align image-text pairs even when the images differ only in small, specific features. Correct and incorrect alignments are marked by green and red arrows, respectively.
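
The masked average pooling described in the Figure 2 caption can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The function names, the binary patch mask, the toy embeddings, and the use of cosine similarity as the alignment score are all assumptions made for illustration. The sketch only shows the idea of averaging the patch embeddings that belong to one object and comparing the result with a text embedding for that object.

```python
import math


def masked_average_pool(patch_embeddings, object_mask):
    """Average the patch embeddings whose mask entry is truthy.

    patch_embeddings: list of equal-length float lists (one per patch).
    object_mask: list of 0/1 flags, one per patch (hypothetical format).
    """
    selected = [emb for emb, keep in zip(patch_embeddings, object_mask) if keep]
    if not selected:
        raise ValueError("object mask selects no patches")
    dim = len(selected[0])
    return [sum(emb[d] for emb in selected) / len(selected) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed alignment score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: 4 patches with 3-dimensional embeddings (made-up values).
patches = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
mask = [1, 1, 0, 0]  # assumption: the object covers patches 0 and 1

# Pool only the object's patches into a single embedding.
object_embedding = masked_average_pool(patches, mask)

# Stand-in for the text encoder's embedding of the object prompt.
text_embedding = [1.0, 1.0, 0.0]
score = cosine_similarity(object_embedding, text_embedding)
```

In a training setup, a score of this kind would feed a local alignment loss alongside a term that preserves the global representation. The exact form of those losses is not given in this section, so they are left out of the sketch.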