FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability

Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal

TL;DR

FiVL introduces a framework to improve vision-language alignment by constructing fine-grained, pixel-level grounding datasets and dedicated evaluation protocols. It pairs a novel training objective, Vision Modeling, with segmentation-grounded data to jointly optimize visual and textual representations, yielding improved performance across multiple benchmarks. The framework also provides evaluation datasets and a Visual Reliance Score that quantifies image dependence, and it offers an explainability pathway by identifying attention heads with strong vision-language alignment. Collectively, FiVL demonstrates that grounding-aware training, rigorous visual-reliance evaluation, and interpretable attention mechanisms can reduce visual hallucinations and improve LVLM robustness and transparency.

Abstract

Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring that these models use visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise from the lack of effective visual grounding in current LVLMs. Furthermore, current vision-language benchmarks do not specifically measure the degree to which the answer requires the visual input. This limitation makes it challenging to confirm that the image is truly necessary, particularly in tasks like visual question answering. In this work, we introduce FiVL, a novel method for constructing datasets designed both to train LVLMs for enhanced visual grounding and to evaluate how effectively they achieve it. We demonstrate the value of our datasets through three approaches. First, we introduce a novel training task based on our augmented training dataset, resulting in better performance than the baseline. Second, we present benchmarks to assess the model's ability to use the image as substantive evidence, rather than relying solely on linguistic priors. Finally, we identify attention heads with the strongest vision-language alignment, enabling explainability of visual-driven hallucinations. The code is available at https://github.com/IntelLabs/fivl.

Paper Structure

This paper contains 36 sections, 3 equations, 18 figures, and 7 tables.

Figures (18)

  • Figure 1: Dataset Collection Overview. First, GPT-4o processes the question and answer to produce "key expressions", which are then passed to GroundedSAM along with the image to produce segmentation maps.
  • Figure 2: Overview of Vision Modeling pretraining task.
  • Figure 3: Our model trained on FiVL-Instruct evaluated on various benchmarks compared to the baseline.
  • Figure 4: Baseline
  • Figure 5: Our model
  • ...and 13 more figures