Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

Loris Giulivi, Giacomo Boracchi

TL;DR

This work tackles the interpretability challenge of multi-modal large language models by introducing a joint architecture $J$ that aligns an open-world localization encoder (OWL-ViT) with an MLLM (LLaVA) through a learned alignment $W$, so that text generation ($\mathbf{O}^{MLLM}$) and object localization ($\mathbf{O}^{OWL}$) are both produced from a single vision embedding $\mathbf{t}_i^{OWL}$. Because the two branches share the vision representation, the authors can develop a Gradient Alignment (GA) saliency map that explains any output token, visualize hallucinations via correlated detection outputs, and design semantic adversarial perturbations to probe biases. They describe training on Open Images, evaluate with a GPT-4 Vision judge and a bias benchmark, and show that the approach can expose misperceptions, provide interpretable explanations, and surface gender and ethnicity biases under controlled perturbations. The work advances practical interpretability for MLLMs and comes with a public code release, with applications in bias auditing, hallucination detection, and trustworthy multi-modal reasoning.

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in understanding and generating content across various modalities, such as images and text. However, their interpretability remains a challenge, hindering their adoption in critical applications. This research proposes a novel approach to enhance the interpretability of MLLMs by focusing on the image embedding component. We combine an open-world localization model with an MLLM, thus creating a new architecture able to simultaneously produce text and object localization outputs from the same vision embedding. The proposed architecture greatly promotes interpretability, enabling us to design a novel saliency map to explain any output token, to identify model hallucinations, and to assess model biases through semantic adversarial perturbations.
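
The shared-embedding idea can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes the alignment $W$ is a single linear map, uses random arrays in place of real OWL-ViT and LLaVA weights, and all names and sizes (`shared_forward`, `query_emb`, `box_head`, `N_PATCHES`, and so on) are hypothetical. It only shows how one set of vision tokens can feed both a language branch and a detection branch.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
N_PATCHES, D_OWL, D_LLM, N_QUERIES = 6, 8, 12, 3


def shared_forward(t_owl, W, query_emb, box_head):
    """Feed one set of vision tokens to both branches.

    t_owl:     (N_PATCHES, D_OWL) vision tokens from the localization encoder
    W:         (D_OWL, D_LLM) alignment map (assumed linear in this sketch)
    query_emb: (N_QUERIES, D_OWL) embeddings of the text queries to detect
    box_head:  (D_OWL, 4) toy box regressor
    """
    # Language branch: project vision tokens into the LLM input space.
    llm_tokens = t_owl @ W
    # Detection branch: per-patch similarity to each text query.
    logits = t_owl @ query_emb.T
    # Detection branch: one box per patch, squashed into (0, 1) by a sigmoid.
    boxes = 1.0 / (1.0 + np.exp(-(t_owl @ box_head)))
    return llm_tokens, logits, boxes


rng = np.random.default_rng(0)
t_owl = rng.normal(size=(N_PATCHES, D_OWL))      # stand-in for encoder output
W = rng.normal(size=(D_OWL, D_LLM))              # stand-in for learned alignment
query_emb = rng.normal(size=(N_QUERIES, D_OWL))  # stand-in for text queries
box_head = rng.normal(size=(D_OWL, 4))           # stand-in for box regressor

llm_tokens, logits, boxes = shared_forward(t_owl, W, query_emb, box_head)
print(llm_tokens.shape, logits.shape, boxes.shape)
```

Because both outputs are functions of the same `t_owl`, any perturbation of the vision tokens affects the text branch and the detection branch together, which is the property the interpretability tools described above rely on.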

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed architecture and its uses for interpretability.
  • Figure 2: Example images that lead to hallucinations. The error is reflected both in the language output ($\mathbf{O}^{MLLM}$) and in the detection output ($\mathbf{O}^{OWL}$).
  • Figure 3: Example GA saliency maps for different objects in one MLLM output.
  • Figure 4: Example question (Q) and answers (A1, A2) from the user study.