Explaining Multi-modal Large Language Models by Analyzing their Vision Perception
Loris Giulivi, Giacomo Boracchi
TL;DR
This work tackles the interpretability challenge of multi-modal large language models (MLLMs) by introducing a joint architecture $J$ that aligns an open-world localization encoder (OWL-ViT) with an MLLM (LLaVA) through a learned alignment $W$, enabling simultaneous text generation $\mathbf{O}^{MLLM}$ and object localization $\mathbf{O}^{OWL}$ from a single vision embedding $\mathbf{t}_i^{OWL}$. Because both outputs share the vision representation, the authors can derive a Gradient Alignment (GA) saliency map that explains any output token, visualize hallucinations via the correlated detection outputs, and design semantic adversarial perturbations to probe biases. They report training details on Open Images and evaluate with a GPT-4 Vision judge and a bias benchmark, showing that the model can expose misperceptions, provide interpretable explanations, and reveal gender and ethnicity biases under controlled perturbations. The work advances practical interpretability for MLLMs and comes with a public code release, enabling applications in bias auditing, hallucination detection, and trustworthy multi-modal reasoning.
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in understanding and generating content across modalities such as images and text. However, their interpretability remains a challenge, hindering their adoption in critical applications. This research proposes a novel approach to enhance the interpretability of MLLMs by focusing on the image embedding component. We combine an open-world localization model with an MLLM, creating a new architecture that simultaneously produces text and object localization outputs from the same vision embedding. The proposed architecture greatly promotes interpretability, enabling us to design a novel saliency map that explains any output token, to identify model hallucinations, and to assess model biases through semantic adversarial perturbations.
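The shared-embedding idea can be illustrated with a toy sketch. It is not the authors' implementation: it assumes, purely for illustration, that the alignment $W$ is a linear map fit by least squares, and the function names, tensor shapes, and stand-in heads (`fit_alignment`, `joint_forward`, the two lambdas) are hypothetical. The only point it makes is that both outputs are computed from the same vision tokens.

```python
import numpy as np

def fit_alignment(t_owl, t_target):
    """Least-squares fit of W so that t_owl @ W approximates t_target.

    Assumption: a linear alignment; the paper's W may be trained differently.
    """
    W, *_ = np.linalg.lstsq(t_owl, t_target, rcond=None)
    return W

def joint_forward(t_owl, W, localization_head, language_head):
    """Feed one set of vision tokens to both heads (hypothetical interface)."""
    o_owl = localization_head(t_owl)    # localization output from raw tokens
    o_mllm = language_head(t_owl @ W)   # text-side output from aligned tokens
    return o_mllm, o_owl

# Synthetic data: 64 vision tokens of width 16, mapped to a width-32 space.
rng = np.random.default_rng(0)
t_owl = rng.normal(size=(64, 16))
W_true = rng.normal(size=(16, 32))
t_target = t_owl @ W_true

W = fit_alignment(t_owl, t_target)

# Stand-in heads; real heads would be the OWL-ViT detector and the LLM.
o_mllm, o_owl = joint_forward(
    t_owl,
    W,
    localization_head=lambda t: t.sum(axis=1),  # one score per token
    language_head=lambda e: e.mean(axis=0),     # one pooled feature vector
)
```

Since the text-side and localization outputs depend on the same `t_owl`, a perturbation or gradient taken with respect to those tokens affects both, which is the property the saliency and hallucination analyses rely on.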
