LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, Vasudev Lal

TL;DR

LVLM-Interpret integrates raw attention visualization, relevancy maps, and causal explanations into an interactive tool for interrogating large vision-language models (LVLMs). The paper demonstrates the interface and presents a case study on LLaVA showing how attention and relevancy relate to image grounding and failure modes. The findings suggest that LVLMs can rely to varying degrees on text versus image content, and that the causal-interpretation framework helps identify minimal input subsets driving an output. The work offers a practical aid for debugging, improving grounding, and guiding future LVLM development.

Abstract

In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.

Paper Structure (10 sections, 6 figures)

Figures (6)

  • Figure 1: Main interface of LVLM-Interpret. Users can issue multimodal queries using a chatbot interface. A basic image-editing feature allows for model probing.
  • Figure 2: Visualization of cross-modal attentions.
  • Figure 3: Causality-based explanation for the token 'yellow' in the generated answer 'The man's shirt is yellow' at head 24. (a) Top 50 image-tokens having the highest raw attention values. Each serves as a graph node. (b-e) Image tokens from the explanation set identified by the CLEANN method, at different search distances on the learned causal graph. Tokens are marked with yellow blobs.
  • Figure 4: A tree constructed from the causal graph from which explanations for the token 'yellow' are extracted. Arc radius indicates distance on the causal graph. Edges are color coded: bi-directed edges indicate a latent confounder, and a circle edge-mark indicates that both a 'tail' and an 'arrow' are valid.
  • Figure 5: Example where LLaVA seems to prioritize text input over image content. Presented with an unchanging image of a garbage truck, the model provides contradictory responses ('yes, the door is open' vs. 'yes, the door is closed') depending on the query's phrasing. Relevancy maps and bar plots for the 'open' and 'closed' tokens show higher relevance for the text than for the image.
  • ...and 1 more figure
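
The Figure 3(a) caption describes keeping the 50 image tokens with the highest raw attention values as graph nodes. A minimal sketch of that selection step follows. It assumes the attention weights from one generated token to every image token, at a single head, have already been extracted into a flat list, and that image tokens are laid out row by row on a 24x24 patch grid. The function name, the input layout, and the grid width are illustrative assumptions, not the tool's actual implementation.

```python
def top_k_image_tokens(attn, k=50, grid_width=24):
    """Rank image tokens by raw attention and return the top k.

    attn: sequence of attention weights, one per image token, for a single
          generated token at a single head (assumed layout, row-major).
    k: number of tokens to keep (50 in the Figure 3(a) caption).
    grid_width: patches per row of the image grid (assumed value).

    Returns a list of (token_index, (row, col)) pairs, highest attention first.
    """
    if k < 1 or grid_width < 1:
        raise ValueError("k and grid_width must be positive")
    # Sort token indices by attention weight, largest first.
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    # Map each kept index back to its (row, col) position on the patch grid.
    return [(i, divmod(i, grid_width)) for i in ranked[:k]]


if __name__ == "__main__":
    # Toy attention vector: three patches stand out from a zero background.
    weights = [0.0] * 576
    weights[5], weights[30], weights[575] = 0.9, 0.5, 0.7
    for index, (row, col) in top_k_image_tokens(weights, k=3):
        print(index, row, col)
```

The returned grid positions can be used to overlay the selected patches on the input image. The causal-graph search over these nodes (the CLEANN step in the caption) is not covered by this sketch.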