DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne

TL;DR

This paper presents DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method for autoregressive VLMs that generates both per-token and sequence-level 2D heatmaps highlighting the image regions crucial for the model's textual responses.

Abstract

As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, understanding their decision-making process becomes ever more important. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs, accounting for the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows consistent improvements on both perturbation-based metrics, which use a novel normalized perplexity measure, and segmentation-based metrics.
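
The head filtering idea in the abstract can be illustrated with a minimal sketch for a single layer and a single generation step. It is not the paper's formulation: the function name `head_weighted_attribution`, the assumption that image patches occupy the first `num_visual` sequence positions, the use of each head's share of attention on image patches as its weight, and the clipping of gradients to their positive part are all choices made here for illustration.

```python
import numpy as np

def head_weighted_attribution(attn, grad, grid_shape):
    """Combine per-head attention gradients into one 2D heatmap.

    attn: (heads, seq_len) attention of the current generated token
          over the input sequence, for one layer.
    grad: (heads, seq_len) gradient of the generated token's score
          with respect to `attn`.
    grid_shape: (rows, cols) of the image patch grid; the first
          rows * cols sequence positions are assumed to be image patches.
    """
    attn = np.asarray(attn, dtype=float)
    grad = np.asarray(grad, dtype=float)
    num_visual = grid_shape[0] * grid_shape[1]

    # Hypothetical head weight: share of the head's attention mass
    # that falls on image patches rather than text tokens.
    visual_mass = attn[:, :num_visual].sum(axis=1)
    total_mass = attn.sum(axis=1)
    head_weight = visual_mass / np.maximum(total_mass, 1e-12)

    # Keep only positive gradients on image positions (an assumption).
    positive_grad = np.maximum(grad[:, :num_visual], 0.0)

    # Weighted sum over heads, reshaped to the patch grid.
    heat = (head_weight[:, None] * positive_grad).sum(axis=0)
    return heat.reshape(grid_shape)
```

In this sketch, a head that attends mostly to text tokens contributes little to the heatmap, which is one simple way to realize "heads focused on visual information"; a full method would also combine the maps across layers.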

Paper Structure

This paper contains 55 sections, 25 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: Example of token-level and sentence-level attribution maps for Vision-Language Models (VLMs). Given an input image and prompt, DEX-AR produces per-token heatmaps highlighting relevant image regions for each generated word. These are then aggregated into a final sentence-level heatmap using token-specific weighting scores that reflect visual relevance.
  • Figure 2: Architecture overview of DEX-AR. At each layer $l$, head $i$ and generation step $t$, gradients of attention maps are computed and weighted based on their relative focus on visual versus textual tokens to produce attribution maps.
  • Figure 3: Qualitative Comparison
  • Figure 4: PascalVOC-QA: example of the dataset used to evaluate the quality of the filtering.
  • Figure 5: Qualitative examples of the heatmap generated by DEX-AR for different VLMs with and without filtering.
  • ...and 5 more figures
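
The aggregation described in the Figure 1 caption, where per-token heatmaps are combined using token-specific weighting scores, might look like the following minimal sketch. The function name `aggregate_token_heatmaps` and the normalization of scores into weights are assumptions; how the scores themselves are computed is not given in this listing, so they are treated as an input.

```python
import numpy as np

def aggregate_token_heatmaps(token_maps, token_scores):
    """Weighted average of per-token heatmaps.

    token_maps: (T, H, W) array, one heatmap per generated token.
    token_scores: (T,) non-negative visual relevance scores
                  (hypothetical input; higher means more visually grounded).
    Returns an (H, W) sequence-level heatmap.
    """
    maps = np.asarray(token_maps, dtype=float)
    scores = np.asarray(token_scores, dtype=float)

    total = scores.sum()
    if total <= 0:
        # No token is considered visually relevant: return an empty map.
        return np.zeros(maps.shape[1:])

    # Normalize scores so the weights sum to one, then contract over tokens.
    weights = scores / total
    return np.tensordot(weights, maps, axes=1)
```

With scores near zero for purely linguistic tokens such as articles or punctuation, those tokens' heatmaps barely affect the sequence-level result.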