DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

Walid Bousselham; Angie Boggust; Hendrik Strobelt; Hilde Kuehne

TL;DR

DEX-AR (Dynamic Explainability for AutoRegressive models) is a novel explainability method for autoregressive VLMs that generates both per-token and sequence-level 2D heatmaps highlighting the image regions crucial for the model's textual responses.

Abstract

As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, understanding their decision-making process becomes increasingly crucial. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The method interprets autoregressive VLMs, accounting for the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement on both perturbation-based metrics, which use a novel normalized perplexity measure, and segmentation-based metrics.
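The pipeline described in the abstract (gradient-weighted attention, head filtering by visual focus, then weighted aggregation over tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `token_heatmap` and `sequence_heatmap`, the choice of image-attention share as the head weight, the positive-part clipping, and the use of the mean head weight as a token score are all assumptions made for illustration. The attention and gradient arrays are assumed to have been extracted from a model beforehand.

```python
import numpy as np

def token_heatmap(attn, grad, n_img):
    """Per-token attribution over image patches (illustrative only).

    attn, grad: arrays of shape (layers, heads, seq_len) holding the attention
    of the current generated token over all earlier positions, and the gradient
    of that token's score with respect to the attention.
    n_img: number of leading positions that are image patches (assumed layout).
    """
    # Gradient-weighted attention, keeping positive contributions only (assumption).
    contrib = np.maximum(attn * grad, 0.0)
    # Hypothetical head weight: share of each head's attention mass on image tokens.
    head_w = attn[..., :n_img].sum(axis=-1) / (attn.sum(axis=-1) + 1e-12)
    # Weighted sum over layers and heads, restricted to image positions.
    heat = (head_w[..., None] * contrib[..., :n_img]).sum(axis=(0, 1))
    # Hypothetical token score: average visual focus across heads and layers.
    return heat, float(head_w.mean())

def sequence_heatmap(token_maps, token_scores):
    """Aggregate per-token maps with normalized token-level weights."""
    w = np.asarray(token_scores, dtype=float)
    w = w / (w.sum() + 1e-12)
    return np.tensordot(w, np.stack(token_maps), axes=1)
```

Under these assumptions, a token generated mostly from textual context receives a low score and contributes little to the sequence-level map. The resulting 1D patch vector can be reshaped to the patch grid to obtain a 2D heatmap.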

Paper Structure (55 sections, 25 equations, 10 figures, 8 tables)

Introduction
Related Works
Method
Autoregressive Vision-Language Models
Explainability per Token
Sequence-level Explainability Map
Experiments
Datasets and Tasks
Perturbation
Segmentation-based Evaluation
Filler-words Filtering Evaluation
Experimental Setup
Implementation Details
Baselines
Perturbation Results
...and 40 more sections

Figures (10)

Figure 1: Example of token-level and sentence-level attribution maps for Vision-Language Models (VLMs). Given an input image and prompt, DEX-AR produces per-token heatmaps highlighting relevant image regions for each generated word. These are then aggregated into a final sentence-level heatmap using token-specific weighting scores that reflect visual relevance.
Figure 2: Architecture overview of DEX-AR. At each layer $l$, head $i$ and generation step $t$, gradients of attention maps are computed and weighted based on their relative focus on visual versus textual tokens to produce attribution maps.
Figure 3: Qualitative Comparison
Figure 4: PascalVOC-QA: example of the dataset used to evaluate the quality of the filtering.
Figure 5: Qualitative examples of the heatmap generated by DEX-AR for different VLMs with and without filtering.
...and 5 more figures
