A Concept-Based Explainability Framework for Large Multimodal Models

Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord

TL;DR

This work tackles the interpretability of large multimodal models by introducing CoX-LMM, a dictionary-learning framework that extracts multimodal concepts tied to a target token. By constructing a token-centered representation matrix from LMM internals and applying a Semi-NMF decomposition, the method yields a dictionary of concept vectors whose activations can be grounded in both vision (via visual samples) and text (via the unembedding of the LLM). The approach is validated on COCO-based captioning models (e.g., DePALM) and corroborated with LLaVA experiments, showing meaningful multimodal grounding, balanced disentanglement, and useful local interpretations for test samples. The results indicate that deeper Transformer layers better reveal multimodal structure, giving more transparent insight into how LMMs represent and process multimodal information, with potential to enhance trust and debugging in practical deployments.
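
The decomposition mentioned above can be written compactly using the symbols from Figure 1 ($\mathbf{Z}$, $\mathbf{U}$, $\mathbf{V}$). The dimension labels $B$ (representation size), $M$ (number of samples) and $K$ (number of concepts) are assumptions introduced here for illustration, and the objective is the generic Semi-NMF form; the paper's exact objective may include additional regularization or constraints.

```latex
\mathbf{Z} \approx \mathbf{U}\mathbf{V},
\qquad
\mathbf{Z} \in \mathbb{R}^{B \times M},\;
\mathbf{U} \in \mathbb{R}^{B \times K},\;
\mathbf{V} \in \mathbb{R}^{K \times M}

\min_{\mathbf{U},\,\mathbf{V}} \; \lVert \mathbf{Z} - \mathbf{U}\mathbf{V} \rVert_F^2
\quad \text{subject to} \quad \mathbf{V} \ge 0
```

Only the activations $\mathbf{V}$ are constrained to be non-negative; the dictionary $\mathbf{U}$ may take mixed signs, which is what distinguishes Semi-NMF from standard NMF.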

Abstract

Large multimodal models (LMMs) combine unimodal encoders and large language models (LLMs) to perform multimodal tasks. Despite recent advances in the interpretability of these models, the internal representations of LMMs remain poorly understood. In this paper, we present a novel framework for the interpretation of LMMs. We propose a dictionary-learning-based approach, applied to the representation of tokens. The elements of the learned dictionary correspond to our proposed concepts. We show that these concepts are semantically well grounded in both vision and text, and we therefore refer to them as "multimodal concepts". We evaluate the learned concepts both qualitatively and quantitatively. We show that the extracted multimodal concepts are useful for interpreting the representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. Our code is publicly available at https://github.com/mshukor/xl-vlms
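
As a rough illustration of the dictionary-learning step, the sketch below factorizes a toy representation matrix with Semi-NMF using alternating least-squares and multiplicative updates, a common scheme for this factorization. It is a minimal sketch and not the authors' implementation: the function name `semi_nmf`, the toy dimensions, and the random data are all assumptions, and any sparsity or norm constraints used in the paper are omitted.

```python
import numpy as np


def semi_nmf(Z, n_concepts, n_iter=200, seed=0, eps=1e-9):
    """Approximate Z (d x m) as U @ V with V >= 0 and U unconstrained.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    _, m = Z.shape
    # G holds V transposed (m x K); initialised strictly positive.
    G = rng.random((m, n_concepts)) + 0.1

    def pos(A):
        return (np.abs(A) + A) / 2.0  # positive part of A

    def neg(A):
        return (np.abs(A) - A) / 2.0  # magnitude of the negative part of A

    for _ in range(n_iter):
        # Dictionary update: least-squares solution for U with G fixed.
        U = Z @ G @ np.linalg.pinv(G.T @ G)
        # Activation update: multiplicative rule that keeps G non-negative.
        A = Z.T @ U   # (m x K)
        B = U.T @ U   # (K x K)
        G *= np.sqrt((pos(A) + G @ neg(B)) / (neg(A) + G @ pos(B) + eps))

    # Final dictionary consistent with the last activations.
    U = Z @ G @ np.linalg.pinv(G.T @ G)
    return U, G.T


# Toy data (assumed): 16-dimensional representations of 50 samples, 4 concepts.
rng = np.random.default_rng(1)
U_true = rng.normal(size=(16, 4))
V_true = rng.random((4, 50))
Z = U_true @ V_true

U, V = semi_nmf(Z, n_concepts=4)
rel_err = np.linalg.norm(Z - U @ V) / np.linalg.norm(Z)
print("relative reconstruction error:", rel_err)
```

Each column of `U` plays the role of one concept vector, and each row of `V` gives that concept's activation on every sample.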

Paper Structure

This paper contains 43 sections, 6 equations, 23 figures, 13 tables.

Figures (23)

  • Figure 1: Overview of multimodal concept extraction and grounding in CoX-LMM. Given a pretrained LMM $f$ for captioning and a target token $t$ (e.g., 'Dog'), our method extracts the internal representations of $f$ for $t$ across many images. These representations are collated into a matrix $\mathbf{Z}$. We linearly decompose $\mathbf{Z}$ to learn a concept dictionary $\mathbf{U}$ and its coefficients/activations $\mathbf{V}$. Each concept $u_k \in \mathbf{U}$ is multimodally grounded in both the visual and textual domains. For text grounding, we compute the set of most probable words $\mathbf{T}_k$ by decoding $u_k$ through the unembedding matrix $W_U$. Visual grounding $\mathbf{X}_{k, MAS}$ is obtained via $v_k$ as the set of most activating samples.
  • Figure 2: Example of multimodal concept grounding in vision and text. The five most activating samples (among those decomposed in $\mathbf{Z}$) and the five most probable decoded words are shown.
  • Figure 3: Evaluating visual/text grounding (CLIPScore/BERTScore). Each point denotes the score for the grounded words of a concept (Semi-NMF) vs. Rnd-Words w.r.t. the same visual grounding.
  • Figure 4: Visual/textual grounding for 8 out of 20 concepts for the 'Dog' token (layer 31). For each concept we illustrate the set of 5 most activating samples and 5 most probable decoded words.
  • Figure 5: Local interpretations for test samples for different tokens ('Dog', 'Cat', 'Bus') with Semi-NMF (layer 31). Visual/text grounding for three highest concept activations (normalized) is shown.
  • ...and 18 more figures
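
The two grounding steps described in the Figure 1 caption can be sketched as follows: a concept vector $u_k$ is decoded through the unembedding matrix $W_U$ to get its most probable words, and its activation row $v_k$ is sorted to get the most activating samples. This is a minimal sketch under stated assumptions: the helper `ground_concept`, the five-word vocabulary, and all numeric values are hypothetical, and any normalization or word filtering applied in the paper is left out.

```python
import numpy as np


def ground_concept(u_k, v_k, W_U, vocab, top_words=5, top_samples=5):
    """Return (top decoded words, indices of most activating samples).

    Hypothetical helper for illustration only.
    """
    # Text grounding: score every vocabulary entry by projecting u_k
    # through the unembedding matrix (vocab_size x d).
    logits = W_U @ u_k
    word_idx = np.argsort(logits)[::-1][:top_words]
    # Visual grounding: samples with the largest activation for this concept.
    sample_idx = np.argsort(v_k)[::-1][:top_samples]
    return [vocab[i] for i in word_idx], sample_idx.tolist()


# Toy setup (all values assumed): 5-word vocabulary, 3-dimensional concepts.
vocab = ["dog", "puppy", "grass", "bus", "cat"]
W_U = np.array([
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [-0.5, 0.2, 0.7],
    [0.3, 0.0, 0.4],
])
u_k = np.array([1.0, 0.0, 0.0])        # one concept vector (a column of U)
v_k = np.array([0.0, 2.0, 0.5, 1.5])   # its activations over 4 samples

words, samples = ground_concept(u_k, v_k, W_U, vocab, top_words=2, top_samples=2)
print(words, samples)
```

In practice `samples` would index into the set of images whose representations were collected in $\mathbf{Z}$, so the returned indices identify which images to display for the concept.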