MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain, Alexandros Stergiou

Abstract

Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limits transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for the autoregressive processing of VLMs. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

Paper Structure

This paper contains 7 sections, 5 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) inverts VLMs by synthesizing visual inputs that best correspond to VLM tokens and internal embeddings. The synthesized images represent the dominant visual features associated with predicted tokens.
  • Figure 2: MIMIC inversion iteratively optimizes an updatable input $\widehat{\mathbf{v}}$ with an adapted cross-entropy loss, $\mathcal{L}_{SCE}$, to maximize the probability distribution of [target] VLM token(s), and a base feature loss, $\mathcal{L}_{base}$, to match layer statistics to target mean and variance within the distribution manifold for vision tokens. Regularizers $\mathcal{R}$ are added to promote variance consistency, visual coherence across tokens, and perceptual quality. Optimized visual inputs are shown from inverting visual-instruct-tuned LLaMA3-8B alongside the per-[target]-token probability $P(\texttt{[target]})$.
  • Figure 2: Predicted to [target] VLM outputs across text similarity metrics (BLEU, METEOR, ROUGE-L) on LLaMA3-8B. Results are grouped by text length.
  • Figure 3: Qualitative examples of synthesized visual inputs with MIMIC. Each row optimizes different token logits. The number of tokens corresponding to a [target] semantic concept differs per row.
  • Figure 4: Synthesized images over varying text prompts for dock, magnetic compass, and obelisk. $\mathbf{t}_1$: What is shown in the image? a. [target] or b. [negative]; $\mathbf{t}_2$: Does the image show an instance of [target] or [negative]?; and $\mathbf{t}_3$: The image depicts a scene that corresponds to [target] or [negative]?
  • ...and 4 more figures
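
One of the captions above compares predicted and [target] VLM outputs with text similarity metrics, including ROUGE-L. As a rough illustration of what such a score measures, the sketch below computes a ROUGE-L style F1 from the longest common subsequence (LCS) of two token lists. It is not the paper's evaluation code: the function names `lcs_length` and `rouge_l_f1`, the lowercase whitespace tokenization, and the unweighted F1 variant are all assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted, target):
    """ROUGE-L style F1 between two strings (hypothetical helper).

    Assumes lowercase whitespace tokenization and an unweighted F1;
    published implementations may tokenize or weight differently.
    """
    pred_tokens = predicted.lower().split()
    target_tokens = target.lower().split()
    if not pred_tokens or not target_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, target_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)   # share of predicted tokens in the LCS
    recall = lcs / len(target_tokens)    # share of target tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy strings chosen for illustration only, not results from the paper.
    print(rouge_l_f1("a compass on a table", "a magnetic compass"))
```

Because the LCS respects word order but allows gaps, a longer predicted description can still score well against a short [target] phrase as long as the shared words appear in the same order.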