Explaining latent representations of generative models with large multimodal models

Mengdan Zhu, Zhenke Liu, Bo Pan, Abhinav Angirekula, Liang Zhao

TL;DR

This work addresses the interpretability of generative latent factors by using large multimodal models (LMMs) to explain each latent dimension. The framework progressively varies a latent variable, renders the result as an image sequence, and prompts an LMM to describe the pattern. An uncertainty score derived from cross-sample similarities, compared against an empirically learned threshold, then selects reliable explanations. Across MNIST, dSprites, and 3dshapes and three VAE variants, GPT-4-vision consistently delivers higher-quality explanations than Bard, LLaVA, or InstructBLIP, achieving high interpretability accuracy (AUC up to 0.9694). The results also reveal a link between latent disentanglement and explanation reliability, and they highlight current LMM limitations in geometry and color understanding that constrain the explanations.
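The perturbation step described above can be illustrated with a minimal sketch. It assumes a latent vector `z` from some trained VAE; the function name `latent_traversal`, the traversal range of -3 to 3, the step count, and the `toy_decoder` stand-in are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def latent_traversal(z, dim, low=-3.0, high=3.0, steps=7):
    """Return `steps` copies of `z` in which only coordinate `dim` varies.

    The range [low, high] and the number of steps are assumptions,
    not values reported in the paper.
    """
    z = np.asarray(z, dtype=float)
    values = np.linspace(low, high, steps)   # evenly spaced perturbation values
    sequence = np.tile(z, (steps, 1))        # shape: (steps, latent_dim)
    sequence[:, dim] = values                # overwrite only the chosen dimension
    return sequence


def toy_decoder(latents, image_side=4, seed=0):
    """Hypothetical stand-in for a trained VAE decoder.

    Maps each latent vector to a small image through a fixed random
    linear projection followed by a sigmoid.
    """
    latents = np.asarray(latents, dtype=float)
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(latents.shape[1], image_side * image_side))
    logits = latents @ weights
    images = 1.0 / (1.0 + np.exp(-logits))   # pixel intensities in (0, 1)
    return images.reshape(-1, image_side, image_side)


if __name__ == "__main__":
    z = np.zeros(4)                           # base latent code
    sequence = latent_traversal(z, dim=2, steps=5)
    images = toy_decoder(sequence)
    print(sequence[:, 2])                     # values taken by the varied dimension
    print(images.shape)                       # one image per traversal step
```

In a real pipeline, the decoded frames would be concatenated into a single image sequence and passed to the LMM together with the prompt.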

Abstract

Learning interpretable representations of data-generative latent factors is an important topic for the development of artificial intelligence. Large multimodal models, which have recently risen to prominence, can align images with text to generate answers. In this work, we propose a framework that comprehensively explains each latent variable in a generative model using a large multimodal model. We further measure the uncertainty of the generated explanations, quantitatively evaluate explanation generation across multiple large multimodal models, and qualitatively visualize the variations of each latent variable to study how the disentanglement achieved by different generative models affects the explanations. Finally, we discuss the explanatory capabilities and limitations of state-of-the-art large multimodal models.

Paper Structure (10 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: The framework generates an image sequence by progressively varying a latent variable, passes the sequence together with a prompt to a large multimodal model to obtain several response samples, and finally applies an uncertainty measure to select an explanation for that latent variable and to decide whether there is a clear explanation to display.
  • Figure 2: The sample explanations generated by our framework. The latent variables are highlighted in bold, and the patterns of the latent variables are in italics and underlined.
  • Figure 3: Sample images with clear patterns and sample prompts for GPT-4-vision to generate explanations.
  • Figure 4: Sample images with unclear patterns and sample prompts for GPT-4-vision to generate explanations.
  • Figure 5: The sample explanation generated by GPT-4-vision for the MNIST dataset.
  • ...and 4 more figures
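
The selection step in the Figure 1 caption (several response samples, an uncertainty measure, and a decision on whether a clear explanation exists) can be sketched as follows. This is only an illustration under stated assumptions: this excerpt does not give the paper's similarity measure or threshold, so the token-level Jaccard similarity, the threshold of 0.5, and the names `jaccard` and `select_explanation` are hypothetical stand-ins. The sketch reports an agreement score, where higher agreement corresponds to lower uncertainty.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard similarity (a stand-in, not the paper's metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_explanation(samples, threshold=0.5, similarity=jaccard):
    """Pick the most representative sample, or None if agreement is too low.

    Returns (explanation_or_None, agreement), where agreement is the mean
    pairwise similarity across all sampled responses. The threshold value
    is an assumption, not the one learned in the paper.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two response samples")

    # Accumulate each sample's similarity to every other sample.
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = similarity(samples[i], samples[j])
        totals[i] += s
        totals[j] += s

    mean_similarity = [t / (n - 1) for t in totals]
    agreement = sum(mean_similarity) / n

    # The sample most similar to the others serves as the explanation.
    best = max(range(n), key=mean_similarity.__getitem__)
    explanation = samples[best] if agreement >= threshold else None
    return explanation, agreement


if __name__ == "__main__":
    consistent = [
        "the digit gets thicker",
        "the digit gets thicker",
        "the digit gets bolder",
    ]
    inconsistent = [
        "the shape rotates",
        "colors become darker",
        "no clear pattern",
    ]
    print(select_explanation(consistent))
    print(select_explanation(inconsistent))
```

Passing a different `similarity` callable (for example, one based on sentence embeddings) changes the agreement score without changing the selection logic.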