Explaining latent representations of generative models with large multimodal models
Mengdan Zhu, Zhenke Liu, Bo Pan, Abhinav Angirekula, Liang Zhao
TL;DR
This work tackles the interpretability of generative latent factors by using large multimodal models (LMMs) to explain each latent dimension. For each dimension, it perturbs the latent value, decodes the results into an image sequence, and prompts an LMM to explain the change it sees. An uncertainty score derived from cross-sample similarities, compared against an empirically learned threshold, then selects the reliable explanations. Across MNIST, dSprites, and 3dshapes and three VAE variants, GPT-4-vision consistently delivers higher-quality explanations than Bard, LLaVA, or InstructBLIP, achieving high interpretability accuracy (AUC up to 0.9694). The results also reveal a link between latent disentanglement and explanation reliability, and they highlight current LMM limitations in geometry and color understanding that constrain the explanations.
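The selection step can be illustrated with a minimal sketch. It assumes that "cross-sample" refers to several responses sampled from the LMM for the same latent dimension, that uncertainty is one minus their mean pairwise similarity, and that token-level Jaccard overlap is an acceptable stand-in for the similarity measure. None of these choices is stated in this summary. The function names and the threshold value are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed stand-in similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def uncertainty(explanations: list) -> float:
    """One minus the mean pairwise similarity across sampled explanations."""
    pairs = list(combinations(explanations, 2))
    if not pairs:
        raise ValueError("need at least two sampled explanations")
    mean_sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim


def select_reliable(samples_by_dim: dict, threshold: float) -> dict:
    """Keep dimensions whose uncertainty is at or below the threshold.

    Returns {dimension: (first sampled explanation, uncertainty)}.
    """
    kept = {}
    for dim, samples in samples_by_dim.items():
        u = uncertainty(samples)
        if u <= threshold:
            kept[dim] = (samples[0], u)
    return kept


if __name__ == "__main__":
    # Hypothetical sampled responses for two latent dimensions.
    samples = {
        0: ["the digit gets thicker"] * 3,
        1: ["shape rotates clockwise", "color turns red", "object moves left"],
    }
    # The threshold 0.5 is illustrative, not a value from the paper.
    print(select_reliable(samples, threshold=0.5))
```

In this toy input, the responses for dimension 0 agree and are kept, while the mutually inconsistent responses for dimension 1 are filtered out.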
Abstract
Learning interpretable representations of the latent factors that generate data is an important topic for the development of artificial intelligence. Large multimodal models, which have recently risen to prominence, can align images with text to generate answers. In this work, we propose a framework to comprehensively explain each latent variable in a generative model using a large multimodal model. We further measure the uncertainty of our generated explanations, quantitatively evaluate the performance of explanation generation among multiple large multimodal models, and qualitatively visualize the variations of each latent variable to study how the disentanglement achieved by different generative models affects the explanations. Finally, we discuss the explanatory capabilities and limitations of state-of-the-art large multimodal models.
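Visualizing the variation of a single latent variable is commonly done with a latent traversal: one coordinate is swept over a range while the others stay fixed, and each resulting vector is decoded into an image. The sketch below assumes that setup. The `decode` callable, the sweep range of -3 to 3, and the step count are hypothetical placeholders, not details taken from the paper.

```python
def linspace(lo: float, hi: float, steps: int) -> list:
    """Evenly spaced values from lo to hi inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (hi - lo) / (steps - 1)
    return [lo + i * step for i in range(steps)]


def traverse_latent(z: list, dim: int, values: list) -> list:
    """Return copies of z with entry `dim` replaced by each value in turn."""
    if not 0 <= dim < len(z):
        raise IndexError("dim is out of range for the latent vector")
    sequence = []
    for v in values:
        z_new = list(z)  # copy so the base vector is left unchanged
        z_new[dim] = v
        sequence.append(z_new)
    return sequence


def image_sequence(decode, z, dim, lo=-3.0, hi=3.0, steps=5):
    """Decode a traversal of one latent dimension into a sequence of outputs.

    `decode` is a hypothetical decoder: any callable mapping a latent vector
    to an image (for example, a trained VAE decoder).
    """
    return [decode(z_i) for z_i in traverse_latent(z, dim, linspace(lo, hi, steps))]


if __name__ == "__main__":
    # A toy decoder that returns the vector itself, to show the traversal.
    for frame in image_sequence(lambda v: v, [0.0, 0.5, -1.0], dim=1):
        print(frame)
```

The decoded sequence for each dimension is what would then be shown to the multimodal model alongside a prompt asking what changes across the frames.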
