
Multi-Faceted Multimodal Monosemanticity

Hanqi Yan, Xiangxiang Cui, Lu Yin, Paul Pu Liang, Yulan He, Yifei Wang

TL;DR

This work probes how multimodal signals are represented in large vision-language models by introducing the Modality Dominance Score (MDS) and a pipeline to extract monosemantic multimodal features. It introduces two interpretable modules, Multimodal SAE and Multimodal NCL, to obtain sparse, interpretable embeddings, and then classifies features into ImgD, TextD, and CrossD, separating modality-specific from cross-modal representations. Quantitative and qualitative evaluations show improved monosemanticity with these tools and reveal modality-specific patterns that match human intuition, enabling downstream tasks such as gender-bias analysis, adversarial defense, and modality-aware text-to-image control. The study offers a scalable interpretability toolkit for multimodal models and sheds light on how different modalities are embedded and manipulated, with implications for bias, robustness, and controllable generation.

Abstract

Humans experience the world through multiple modalities, such as vision, language, and speech, making it natural to explore the commonalities and distinctions among them. In this work, we take a data-driven approach to address this question by analyzing interpretable, monosemantic features extracted from deep multimodal models. Specifically, we investigate CLIP, a prominent visual-language representation model trained on massive image-text pairs. Building on prior research in single-modal interpretability, we develop a set of multimodal interpretability tools and measures designed to disentangle and analyze features learned by CLIP. In particular, we introduce the Modality Dominance Score (MDS) to attribute each CLIP feature to a specific modality. We then map CLIP features into a more interpretable space, enabling us to categorize them into three distinct classes: vision features (single-modal), language features (single-modal), and visual-language features (cross-modal). Interestingly, this data-driven categorization closely aligns with human intuition about the different modalities. We further show that this modality decomposition can benefit multiple downstream tasks, including reducing bias in gender detection, generating cross-modal adversarial examples, and enabling modality-specific feature control in text-to-image generation. These results indicate that large-scale multimodal models, when equipped with task-agnostic interpretability tools, can offer valuable insights into the relationships between different data modalities.
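
The abstract does not give the formula for the Modality Dominance Score, so the following is only a minimal sketch of one plausible form, not the authors' definition. It assumes that each feature has an activation on the image and on the text of every image-text pair, that the score is the average share of activation coming from the image side, and that features are split into ImgD, TextD, and CrossD with a mean-plus-or-minus-one-standard-deviation threshold. The function names, the use of absolute activations, and the threshold rule are all hypothetical.

```python
import numpy as np

def modality_dominance_score(img_acts, txt_acts):
    """Hypothetical MDS: mean image share of each feature's activation.

    img_acts, txt_acts: arrays of shape (n_pairs, n_features) holding the
    activation of every feature on the image and text of each pair.
    Returns one score per feature in [0, 1]; 1 means image-only.
    """
    img = np.abs(np.asarray(img_acts, dtype=float))
    txt = np.abs(np.asarray(txt_acts, dtype=float))
    total = img + txt
    # Assumption: a pair where the feature is silent on both sides is neutral (0.5).
    ratio = np.full_like(total, 0.5)
    np.divide(img, total, out=ratio, where=total > 0)
    return ratio.mean(axis=0)

def categorize(scores, width=1.0):
    """Assumed rule: outside mean +/- width * std is single-modal, else cross-modal."""
    scores = np.asarray(scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    labels = np.full(scores.shape, "CrossD", dtype=object)
    labels[scores > mu + width * sigma] = "ImgD"
    labels[scores < mu - width * sigma] = "TextD"
    return labels

# Toy data: feature 0 fires only on images, feature 1 only on texts,
# feature 2 equally on both.
img_acts = [[1.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
txt_acts = [[0.0, 1.0, 1.0], [0.0, 2.0, 1.0]]
scores = modality_dominance_score(img_acts, txt_acts)
labels = categorize(scores)
```

On this toy input the three features land in the image-dominant, text-dominant, and cross-modal classes respectively; with real model activations the threshold width would need to be chosen with care.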

Paper Structure

This paper contains 37 sections, 7 equations, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: Modality Dominance Score (MDS) distributions of three feature categories for different VLMs.
  • Figure 2: Monosemanticity for four VLMs.
  • Figure 3: Modality-specific monosemanticity.
  • Figure 4: Activated images and texts (in Table) by ImgD. Top image row (feature 647): patterns and textures. Bottom image (feature 667): water and aquatic themes in blue. Texts in blue align with visual concepts.
  • Figure 5: Activated images and texts (in Table) by TextD. Top image row (feature 34): couples and individuals in red attire. Bottom image row (feature 242): diverse objects. Text in blue aligns with visual concepts.
  • ...and 9 more figures