
Understanding Multimodal Deep Neural Networks: A Concept Selection View

Chenming Shang, Hengyuan Zhang, Hao Wen, Yujiu Yang

TL;DR

This work tackles the interpretability of multimodal DNNs, particularly CLIP, by replacing expert-defined concepts with automatically discovered ones. It introduces a two-stage Concept Selection Model (CSM) that first greedily identifies head concepts via variance-based rough selection and then refines them with a learnable mask to obtain core concepts for a linear classifier, all without human priors. The approach reveals a long-tail distribution of concepts and demonstrates that a compact set of core concepts can achieve accuracy comparable to end-to-end black-box models while enabling model debugging and human-centered evaluation. This yields more transparent decision-making in multimodal models and offers a pathway for robust, data-driven interpretability and few-shot improvements.

Abstract

Multimodal deep neural networks, represented by CLIP, have enabled a rich set of downstream applications owing to their excellent performance, which makes understanding CLIP's decision-making process an essential research topic. Because of its complex structure and massive pre-training data, CLIP is often regarded as a black-box model that is too difficult to understand and interpret. Concept-based models map the black-box visual representations extracted by deep neural networks onto a set of human-understandable concepts and use those concepts to make predictions, which makes the decision-making process more transparent. However, these methods rely on datasets labeled with fine-grained attributes by experts, which incurs high costs and introduces excessive human prior knowledge and bias. In this paper, we observe that concepts follow a long-tail distribution, and on this basis we propose a two-stage Concept Selection Model (CSM) to mine core concepts without introducing any human priors. A greedy rough selection algorithm first extracts head concepts, and a concept mask fine selection method then extracts core concepts from them. Experiments show that our approach achieves performance comparable to end-to-end black-box models, and human evaluation demonstrates that the concepts discovered by our method are interpretable and comprehensible to humans.
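
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes concept scores are already available as an images-by-concepts matrix (for example, CLIP image-text similarities), it reduces the greedy rough selection to a plain top-k by variance, and it models the fine selection as a sigmoid mask trained jointly with a linear softmax classifier under an L1 penalty. The names `rough_select` and `fine_select` and all hyperparameters are hypothetical.

```python
import numpy as np


def rough_select(concept_scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k concepts whose scores vary most across images.

    Assumption: "head" concepts are simply the k highest-variance columns.
    The paper's greedy procedure may apply further criteria.
    """
    variances = concept_scores.var(axis=0)
    return np.argsort(variances)[::-1][:k]


def fine_select(scores, labels, n_classes, steps=200, lr=0.5, l1=1e-3, seed=0):
    """Jointly fit a sigmoid mask over concepts and a linear classifier.

    Assumption: the loss is mean cross-entropy plus l1 * sum(mask), minimised
    with plain gradient descent. Returns (mask, weights).
    """
    rng = np.random.default_rng(seed)
    n, d = scores.shape
    W = rng.normal(scale=0.01, size=(d, n_classes))  # linear classifier
    mask_logits = np.zeros(d)                        # mask starts at 0.5
    Y = np.eye(n_classes)[labels]                    # one-hot targets

    for _ in range(steps):
        mask = 1.0 / (1.0 + np.exp(-mask_logits))
        Z = scores * mask                            # masked concept scores
        logits = Z @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)

        G = (P - Y) / n                              # dLoss/dlogits
        grad_W = Z.T @ G
        grad_mask = ((G @ W.T) * scores).sum(axis=0) + l1
        grad_mask_logits = grad_mask * mask * (1.0 - mask)

        W -= lr * grad_W
        mask_logits -= lr * grad_mask_logits

    return 1.0 / (1.0 + np.exp(-mask_logits)), W
```

In this sketch, concepts whose mask value stays high after training would play the role of core concepts, and the classifier weights on those concepts can be inspected directly, which is the kind of transparency the abstract refers to.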

Paper Structure (18 sections, 2 equations, 6 figures, 3 tables, 1 algorithm)


Figures (6)

  • Figure 1: Left: the distribution of the sorted concept variances. Middle: the Spearman correlation coefficients of concept variances between any two datasets. Right: the number of concepts shared in the top 1000 concepts with the highest variances between any two datasets.
  • Figure 2: Top: In the Visual Genome dataset, each image is accompanied by corresponding textual descriptions, which are transformed into scene graphs, with each word represented as an atomic node. Bottom: Two-stage concept selection model: a rough selection obtains the head concepts from the concept library, and a fine selection then identifies the core concepts among the head concepts.
  • Figure 3: The variation in accuracy of the CSM as the concept quantity increases on CIFAR-10 and CIFAR-100.
  • Figure 4: Accuracy comparison between CSM and linear probing. The x-axis indicates the number of labeled images per class.
  • Figure 5: The decision process of the concept-based models.
  • ...and 1 more figure
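
The two cross-dataset statistics named in the Figure 1 caption (Spearman correlation of concept variances, and the number of concepts shared among the highest-variance ones) can be computed as in the sketch below. It assumes each dataset is summarised by a vector of per-concept variances over the same concept library. The function names are hypothetical, and the rank computation does not handle ties.

```python
import numpy as np


def ranks(x: np.ndarray) -> np.ndarray:
    """Rank of each element (0 = smallest). Ties are not averaged."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r


def spearman(var_a: np.ndarray, var_b: np.ndarray) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(ranks(var_a), ranks(var_b))[0, 1])


def topk_overlap(var_a: np.ndarray, var_b: np.ndarray, k: int) -> int:
    """Number of concepts that are in the top-k by variance in both datasets."""
    top_a = set(np.argsort(var_a)[-k:].tolist())
    top_b = set(np.argsort(var_b)[-k:].tolist())
    return len(top_a & top_b)
```

With `k=1000`, `topk_overlap` corresponds to the quantity described for the right panel of Figure 1, under the assumptions stated above.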