Understanding Multimodal Deep Neural Networks: A Concept Selection View
Chenming Shang, Hengyuan Zhang, Hao Wen, Yujiu Yang
TL;DR
This work tackles the interpretability of multimodal DNNs, particularly CLIP, by replacing expert-defined concepts with automatically discovered ones. It introduces a two-stage Concept Selection Model (CSM) that first greedily identifies head concepts via variance-based rough selection and then refines them with a learnable mask to obtain core concepts for a linear classifier, all without human priors. The approach reveals a long-tail distribution of concepts and demonstrates that a compact set of core concepts can achieve accuracy comparable to end-to-end black-box models while enabling model debugging and human-centered evaluation. This yields more transparent decision-making in multimodal models and offers a pathway for robust, data-driven interpretability and few-shot improvements.
Abstract
Multimodal deep neural networks, represented by CLIP, have enabled a wide range of downstream applications owing to their excellent performance, which makes understanding CLIP's decision-making process an essential research topic. Because of its complex structure and massive pre-training data, CLIP is often regarded as a black-box model that is too difficult to understand and interpret. Concept-based models map the black-box visual representations extracted by deep neural networks onto a set of human-understandable concepts and use these concepts to make predictions, enhancing the transparency of the decision-making process. However, these methods rely on datasets labeled with fine-grained attributes by experts, which incurs high costs and introduces excessive human prior knowledge and bias. In this paper, we observe that concepts follow a long-tail distribution, and on this basis we propose a two-stage Concept Selection Model (CSM) to mine core concepts without introducing any human priors. A greedy rough selection algorithm first extracts head concepts, and a concept mask fine selection method then extracts the core concepts from them. Experiments show that our approach achieves performance comparable to end-to-end black-box models, and human evaluation demonstrates that the concepts discovered by our method are interpretable and comprehensible to humans.
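To make the two-stage idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes image-to-concept similarity scores are already available (for example, from CLIP image and text embeddings). It reads "rough selection" as keeping the concepts with the highest score variance, and "mask fine selection" as a sigmoid-gated mask trained jointly with a binary logistic classifier under a sparsity penalty. The function names, the toy data, the binary setting, and every hyperparameter are illustrative assumptions.

```python
import math


def rough_select(scores, k):
    """Stage 1 sketch: keep the k concepts with the largest score variance.

    scores[i][j] is the similarity of image i to concept j. Ranking by
    variance and keeping the top k is an assumed reading of "rough selection".
    """
    n = len(scores)
    variances = []
    for j in range(len(scores[0])):
        column = [row[j] for row in scores]
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fine_select(x, y, epochs=300, lr=0.5, sparsity=0.01):
    """Stage 2 sketch: learn a soft concept mask with a linear classifier.

    Assumed form: logit = b + sum_j sigmoid(theta_j) * w_j * x_j, trained by
    full-batch gradient descent on logistic loss plus sparsity * sum(mask).
    Returns (mask, weights, bias).
    """
    n, d = len(x), len(x[0])
    theta = [0.0] * d  # mask logits; sigmoid(0) = 0.5 at the start
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        mask = [sigmoid(t) for t in theta]
        grad_w = [0.0] * d
        grad_theta = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(x, y):
            z = b + sum(mask[j] * w[j] * xi[j] for j in range(d))
            err = sigmoid(z) - yi  # derivative of logistic loss w.r.t. z
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * mask[j] * xi[j] / n
                grad_theta[j] += err * w[j] * xi[j] * mask[j] * (1.0 - mask[j]) / n
        for j in range(d):
            # The sparsity term pushes every mask entry toward zero.
            grad_theta[j] += sparsity * mask[j] * (1.0 - mask[j])
            w[j] -= lr * grad_w[j]
            theta[j] -= lr * grad_theta[j]
        b -= lr * grad_b
    return [sigmoid(t) for t in theta], w, b


# Toy data (hypothetical): concept 0 tracks the label, concept 1 is constant,
# concept 2 varies strongly but is unrelated to the label, concept 3 barely varies.
scores = [
    [1.0, 0.5, 2.0, 0.1],
    [1.0, 0.5, -2.0, -0.1],
    [1.0, 0.5, 2.0, 0.1],
    [1.0, 0.5, -2.0, -0.1],
    [-1.0, 0.5, 2.0, 0.1],
    [-1.0, 0.5, -2.0, -0.1],
    [-1.0, 0.5, 2.0, 0.1],
    [-1.0, 0.5, -2.0, -0.1],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

head = rough_select(scores, k=2)                       # head concepts
head_scores = [[row[j] for j in head] for row in scores]
mask, weights, bias = fine_select(head_scores, labels)
core = [head[j] for j, m in enumerate(mask) if m > 0.5]  # core concepts
print("head:", head, "core:", core)
```

In this toy setup, variance alone cannot tell a discriminative concept from a noisy one, so the rough stage keeps both. The mask stage is what separates them, because the sparsity penalty shrinks the gate of any concept the classifier does not use.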
