Mechanistic understanding and validation of large AI models with SemanticLens

Maximilian Dreyer, Jim Berend, Tobias Labarta, Johanna Vielhaben, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek

TL;DR

SemanticLens is a universal mechanistic interpretability framework that maps the hidden knowledge encoded by neural network components into the semantic space of a foundation model (e.g., CLIP). By embedding concept examples in this shared semantic space and linking them to relevance scores, it enables scalable search, labeling, description, cross-model comparison, and concept audits, connecting knowledge to data and predictions. The approach is demonstrated on vision models, where it supports the discovery of spurious correlations, the evaluation of ABCDE melanoma cues, and targeted model corrections via pruning or retraining; open-source code and a demo are provided. It also introduces computable human-interpretability measures (clarity, similarity, redundancy, polysemanticity) and validates their alignment with human perception through user studies. Overall, SemanticLens offers a scalable means, requiring no human input, to audit and improve the trustworthiness of foundation-model–driven AI systems across domains and to align model reasoning with human expectations and safety standards.

Abstract

Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SemanticLens on https://github.com/jim-berend/semanticlens and a demo on https://semanticlens.hhi-research-insights.eu.
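The mapping and the textual search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, averaging the example embeddings into one vector per component is an assumption made here, and small synthetic vectors stand in for the outputs of a foundation model such as CLIP.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def semantic_representation(example_embeddings):
    """Aggregate one component's concept-example embeddings into one vector.

    Averaging is an assumption of this sketch, not a detail taken from the text.
    """
    return normalize(example_embeddings.mean(axis=0))


def search(query_embedding, thetas, top_k=3):
    """Return the top_k component indices and their cosine similarity to the query."""
    scores = normalize(thetas) @ normalize(query_embedding)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Synthetic stand-ins: in practice the embeddings would come from the image
# and text encoders of a foundation model; here they are toy vectors.
rng = np.random.default_rng(0)
dim, n_components, n_examples = 8, 4, 5
prototypes = np.eye(n_components, dim)  # one hypothetical concept direction per component
noise = 0.05 * rng.normal(size=(n_components, n_examples, dim))
examples = prototypes[:, None, :] + noise  # concept examples scattered around each prototype

thetas = np.stack([semantic_representation(e) for e in examples])
query = prototypes[2]  # stands in for the text embedding of a search query
order, scores = search(query, thetas)
print(order, scores)
```

In this toy setup the component whose examples cluster around the queried direction is ranked first; with real encoders, the query would be a text prompt embedded in the same space as the concept examples.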

Paper Structure

This paper contains 126 sections, 21 equations, 42 figures, 11 tables.

Figures (42)

  • Figure 1: Embedding the model components in an understandable semantic space makes it possible to understand the inner workings of large neural networks systematically and more easily. a) To turn the incomprehensible latent feature space (hidden knowledge) into an understandable representation, we leverage a foundation model $\mathcal{F}$ that serves as a semantic expert. Concretely, for each component of the analysed model $\mathcal{M}$, concept examples $\mathcal{E}$ are extracted from the dataset, representing samples that induce high stimuli (i.e., activate the component), and embedded in the latent space of the foundation model, resulting in a semantic representation $\boldsymbol{\vartheta}$. Further, relevance scores $\mathcal{R}$ for all components are collected, which illustrate their role in decision-making. b) This new understandable model representation (i.e., the set of $\boldsymbol{\vartheta}$'s, potentially linked to $\mathcal{E}$'s and $\mathcal{R}$'s) makes it possible to systematically search, describe, structure, and compare the internal knowledge of AI models. It further allows auditing alignment with human expectations and opens up ways to evaluate and optimize human-interpretability.
  • Figure 2: SemanticLens allows the internal knowledge and inference of neural networks to be understood systematically. a) Via search engine-like queries, one can probe for knowledge referring to, e.g., (racial) biases, data artefacts, or specific knowledge of interest. b) A low-dimensional UMAP projection of the semantic embeddings provides a structured overview of the model's knowledge, where each point corresponds to the encoded concept of a model component. By searching for human-defined concepts, we can add descriptions to all parts of the semantic space. c) Having grouped the knowledge into concepts, attribution graphs reveal where concepts are encoded in the model and how they are utilized (and interconnected) for inference. For predicting Ox, we learn that ox-cart related background concepts are used. Importantly, we can also identify relevant knowledge that could not be labelled and that should be manually inspected by the user. d) The set of unexpected concepts includes Indian person, palm tree, and watermark concepts, which correlate in the dataset with Ox. We can further find other affected output classes, e.g., "butcher shop", "scale" and "ricksha" for the Indian person concept.
  • Figure 3: Using SemanticLens to audit models and check whether their reasoning aligns with human expectation. a) In a first step, a set of valid and spurious concepts is defined via text descriptions, e.g., curved horns (valid) or palm tree (spurious) for "Ox" detection. Afterwards, we check which model components encode spurious concepts, valid concepts, both, or neither. The size of each dot in the chart represents the importance of a component for "Ox" detections. We learn that the ResNet50v2 relies on Indian person, palm tree and cart concepts. Lastly, we can test the model by trying to separate the "Ox" output logits on "Ox" images (from the test dataset) from those on diffusion-based images with spurious features only. When multiple spurious features are present, as for Indian person pulling a cart under palm trees, model outputs become more difficult to separate, indicated by a lower AUC score. b) When auditing the ResNet's alignment to valid concepts for 26 ImageNet classes, we find that in all cases, spurious or background concepts are used.
  • Figure 4: Using SemanticLens to find and correct bugs in medical models that detect melanoma skin cancer. a) The ABCDE-rule is a popular guide for visual melanoma clues. We expect models to learn several concepts corresponding to the ABCDE-rule, as well as other melanoma-unrelated indications (such as regular border) or spurious concepts, including hairs or band-aid. b) In semantic space visualized via a UMAP projection, we can identify valid concepts, such as blue white veil for "melanoma", but also spurious ones such as red skin or ruler. c) When investigating the importance of concepts, we find that red skin or band-aid concepts are strongly used for the "other" (non-melanoma) class. Ruler concepts are also used, with slightly higher relevance for "melanoma". d) We can improve the safety and robustness of our model either by changing the model to remove spurious components or by retraining the model on augmented data. Whereas both approaches lead to improved clean performance, the influence of artefacts is only significantly reduced via retraining.
  • Figure 5: We introduce computable human-interpretability measures that are useful to rate and improve model interpretability: "clarity" for how clear and easy it is to understand the common theme of concept examples, "polysemanticity" for whether multiple distinct semantics are present in the concept examples, "similarity" for the similarity of concepts, and "redundancy" for the degree of redundancy in a set of concepts. a) Our computable measures align with human perception in user studies, resulting in correlation scores above 0.74. Generally, more recent and performant foundation models lead to higher correlation scores. b) Interpretability differs strongly for common pre-trained models. Usually, smaller and less performant convolutional models show lower interpretability. c) We can optimize model interpretability with respect to hyperparameter choices, such as drop-out or activation sparsity regularization during training. Whereas drop-out leads to more redundancy alongside improved clarity of concepts, applying a sparsity loss improves interpretability overall.
  • ...and 37 more figures
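
The concept audit of Figure 3 and the "clarity" measure of Figure 5 can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the paper's definitions: the function names, the similarity threshold of 0.5, and the choice of mean pairwise cosine similarity as a clarity score are all assumptions made here, and one-hot toy vectors stand in for foundation-model embeddings of text descriptions and concept examples.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment(thetas, concept_embeddings):
    """Highest cosine similarity of each component to any concept in the set."""
    return (normalize(thetas) @ normalize(concept_embeddings).T).max(axis=1)


def audit(thetas, relevance, valid, spurious, threshold=0.5):
    """Label each component and report the relevance share of each label.

    The fixed threshold is a hypothetical choice for this sketch.
    """
    is_valid = alignment(thetas, valid) >= threshold
    is_spurious = alignment(thetas, spurious) >= threshold
    labels = np.where(is_valid & is_spurious, "both",
                      np.where(is_valid, "valid",
                               np.where(is_spurious, "spurious", "neither")))
    share = {name: float(relevance[labels == name].sum() / relevance.sum())
             for name in ("valid", "spurious", "both", "neither")}
    return labels, share


def clarity(example_embeddings):
    """Mean pairwise cosine similarity of concept examples (assumed definition).

    Expects at least two examples.
    """
    e = normalize(example_embeddings)
    n = len(e)
    sims = e @ e.T
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


# Toy stand-ins for embedded text descriptions and component representations.
basis = np.eye(6)
valid = basis[[0, 1]]   # e.g. two valid concept descriptions
spurious = basis[[2]]   # e.g. one spurious concept description
thetas = np.stack([basis[0], basis[2], basis[4], basis[0] + basis[2]])
relevance = np.array([0.4, 0.3, 0.1, 0.2])  # hypothetical relevance per component

labels, share = audit(thetas, relevance, valid, spurious)
print(labels.tolist(), share)
print(clarity(np.tile(basis[0], (3, 1))), clarity(basis[:3]))
```

A component that matches both a valid and a spurious description is reported separately as "both", mirroring the four groups named in the Figure 3 caption; the relevance shares play the role of the dot sizes in that chart.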