Mechanistic understanding and validation of large AI models with SemanticLens
Maximilian Dreyer, Jim Berend, Tobias Labarta, Johanna Vielhaben, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
TL;DR
SemanticLens is a universal mechanistic interpretability framework that maps the hidden knowledge encoded by neural network components into the semantic space of a foundation model (e.g., CLIP). By embedding concept examples and relevance scores into a shared semantic space, it enables scalable search, labeling, description, cross-model comparison, and concept audits, linking knowledge to data and predictions. The approach is demonstrated on vision models, where it supports the discovery of spurious correlations, checks of adherence to the ABCDE rule in melanoma classification, and targeted model corrections via pruning or retraining; open-source code and a demo are available. The work advances transparent verification of large AI models and provides practical tools for aligning model reasoning with human expectations and safety standards. It also introduces computable human-interpretability measures (clarity, similarity, redundancy, polysemanticity) and validates their alignment with human perception through user studies. Overall, SemanticLens offers a scalable way, requiring no human input, to audit and improve the trustworthiness of foundation-model-driven AI systems across domains.
Abstract
Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SemanticLens at https://github.com/jim-berend/semanticlens and a demo at https://semanticlens.hhi-research-insights.eu.
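
To make operation (i) concrete, the following is a minimal sketch of textual search over neurons, assuming each neuron is represented by the mean of the foundation-model embeddings of its concept examples and that a text query is compared to those vectors by cosine similarity. It does not use the SemanticLens library or CLIP itself: the function names (`neuron_embeddings`, `search_neurons`) are hypothetical, and the embeddings are toy vectors standing in for real image and text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def neuron_embeddings(example_embeddings):
    """Collapse each neuron's example embeddings into one unit vector.

    example_embeddings has shape (n_neurons, n_examples, dim). Averaging is an
    assumption of this sketch, not necessarily the paper's exact aggregation.
    """
    return l2_normalize(l2_normalize(example_embeddings).mean(axis=1))


def search_neurons(query_embedding, neuron_vecs, top_k=3):
    """Rank neurons by cosine similarity to a query embedding."""
    scores = neuron_vecs @ l2_normalize(query_embedding)
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy stand-in data: 4 neurons, 5 concept examples each, 8-dim embeddings.
# In practice these would come from a foundation model's image encoder.
rng = np.random.default_rng(0)
dim = 8
concept_dirs = np.eye(4, dim)  # one orthogonal "concept" direction per neuron
examples = concept_dirs[:, None, :] + 0.05 * rng.normal(size=(4, 5, dim))
vecs = neuron_embeddings(examples)

# Stand-in for the text embedding of a query describing neuron 2's concept.
query = concept_dirs[2]
hits = search_neurons(query, vecs, top_k=2)
print(hits)  # neuron 2 ranks first
```

Because both neurons and queries live in the same normalized space, the same dot product can be reused for the other operations the abstract lists, such as comparing two models' neuron sets or matching neurons against a list of expected concepts in an audit.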
