B-cos Alignment for Inherently Interpretable CNNs and Vision Transformers

Moritz Böhle, Navdeeppal Singh, Mario Fritz, Bernt Schiele

TL;DR

The paper addresses the challenge of interpretable deep learning by introducing the B-cos transformation, a weight–input alignment operator that replaces standard linear units. Because a sequence of B-cos layers collapses to a single input-dependent linear map, the approach yields faithful explanations that directly reflect learned discriminative patterns, while retaining competitive vision performance across CNNs and Vision Transformers. The authors report both qualitative and quantitative gains in interpretability, including high-quality model-inherent explanations and strong localization scores, and provide a bias-free normalization framework so that the linear summary stays faithful. This work offers a practical pathway to building inherently interpretable vision models that keep accuracy similar to that of their conventional counterparts on ImageNet, which can help foster trust in DNN decisions in safety-critical settings.

Abstract

We present a new direction for increasing the interpretability of deep neural networks (DNNs) by promoting weight-input alignment during training. For this, we propose to replace the linear transformations in DNNs by our novel B-cos transformation. As we show, a sequence (network) of such transformations induces a single linear transformation that faithfully summarises the full model computations. Moreover, the B-cos transformation is designed such that the weights align with relevant signals during optimisation. As a result, those induced linear transformations become highly interpretable and highlight task-relevant features. Importantly, the B-cos transformation is designed to be compatible with existing architectures, and we show that it can easily be integrated into virtually all of the latest state-of-the-art models for computer vision (e.g. ResNets, DenseNets, ConvNeXt models, as well as Vision Transformers) by combining the B-cos-based explanations with normalisation and attention layers, all whilst maintaining similar accuracy on ImageNet. Finally, we show that the resulting explanations are of high visual quality and perform well under quantitative interpretability metrics.
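
To make the "single linear transformation" claim concrete, the following NumPy sketch stacks two B-cos layers and checks that their output equals one input-dependent matrix applied to the input. The abstract does not state the layer formula, so the form used here is an assumption: each unit computes $|\cos(\mathbf{x}, \hat{\mathbf{w}})|^{\text{B}-1} \cdot \hat{\mathbf{w}}^\top \mathbf{x}$ with a unit-norm weight $\hat{\mathbf{w}}$. The function name `bcos_layer`, the layer sizes, and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def bcos_layer(x, W, B=2.0, eps=1e-12):
    """One B-cos layer (assumed form); returns (output, effective matrix W(x))."""
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm weight rows
    lin = W_hat @ x                                       # plain linear response
    cos = lin / (np.linalg.norm(x) + eps)                 # cosine between x and each row
    scale = np.abs(cos) ** (B - 1.0)                      # alignment-dependent scaling
    W_eff = scale[:, None] * W_hat                        # input-dependent linear map
    return scale * lin, W_eff

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # toy input
W1 = rng.normal(size=(6, 8))      # hypothetical layer weights
W2 = rng.normal(size=(3, 6))

a1, W1_eff = bcos_layer(x, W1)
a2, W2_eff = bcos_layer(a1, W2)

# Collapse both layers into one matrix W_{1->L}(x) for this particular input.
W_collapsed = W2_eff @ W1_eff
assert np.allclose(W_collapsed @ x, a2)

# Row c of the collapsed matrix gives per-input-dimension contributions
# that sum exactly to the logit of class c.
c = int(np.argmax(a2))
contrib = W_collapsed[c] * x
assert np.isclose(contrib.sum(), a2[c])
```

Under this assumed form, setting `B=1.0` removes the scaling and leaves an ordinary linear layer with normalised weights, so the collapsed matrix no longer depends on the input.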

Paper Structure (18 sections, 23 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Top: Inputs $\mathbf{x}_i$ to a B-cos DenseNet-121. Bottom: B-cos explanation for class $c$ ($c$: image label). Specifically, we visualise the $c$-th row of $\mathbf{W}_{1\rightarrow L}(\mathbf{x}_i)$ as applied by the model, see \ref{eq:collapse}; no masking of the original image is used. For the last 2 images, we also show the explanation for the 2nd most likely class. For details on visualising $\mathbf{W}_{1\rightarrow L}(\mathbf{x}_i)$, see \ref{sec:experiments}.
  • Figure 2: Col. 2: BCE loss for different angles of $\mathbf{w}$ for B-cos classifiers (\ref{eq:bcos}) with different values of B (rows) for two classification problems. Cols. 1+3: Visualisation of the classification problems and the corresponding optimal weights (arrows) per B. For $\text{B}=1$ (first row) the weights $\mathbf{w}$ represent the decision boundary of a linear classifier. Although the red cluster is the same in both cases, the optimal weight vectors differ significantly (compare within row). In contrast, for higher values of B the weights converge to the same optimum in both tasks (see last row). The opacity of the red shading shows the strength of the positive activation of the B-cos transformation for a sample at a given position.
  • Figure 3: 2$\times$2 pointing game example. Column 1: input image. Columns 2--5: explanations for individual class logits.
  • Figure 4: Col. 1: Input images. Cols. 2-6: Explanations for different classes $c$ (top: 'horse'; bottom: 'car') of models trained with increasing B. For higher B, the model-inherent linear explanations $[\mathbf{W}_{1\rightarrow l}]_c$ increasingly align with discriminative input patterns, thus becoming more interpretable.
  • Figure 5: Accuracy (crosses) and localisation (box plots) results for a B-cos network trained with different B. Larger values of B reduce accuracy but significantly improve localisation.
  • ...and 9 more figures
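
The localisation results mentioned in the captions for Figures 3 and 5 can be illustrated with a small scoring sketch. The captions do not define the metric, so the definition below is an assumption: the score is the fraction of positive attribution that falls inside the grid cell containing the explained class. The function name `grid_localisation` and the toy attribution map are hypothetical.

```python
import numpy as np

def grid_localisation(attr, cell, grid=2):
    """Fraction of positive attribution inside one cell of a grid x grid image."""
    pos = np.clip(attr, 0.0, None)                  # keep positive evidence only
    h, w = attr.shape[0] // grid, attr.shape[1] // grid
    r, c = divmod(cell, grid)                       # row-major cell index
    inside = pos[r * h:(r + 1) * h, c * w:(c + 1) * w].sum()
    total = pos.sum()
    return float(inside / total) if total > 0 else 0.0

# Toy 4x4 attribution map for a 2x2 grid of images.
attr = np.zeros((4, 4))
attr[0:2, 2:4] = 1.0    # positive attribution in the top-right cell (index 1)
attr[2:4, 0:2] = -1.0   # negative attribution is ignored by the score

score_hit = grid_localisation(attr, cell=1)    # explained class sits in cell 1
score_miss = grid_localisation(attr, cell=2)   # wrong cell for comparison
```

With this definition a score of 1 means all positive attribution lies in the correct cell, and a score near 0.25 on a 2$\times$2 grid would correspond to attribution spread uniformly over the four images.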