Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology

Mina Jamshidi Idaji; Julius Hense; Tom Neuhäuser; Augustin Krause; Yanqing Luo; Oliver Eberle; Thomas Schnake; Laure Ciernik; Farnoush Rezaei Jafari; Reza Vahidimajd; Jonas Dippel; Christoph Walz; Frederick Klauschen; Andreas Mock; Klaus-Robert Müller

TL;DR

This work provides a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and showcases the discovery of distinct model strategies for predicting human papillomavirus infection from head and neck cancer slides.

Abstract

Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where a large number of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet the validity of these heatmaps has rarely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) we provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) we showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: https://github.com/bifold-pathomics/xMIL/tree/xmil-journal

Paper Structure (64 sections, 19 equations, 14 figures, 3 tables)

Introduction
Background: Multiple instance learning (MIL)
MIL formulations
MIL pipelines
Patch encoder
Aggregation
Task head
Methods
Overview
Explaining multiple instance learning
Explanation methods
Task-specific explanations
Classification
Regression
Survival
...and 49 more sections

Figures (14)

Figure 1: Overview of our explanation pipeline for multiple instance learning (MIL). (a) Block diagram of a general MIL approach to make predictions directly from whole slide images, and to obtain heatmaps to interpret the model predictions. The explanation methods produce patch-wise scores by explaining (only) the task head and the aggregation model, but not the patch encoder. In this work, we assess six explanation methods in the context of two patch encoders, three aggregation model architectures, and three types of prediction heads. (b) Composition of the task heads. In the survival head, the discrete-time model predicts hazard probabilities $\{h_k\}$ for survival time intervals $k=1, \ldots, K$, and subsequently aggregates them into a patient risk score. See Section \ref{supp:sec:survival_preliminaries} for details. $\sigma$ = sigmoid function; FNN = feedforward neural network.
Figure 2: Overview of the evaluation framework for comparing heatmaps and explanation methods in multiple instance learning (MIL).
Figure 3: The patch flipping procedure. Patches are sorted by their heatmap scores in either ascending or descending order. They are then grouped into 100 mini-bags. The mini-bags are removed progressively from the main bag of patches, and the remaining patches are fed into the model. Two perturbation curves (model prediction vs. dropped fraction) are formed for the two orderings. A larger area under the curve (AUC) is desired for the ascending (Asc.) curve, whereas a smaller AUC is better for the descending (Desc.) curve. Therefore, the area between the two curves, called Symmetric Relevance Gain (SRG), quantifies the faithfulness of the heatmap. A larger SRG indicates a more faithful heatmap.
Figure 4: Datasets and model performances. The bar plots in each panel show the mean test performance over the cross-validation folds with the error bars depicting the standard deviations (see Tables \ref{supp:tab:model_performances_virchow} and \ref{supp:tab:model_performances_uni} for numerical model performance results). For the classification tasks in panel (a), the ratio of the cases in each class is depicted. In panels (b) and (c), the histograms of the endpoint targets are shown.
Figure 5: Comparison of explanation methods for two example settings. (a/b) Faithfulness comparison from three complementary perspectives: (Left) The distribution of Symmetric Relevance Gain (SRG) faithfulness scores per explanation method computed for each slide in the test set. (Center) Wilcoxon's signed-rank test effect sizes comparing SRG scores between explanation methods for each slide in the test set. Colors and marker sizes indicate effect size magnitude as shown on the colorbar. Triangle, square, and circle markers denote negligible ($< 0.2$), weak to moderate ($0.2-0.5$), and moderate to strong ($\geq 0.5$) effect sizes, respectively. These effect size scores assess the magnitude of the faithfulness differences between explanation methods. (Right) Mean Rank Score (MRS) per explanation method (vertical axis). A lower average MRS indicates a better position in the faithfulness ranking. (c/d) Heatmaps for example slides from the two experimental settings. For LRP, values greater/lower than zero indicate evidence for/against the model prediction (visualized by shades of red/blue). Single and Attention (Attn) have no natural boundary between positive and negative evidence, but only provide an ordering (visualized by shades of red). In the HRD regression task, however, the $0$ value for Single may be interpreted as such a boundary; therefore, we visualize this heatmap using a red/blue color scheme as well.
...and 9 more figures
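
The patch flipping procedure described in the Figure 3 caption can be sketched in a few lines. The following is a minimal illustration and not the authors' implementation (see the linked repository for that). The names `perturbation_curve`, `symmetric_relevance_gain`, and `model_fn` are hypothetical, the toy "model" simply sums one feature over the bag, and stopping before the bag becomes empty is an assumption made here so that the model always receives at least one patch.

```python
import numpy as np


def perturbation_curve(model_fn, patches, order, n_bags=100):
    """Remove patches mini-bag by mini-bag in the given order and record
    the model prediction on the remaining patches after each removal."""
    n = len(patches)
    bags = np.array_split(order, min(n_bags, n))
    keep = np.ones(n, dtype=bool)
    fractions = [0.0]
    preds = [model_fn(patches[keep])]
    # Assumption: the last mini-bag is never removed, so the bag is never empty.
    for bag in bags[:-1]:
        keep[bag] = False
        fractions.append(1.0 - keep.sum() / n)
        preds.append(model_fn(patches[keep]))
    return np.array(fractions), np.array(preds)


def _auc(x, y):
    """Area under a curve via the trapezoidal rule."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))


def symmetric_relevance_gain(model_fn, patches, scores, n_bags=100):
    """Area between the ascending and descending perturbation curves."""
    asc = np.argsort(scores)   # least relevant patches are dropped first
    desc = asc[::-1]           # most relevant patches are dropped first
    x_a, y_a = perturbation_curve(model_fn, patches, asc, n_bags)
    x_d, y_d = perturbation_curve(model_fn, patches, desc, n_bags)
    return _auc(x_a, y_a) - _auc(x_d, y_d)


# Toy setup (hypothetical): each patch has one feature, and the model
# prediction is the sum of that feature over the bag.
rng = np.random.default_rng(0)
contributions = rng.normal(size=500)
patches = contributions[:, None]


def model_fn(bag):
    return float(bag[:, 0].sum())


srg_faithful = symmetric_relevance_gain(model_fn, patches, contributions)
srg_inverted = symmetric_relevance_gain(model_fn, patches, -contributions)
```

In this toy setting, a heatmap equal to the true per-patch contributions yields a positive SRG, while the sign-inverted heatmap yields a negative one, matching the caption's reading that a larger SRG indicates a more faithful heatmap.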
