Benchmarking the Influence of Pre-training on Explanation Performance in MR Image Classification

Marta Oliveira; Rick Wilming; Benedict Clark; Céline Budding; Fabian Eitel; Kerstin Ritter; Stefan Haufe

TL;DR

This work introduces a realistic synthetic MRI benchmark, built by overlaying artificial lesions with known ground-truth masks on real brain scans, that enables objective evaluation of explanation quality across XAI methods. It systematically investigates how pre-training (within-domain MRI vs. out-of-domain ImageNet) and layer-wise fine-tuning affect both classification accuracy and explanation performance, using a VGG-16 backbone and multiple explanation methods from Captum. The study finds a strong correlation between classifier performance and explanation quality, and reports that within-domain (MRI) pre-training yields better explanations when accuracy is matched, as well as more stable explanations across a range of accuracies. The benchmark provides a principled framework for validating XAI methods in medical imaging and highlights the need to account for the training regime when interpreting explanation heat maps in high-stakes settings.
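To make the benchmark idea concrete, here is a minimal sketch of how a lesion mask could be composed with a background slice so that the ground truth is known by construction. The alpha-blend rule, the `intensity` and `alpha` parameters, and the function name `add_lesions` are illustrative assumptions; the summary above does not state the composition rule actually used in the paper.

```python
import numpy as np

def add_lesions(background, lesion_mask, intensity=1.0, alpha=0.5):
    """Blend lesions into a background slice; return (image, ground_truth).

    ASSUMPTION: the alpha-blend and the default parameter values are
    illustrative only and are not taken from the paper.
    """
    B = np.asarray(background, dtype=float)
    L = np.asarray(lesion_mask, dtype=float)
    if B.shape != L.shape:
        raise ValueError("background and lesion mask must have the same shape")
    # Every non-zero mask pixel belongs to a lesion and is part of the ground truth.
    ground_truth = L > 0
    image = B.copy()
    # Mix the background with a constant lesion intensity inside the mask only.
    image[ground_truth] = (1.0 - alpha) * B[ground_truth] + alpha * intensity
    return image, ground_truth
```

Because the lesions are the only class-relevant signal added to the image, the returned boolean mask can serve directly as the reference against which a heat map is scored.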

Abstract

Convolutional Neural Networks (CNNs) are frequently and successfully used in medical prediction tasks. They are often used in combination with transfer learning, leading to improved performance when training data for the task are scarce. The resulting models are highly complex and typically do not provide any insight into their predictive mechanisms, motivating the field of "explainable" artificial intelligence (XAI). However, previous studies have rarely quantitatively evaluated the "explanation performance" of XAI methods against ground-truth data, and the influence of transfer learning on objective measures of explanation performance has not been investigated. Here, we propose a benchmark dataset that allows for quantifying explanation performance in a realistic magnetic resonance imaging (MRI) classification task. We employ this benchmark to understand the influence of transfer learning on the quality of explanations. Experimental results show that popular XAI methods applied to the same underlying model differ vastly in performance, even when considering only correctly classified examples. We further observe that explanation performance strongly depends on the task used for pre-training and the number of CNN layers pre-trained. These results hold after correcting for a substantial correlation between explanation and classification performance.
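As an illustration of what "quantifying explanation performance against ground truth" can look like, the sketch below scores a heat map by precision: it takes the k pixels with the largest absolute importance, where k is the number of ground-truth lesion pixels, and reports the fraction that fall inside the lesion mask. The choice of k, the use of absolute values, and the name `top_k_precision` are assumptions made for this example, not a statement of the paper's exact metric.

```python
import numpy as np

def top_k_precision(heatmap, mask):
    """Fraction of the k highest-|score| pixels that lie inside the mask.

    ASSUMPTION: k equals the number of ground-truth pixels and importance
    is ranked by absolute value; both choices are illustrative.
    """
    mask = np.asarray(mask, dtype=bool)
    scores = np.abs(np.asarray(heatmap, dtype=float)).ravel()
    if scores.size != mask.size:
        raise ValueError("heatmap and mask must have the same number of pixels")
    k = int(mask.sum())
    if k == 0:
        raise ValueError("ground-truth mask is empty")
    # Indices of the k largest scores (unordered; ties are broken arbitrarily).
    top = np.argpartition(scores, -k)[-k:]
    return float(mask.ravel()[top].sum()) / k
```

A heat map that concentrates all of its importance on the lesions scores 1.0, and one that places its highest values entirely outside them scores 0.0, which makes scores comparable across XAI methods and training regimes.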

Paper Structure (15 sections, 2 equations, 4 figures)

Introduction
Related Work
Methods
Data Generation
Pre-training
Fine-tuning
XAI methods
Explanation performance
Baselines
Experiments
Results
Qualitative analysis of explanations
Quantitative analysis of explanation performance
Discussion
Conclusion

Figures (4)

Figure 1: Example images from the created dataset. The top row shows axial MRI slices from the Human Connectome Project (HCP) healthy brain dataset, with artificial lesions added. The bottom row shows the same images with the lesions contoured in blue, forming the ground truth for an explanation.
Figure 2: Example of the lesion creation process. (A) depicts an original axial MRI slice from the Human Connectome Project, denoted as background image $B$. (B) shows an example lesion mask $L$ composed of several lesions. (C) shows the ground-truth explanations used for the XAI performance study.
Figure 3: Examples of heat maps representing importance scores attributed to individual inputs by popular XAI methods for several degrees of fine-tuning of the VGG-16 architecture. The models were selected to achieve maximal test validation accuracy. Each row corresponds to an XAI method, whereas each column corresponds to a different degree of fine-tuning from 1 convolution block (1conv) to the entire network (all). The image is divided into two vertical blocks, where importance maps obtained from models pre-trained with ImageNet data are depicted on the left, and importance maps obtained from models pre-trained with MR images are depicted on the right.
Figure 4: Quantitative explanation performance for XAI methods applied to the five best models, with different degrees of fine-tuning. The blue line and boxplots correspond, respectively, to the classification performance (accuracy) and explanation performance (precision) derived from models pre-trained with MRI data, whereas the pink line and boxplots correspond analogously to classification and explanation performance for models pre-trained with ImageNet data. The other three boxplots correspond to the performance of baseline heat maps. Yellow and orange correspond to the Sobel and Laplace filters, respectively, and red to the model with random weights.
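
The Sobel and Laplace baselines named in the Figure 4 caption are model-agnostic: they produce a heat map from the image alone, so any XAI method worth using should outperform them. A minimal NumPy sketch of such baselines follows. The edge padding, the use of the gradient magnitude for Sobel, the absolute value for Laplace, and all function names are assumptions made for this example; the kernels themselves are the standard 3x3 Sobel and 4-neighbour Laplace kernels.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T
LAPLACE = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def filter3x3(image, kernel):
    """Cross-correlate a 2-D image with a 3x3 kernel (edge padding, same size)."""
    img = np.pad(np.asarray(image, dtype=float), 1, mode="edge")
    h, w = img.shape[0] - 2, img.shape[1] - 2
    out = np.zeros((h, w))
    # Accumulate the nine shifted, weighted copies of the padded image.
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h, dx:dx + w]
    return out

def sobel_heatmap(image):
    """Baseline heat map: Sobel gradient magnitude (illustrative choice)."""
    return np.hypot(filter3x3(image, SOBEL_X), filter3x3(image, SOBEL_Y))

def laplace_heatmap(image):
    """Baseline heat map: absolute Laplace response (illustrative choice)."""
    return np.abs(filter3x3(image, LAPLACE))
```

Both functions return an array with the same shape as the input slice, so their output can be scored against the ground-truth lesion mask in exactly the same way as a heat map from an XAI method.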
