Prototype-based Interpretable Breast Cancer Prediction Models: Analysis and Challenges

Shreyasi Pathak, Jörg Schlötterer, Jeroen Veltman, Jeroen Geerdink, Maurice van Keulen, Christin Seifert

TL;DR

The paper tackles the challenge of deploying interpretable AI in breast cancer prediction by introducing a Prototype Evaluation Framework for Coherence (PEF-C) that quantitatively assesses prototype quality using domain knowledge. It applies three state-of-the-art prototype-based models (ProtoPNet, BRAIxProtoPNet++, PIP-Net) to the mammography datasets CBIS-DDSM, CMMD, and VinDr-Mammo and compares them against black-box baselines. The findings show that prototype-based methods are competitive with black-box models in classification performance and score higher at detecting regions of interest (ROIs), but the learned prototypes often fall short in relevance, purity, and variety, leaving substantial room for improvement. The authors advocate for systematic prototype evaluation in high-stakes medical decisions and discuss future directions, including user studies, visualization enhancements, and seamless clinical workflow integration.

Abstract

Deep learning models have achieved high performance in medical applications; however, their adoption in clinical practice is hindered by their black-box nature. Self-explainable models, like prototype-based models, can be especially beneficial as they are interpretable by design. However, if the learnt prototypes are of low quality, then the prototype-based models are no better than black-box models. Having high-quality prototypes is a prerequisite for a truly interpretable model. In this work, we propose a prototype evaluation framework for coherence (PEF-C) for quantitatively evaluating the quality of the prototypes based on domain knowledge. We show the use of PEF-C in the context of breast cancer prediction using mammography. Existing works on prototype-based models for breast cancer prediction using mammography have focused on improving the classification performance of prototype-based models compared to black-box models and have evaluated prototype quality through anecdotal evidence. We are the first to go beyond anecdotal evidence and evaluate the quality of the mammography prototypes systematically using our PEF-C. Specifically, we apply three state-of-the-art prototype-based models, ProtoPNet, BRAIxProtoPNet++ and PIP-Net, to mammography images for breast cancer prediction and evaluate these models w.r.t. i) classification performance, and ii) quality of the prototypes, on three public datasets. Our results show that prototype-based models are competitive with black-box models in terms of classification performance, and achieve a higher score in detecting ROIs. However, the quality of the prototypes is not yet sufficient and can be improved in the aspects of relevance, purity and learning a variety of prototypes. We call on the XAI community to systematically evaluate the quality of the prototypes to check their true usability in high-stakes decisions and improve such models further.

Paper Structure (17 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Global visualization of the 3 prototype-based models trained on the CBIS-DDSM dataset. Each row represents one prototype, visualized with its top-10 activated image patches from the training set. Example of a prototype description: the second row in ProtoPNet shows a mass ROI with irregular shape and spiculated margin.
  • Figure 2: Global visualization of the 3 prototype-based models trained on the CMMD dataset. Each row represents one prototype, visualized with its top-10 activated image patches from the training set. Example of a prototype description: the fifth row in ProtoPNet shows a calcification abnormality.
  • Figure 3: Local explanation from ProtoPNet showing the top-3 activated prototypes for the malignant class and the benign class. Example image: CMMD, malignant test case D2-0249, view RCC, predicted class malignant.
  • Figure 4: Local explanation from BRAIxProtoPNet++ showing the top-3 activated prototypes for the malignant class and the benign class. Example image: CMMD, malignant test case D2-0249, view RCC, predicted class malignant.
  • Figure 5: Local explanation from PIP-Net showing the top-3 activated prototypes for the malignant class and the benign class. Example image: CMMD, malignant test case D2-0249, view RCC, predicted class malignant.
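
The captions above describe prototypes visualized by their top-10 activated training patches (Figures 1 and 2) and test images explained by their top-3 activated prototypes (Figures 3 to 5). Both amount to ranking items by an activation score and keeping the first k. The following is a minimal sketch of that ranking step only, not the authors' implementation; the function name `top_k_indices` and the example scores are hypothetical, and it assumes the activation scores have already been computed by a model.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int = 10) -> List[int]:
    """Return the indices of the k highest scores, best first.

    `scores` is a hypothetical list of activation values, e.g. one value
    per training patch for a single prototype (global view), or one value
    per prototype for a single test image (local view).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort indices by their score in descending order; ties keep input order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up activation values for illustration only.
patch_scores = [0.12, 0.87, 0.45, 0.91, 0.30]
top_patches = top_k_indices(patch_scores, k=3)
```

With k=10 over training patches this would give the rows of a global visualization; with k=3 over the prototypes of one class it would give the entries of a local explanation.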