Looking into Concept Explanation Methods for Diabetic Retinopathy Classification

Andrea M. Storås, Josefine V. Sundgaard

TL;DR

This work investigates and compares two concept-based explanation techniques for explaining deep neural networks developed for automatic diagnosis of diabetic retinopathy: Quantitative Testing with Concept Activation Vectors and Concept Bottleneck Models.

Abstract

Diabetic retinopathy is a common complication of diabetes, and monitoring the progression of retinal abnormalities using fundus imaging is crucial. Because the images must be interpreted by a medical expert, it is infeasible to screen all individuals with diabetes for diabetic retinopathy. Deep learning has shown impressive results for automatic analysis and grading of fundus images. One drawback, however, is the lack of interpretability, which hampers the implementation of such systems in the clinic. Explainable artificial intelligence methods can be applied to explain the deep neural networks. Explanations based on concepts have been shown to be intuitive for humans to understand, but have not yet been explored in detail for diabetic retinopathy grading. This work investigates and compares two concept-based explanation techniques for explaining deep neural networks developed for automatic diagnosis of diabetic retinopathy: Quantitative Testing with Concept Activation Vectors and Concept Bottleneck Models. We found that both methods have strengths and weaknesses, and that the choice of method should take the available data and the end user's preferences into account.

Paper Structure (6 sections, 5 figures, 2 tables)


Figures (5)

  • Figure 1: Example fundus images representing increasing DR severity with segmentation masks of retinal lesions. Level 4 is the most severe type of DR and is associated with a high risk of blindness. Images from the FGADR dataset [Zhou2021FGADR]. Dark blue = microaneurysms, pink = hemorrhages, light blue = hard exudates, green = soft exudates, yellow = intra-retinal microvascular abnormalities, and red = neovascularization. Best viewed with zoom.
  • Figure 2: Schematic representation of a sequential bottleneck model predicting DR level from six concepts. The 'bottleneck layer' consists of the concepts predicted by a deep neural network. The predicted concepts are then provided to a logistic regression model for DR level classification.
  • Figure 3: Upper row: TCAV scores for DR levels 1 to 4, showing the mean and standard deviation over 20 pairs of positive and negative sets for the representative test set. An asterisk ($\ast$) marks concepts that are not statistically significant. Lower row: Fraction of images with concepts predicted as present in the FGADR test set by the CBM. The values are normalized by the total number of images for each level in the test set.
  • Figure 4: Performance metrics for the DR classification task during test time intervention for an increasing number of concepts. Only wrongly predicted concepts were intervened on. Left side: Results for the full FGADR test set. Right side: Results for the misclassified images in the FGADR test set.
  • Figure 5: Test time intervention on selected test images with DR levels 1 (left) and 4 (right), showing how the predicted DR levels change. Inspired by [Koh2020BottleneckModels].
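
The two ideas referred to in the captions for Figures 3 to 5 can be illustrated with a minimal sketch. It is not the authors' implementation. The function `tcav_score` computes the usual TCAV quantity, the fraction of examples whose class output increases along a concept activation vector, from gradients that are assumed to have been extracted from a network beforehand. The function `intervene` replaces wrongly predicted concepts with their ground-truth values, as in test time intervention on a concept bottleneck model. The concept names, the weight matrix `W`, the bias `b`, and the order in which concepts are corrected are all hypothetical values chosen for illustration; in the sequential model of Figure 2 the weights would come from a fitted logistic regression.

```python
import numpy as np

# Six lesion concepts, following the legend of Figure 1 (names are illustrative).
CONCEPTS = ["microaneurysms", "hemorrhages", "hard_exudates",
            "soft_exudates", "irma", "neovascularization"]

def tcav_score(layer_gradients, cav):
    """Fraction of examples whose class output increases along the CAV.

    layer_gradients: (n_examples, n_features) gradients of the class logit
        with respect to a layer's activations (assumed precomputed).
    cav: (n_features,) concept activation vector for the same layer.
    """
    directional_derivatives = layer_gradients @ cav
    return float(np.mean(directional_derivatives > 0))

# Hypothetical stand-in for a fitted logistic regression on concepts:
# one row of weights per DR level (0 to 4), one column per concept.
W = np.array([
    [-2.0, -2.0, -2.0, -2.0, -2.0, -2.0],  # level 0
    [ 2.0, -1.0, -1.0, -1.0, -2.0, -2.0],  # level 1
    [ 1.0,  1.5,  1.0,  0.5, -1.0, -2.0],  # level 2
    [ 0.5,  1.0,  0.5,  1.0,  2.0, -2.0],  # level 3
    [ 0.0,  0.5,  0.0,  0.5,  1.0,  4.0],  # level 4
])
b = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

def predict_level(concepts):
    """Return the DR level with the highest linear score for binary concepts."""
    return int(np.argmax(W @ concepts + b))

def intervene(predicted_concepts, true_concepts, max_corrections):
    """Correct up to max_corrections wrongly predicted concepts.

    Only concepts that disagree with the ground truth are changed.
    Correcting in index order is an assumption made for this sketch.
    """
    corrected = predicted_concepts.copy()
    wrong = np.flatnonzero(predicted_concepts != true_concepts)
    for idx in wrong[:max_corrections]:
        corrected[idx] = true_concepts[idx]
    return corrected

# Toy example: the concept predictor missed neovascularization.
true_concepts = np.array([1, 1, 0, 0, 0, 1])
predicted_concepts = np.array([1, 1, 0, 0, 0, 0])

level_before = predict_level(predicted_concepts)
level_after = predict_level(intervene(predicted_concepts, true_concepts, 1))

# Toy gradients for the TCAV score; real values would come from the network.
toy_gradients = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [0.5, -3.0]])
toy_cav = np.array([1.0, 0.0])
toy_score = tcav_score(toy_gradients, toy_cav)
```

With these made-up weights, correcting the single missed concept moves the predicted level from a moderate grade to the most severe one, which is the kind of change Figure 5 visualizes for individual images. Increasing `max_corrections` corresponds to the horizontal axis of Figure 4.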