Explaining Deep Learning for ECG Analysis: Building Blocks for Auditing and Knowledge Discovery

Patrick Wagner, Temesgen Mehari, Wilhelm Haverkamp, Nils Strodthoff

TL;DR

This work tackles the transparency gap in DL-based ECG analysis by introducing a dual XAI framework that combines local attributions with global concept explanations, augmented by beat- and segment-level glocal aggregation for dataset-wide auditing and knowledge discovery. Using PTB-XL, two CNNs (LeNet and XResNet) are trained for multi-label classification, and a battery of post-hoc explanations (GradCAM, Saliency, IG, LRP) is evaluated alongside sanity checks based on ECG parameter regression. The glocal analyses reveal attribution patterns that align with established cardiology criteria for LVH, CLBBB, and MI, while global TCAV analyses confirm consistent exploitation of expert concepts across models. Through clustering aligned attributions, the study uncovers MI subtypes and ASMI subgroups, demonstrating XAI’s potential for auditing and hypothesis generation in clinical ECG analysis.

Abstract

Deep neural networks have become increasingly popular for analyzing ECG data because of their ability to accurately identify cardiac conditions and hidden clinical factors. However, the lack of transparency due to the black box nature of these models is a common concern. To address this issue, explainable AI (XAI) methods can be employed. In this study, we present a comprehensive analysis of post-hoc XAI methods, investigating the local (attributions per sample) and global (based on domain expert concepts) perspectives. We have established a set of sanity checks to identify sensible attribution methods, and we provide quantitative evidence in accordance with expert rules. This dataset-wide analysis goes beyond anecdotal evidence by aggregating data across patient subgroups. Furthermore, we demonstrate how these XAI techniques can be utilized for knowledge discovery, such as identifying subtypes of myocardial infarction. We believe that these proposed methods can serve as building blocks for a complementary assessment of the internal validity during a certification process, as well as for knowledge discovery in the field of ECG analysis.

Paper Structure

This paper contains 27 sections, 10 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Conceptual summary of the XAI study for ECG. We discuss two different ways of investigating consistent model behavior: (1) aggregating local attribution maps across entire patient groups into so-called glocal attribution maps, which can also be used effectively for knowledge discovery, and (2) using a global XAI method to verify whether cardiologists' expert concepts are consistently exploited.
  • Figure 3: Results of the three sanity-check experiments: P-wave amplitude (XResNet and LeNet panels), R-peak amplitude (XResNet and LeNet panels), and T-wave amplitude (LeNet panel). Each subplot is organized in the same way. On the left, we show spatial specificities, with the attribution methods color-coded. In the lower right plot, we show temporal specificities, again with the attribution methods color-coded. These properties are computed for all samples, and their attributions are concatenated to allow for analysis across the whole dataset. For spatial specificity, we use boxplots with the leads on the x-axis and the specificity on the y-axis. For temporal specificity, we use continuous line plots with time on the x-axis and the temporal specificity on the y-axis. In the upper right plot, we provide a median beat as a reference for localizing time steps in the temporal specificity plot. We compute the median beat attributions across the whole dataset and scale them to have a norm of one. If only the segment in the lead under consideration were relevant, the spatial specificity would be one and the temporal specificity would be strongly peaked around the corresponding segment. In terms of spatial specificity, saliency shows the highest specificity among the limb leads, with a comparably small variance. In terms of temporal specificity, all methods attribute more relevance to the QRS complex (comprising the Q-, R-, and S-peaks) than to the interval in question, which calls their validity into question.
  • Figure 4: Results of the glocal (dataset-wide) analysis using saliency as the attribution method for an XResNet model. We consider five classes: (1) NORM, normal ECG as reference; (2) LVH, left ventricular hypertrophy; (3) CLBBB, complete left bundle branch block; (4) IMI, (prior) inferior myocardial infarction; and (5) AMI, (prior) anterior myocardial infarction. For each class, we aggregate a median beat over the top 100 predictions and show, below each plot, the mean of the ground truth labels (gray bars) compared to the mean prediction (red bars) (see LVH for interdependencies with _ISC). On top of each plot, we visualize the absolute attributions color-coded, where deep red indicates high attribution for the respective diagnosis (e.g., the QS-complex in V1 and V2 is highly relevant for AMI). At the bottom, we show the relevance distribution broken down by ECG segment, with the seven segments with the highest relevance per segment length marked, which allows for quantitative statements about the relevance distribution. These show good agreement with the relevant segments used in decision rules from the clinical literature.
  • Figure 5: Concept-based analysis: Investigating which of five concepts (from top to bottom: QRS-CLBBB, SLI-LVH, QWAVES-MI, AGE>75, SEX=FEMALE) are used for the prediction of a certain class in LeNet (left) vs. XResNet (right). Within a block (i.e., a particular concept-model combination to be tested), rows denote different layers in the model and columns represent the different output classes. Each cell is color-coded according to the corresponding mean TCAV score, indicating whether the concept is used for or against the class under consideration. Stars mark cases where the confidence interval for the TCAV score is sufficiently narrow and does not overlap with 0.5, i.e., cases where concepts are consistently exploited (see the text description for details). The numbers in brackets are the respective CAV accuracies, which describe how well a concept can be linearly separated; blocks with insufficient accuracy are grayed out.
  • Figure 6: Results of the knowledge discovery experiments. t-distributed stochastic neighbor embeddings (t-SNE), each with default parameters, of three different representations extracted from an XResNet model: the input median beats, the hidden features after global pooling, and the saliency attributions. In each subplot, the ground truth labels are color-coded on the left and the resulting clustering on the right. The plots reveal that (aligned) attributions are the most effective input representation for subclass discovery.
  • ...and 8 more figures
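
The glocal aggregation described in the captions of Figures 3 and 4 (a median over aligned per-beat attribution maps, scaled to unit norm) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `glocal_attribution` and `relevance_per_lead`, the array layout `(n_beats, n_leads, n_timesteps)`, the use of absolute values before aggregation, and the synthetic toy data are all hypothetical choices made for this example. Beat alignment and resampling are assumed to have been done beforehand.

```python
import numpy as np


def glocal_attribution(beat_attributions):
    """Aggregate aligned per-beat attribution maps into one unit-norm median map.

    beat_attributions: array-like of shape (n_beats, n_leads, n_timesteps).
    Assumption: beats are already aligned and resampled to a common length.
    """
    # Assumption: aggregate absolute attributions, as visualized in Figure 4.
    attributions = np.abs(np.asarray(beat_attributions, dtype=float))
    # Median across beats gives one (n_leads, n_timesteps) map.
    median_map = np.median(attributions, axis=0)
    # Scale to unit (Frobenius) norm; leave an all-zero map unchanged.
    norm = np.linalg.norm(median_map)
    return median_map / norm if norm > 0 else median_map


def relevance_per_lead(glocal_map):
    """Return each lead's share of the total relevance in a glocal map."""
    per_lead = glocal_map.sum(axis=1)
    total = per_lead.sum()
    return per_lead / total if total > 0 else per_lead


# Synthetic toy data: 100 beats, 12 leads, 200 time steps of small noise,
# plus a strong attribution in a hypothetical lead index 6, time steps 90-109.
rng = np.random.default_rng(0)
toy = rng.normal(scale=0.01, size=(100, 12, 200))
toy[:, 6, 90:110] += 1.0

glocal = glocal_attribution(toy)
shares = relevance_per_lead(glocal)
print("lead with the largest relevance share:", int(np.argmax(shares)))
```

Using the median rather than the mean keeps the aggregated map robust to individual beats with outlying attributions, which fits the dataset-wide reading of the glocal maps; a per-segment breakdown as in Figure 4 would additionally require segment boundaries for each beat.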