Enhancing Interpretability of Vertebrae Fracture Grading using Human-interpretable Prototypes

Poulami Sinhamahapatra, Suprosanna Shit, Anjany Sekuboyina, Malek Husseini, David Schinz, Nicolas Lenhart, Joern Menze, Jan Kirschke, Karsten Roscher, Stephan Guennemann

TL;DR

The paper tackles interpretable vertebral fracture grading with limited data by introducing ProtoVerse, an interpretable-by-design (IBD) prototype-learning framework built on a CNN backbone and a learnable prototype layer. It addresses data scarcity and class imbalance with a novel Prototype Diversity Loss and a Median-Weighted Cross-Entropy (MWCE) loss, producing diverse, class-consistent prototypes that locally explain decisions. ProtoVerse outperforms ProtoPNet and non-IBD baselines on VerSe'19, offering greater intra-class prototype diversity and more precise, clinically relevant visual explanations, as validated by expert radiologists. The work demonstrates the practical potential of human-interpretable DL in medical imaging, emphasizing improved transparency and trust in DL-assisted vertebral fracture grading, and points to future enhancements via human-in-the-loop prototype management and broader dataset coverage.
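To make the class-imbalance idea concrete, here is a minimal sketch of a median-weighted cross-entropy. The summary does not give the paper's exact formula, so this assumes the common median-frequency scheme, where each class weight is the median class count divided by that class's count. The function names and the class counts are hypothetical and chosen only for illustration.

```python
import math
from statistics import median


def median_class_weights(class_counts):
    """Assumed weighting: median class count divided by each class's count."""
    med = median(class_counts.values())
    return {cls: med / n for cls, n in class_counts.items()}


def weighted_cross_entropy(logits, target, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    # Numerically stable log-sum-exp over the class logits.
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in logits.values()))
    return -weights[target] * (logits[target] - log_z)


# Hypothetical, imbalanced class counts (not taken from VerSe'19).
counts = {"healthy": 800, "G2": 100, "G3": 50}
weights = median_class_weights(counts)

# A sample of the rare G3 class with uninformative (all-equal) logits.
uniform_logits = {"healthy": 0.0, "G2": 0.0, "G3": 0.0}
loss_g3 = weighted_cross_entropy(uniform_logits, "G3", weights)
loss_healthy = weighted_cross_entropy(uniform_logits, "healthy", weights)
```

Under this assumption, rare classes get weights above 1 and frequent classes get weights below 1, so a misclassified G3 sample contributes more to the loss than a misclassified healthy sample.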

Abstract

Vertebral fracture grading, which classifies the severity of vertebral fractures, is a challenging task in medical imaging that has recently attracted Deep Learning (DL) models. Only a few works have attempted to make such models human-interpretable, despite the need for transparency and trustworthiness in critical use cases like DL-assisted medical diagnosis. Moreover, such models rely on either post-hoc methods or additional annotations. In this work, we propose a novel interpretable-by-design method, ProtoVerse, to find relevant sub-parts of vertebral fractures (prototypes) that reliably explain the model's decision in a human-understandable way. Specifically, we introduce a novel diversity-promoting loss to mitigate prototype repetitions in small datasets with intricate semantics. We experimented with the VerSe'19 dataset and outperformed the existing prototype-based method. Further, our model provides superior interpretability compared with the post-hoc method. Importantly, expert radiologists validated the visual interpretability of our results, showing clinical applicability.

Paper Structure (20 sections, 5 equations, 8 figures, 6 tables)

This paper contains 20 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Prototype-activated regions (yellow boxes) on the fractured vertebra of a typical test sample provide human-interpretable reasoning for its fracture grade G3.
  • Figure 2: ProtoVerse architecture for learning prototypes for VCF grading. Prototypes from each class are colour-coded: healthy (blue), G2 (red), and G3 (green). They learn representative image parts for each class through separation and clustering losses. For example, healthy prototypes emphasise straight vertebral edges, while G2 and G3 prototypes capture the different degrees of deformity in vertebrae. Notably, our novel Diversity loss ensures the capture of visual variations within a class, such as $\boldsymbol{p}_{3}^1$ and $\boldsymbol{p}_{3}^2$ highlighting different fracture regions in G3. Given a G3 input test sample, the prototype patches $\boldsymbol{p}_{3}^1$ and $\boldsymbol{p}_{3}^2$ belonging to G3 show the strongest presence (similarity scores $2.027$ and $2.092$) in various fracture regions of the vertebrae of interest, while the prototype parts $\boldsymbol{p}_{2}^1$ and $\boldsymbol{p}_{1}^1$, belonging to the G2 and healthy classes, have lower similarity scores. Final classification logits are trained with our MWCE loss to mitigate class imbalance.
  • Figure 3: Cosine similarity between prototype vectors obtained from ProtoPNet and our ProtoVerse. Note that cosine similarity within a class is relatively high in the case of ProtoPNet, indicating difficulties in encompassing diverse prototypes. ProtoVerse achieves a diverse set of prototypes, as reflected in cosine similarity scores. Note that the third healthy prototype in both cases is looking into the trachea, which results in totally different features than vertebrae.
  • Figure 4: Qualitative comparison of prototypes learnt by the ProtoPNet and ProtoVerse models. Note that ProtoPNet produces repetitive prototypes whereas ProtoVerse captures diverse prototypes.
  • Figure 5: Test-time interpretability of ProtoVerse (left) in comparison to post-hoc baselines (right) for two typical G2 fracture samples. We show the top $3$ closest prototypes (col. $3$) based on similarity score and the corresponding heatmaps (col. $4$). Firstly, the similarity heatmap of ProtoVerse is localised to the fracture regions of the vertebrae (col. $2$). In contrast, post-hoc baselines fail to precisely localise clinically important regions in the vertebral body and focus on the wrong vertebra. Secondly, the top $3$ prototypes visually explain why the highest activated region is important to grade the fracture type correctly. Note that all $3$ prototypes show positive class connections and belong to the same class.
  • ...and 3 more figures
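
The prototype comparison in Figure 3 rests on pairwise cosine similarity between prototype vectors. The sketch below shows how such a similarity matrix, and a simple within-class repetition score, could be computed. The toy 2-D prototype vectors, the labels, and the helper names are hypothetical, and the averaged score is an illustrative summary rather than the paper's Prototype Diversity Loss.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(prototypes):
    """Full pairwise cosine-similarity matrix, as visualised in a heatmap."""
    return [[cosine_similarity(p, q) for q in prototypes] for p in prototypes]


def mean_intra_class_similarity(prototypes, labels):
    """Average similarity over distinct prototype pairs that share a class."""
    total, count = 0.0, 0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            if labels[i] == labels[j]:
                total += cosine_similarity(prototypes[i], prototypes[j])
                count += 1
    return total / count if count else 0.0


# Hypothetical toy prototypes: two per class.
labels = ["G2", "G2", "G3", "G3"]
repetitive = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]  # near-duplicates
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]     # orthogonal pairs

score_repetitive = mean_intra_class_similarity(repetitive, labels)
score_diverse = mean_intra_class_similarity(diverse, labels)
matrix = similarity_matrix(repetitive)
```

A high within-class score indicates near-duplicate prototypes, the behaviour the caption attributes to ProtoPNet, while a low score indicates that prototypes of the same class point in different feature directions.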