Enhancing Interpretability of Vertebrae Fracture Grading using Human-interpretable Prototypes
Poulami Sinhamahapatra, Suprosanna Shit, Anjany Sekuboyina, Malek Husseini, David Schinz, Nicolas Lenhart, Joern Menze, Jan Kirschke, Karsten Roscher, Stephan Guennemann
TL;DR
The paper tackles interpretable vertebral fracture grading with limited data by introducing ProtoVerse, an interpretable-by-design (IBD) prototype-learning framework built on a CNN backbone and a learnable prototype layer. It addresses data scarcity and class imbalance with a novel Prototype Diversity Loss and a Median-Weighted Cross-Entropy loss, producing diverse, class-consistent prototypes that locally explain decisions. On VerSe'19, ProtoVerse outperforms ProtoPNet and non-IBD baselines, with greater intra-class prototype diversity and more precise, clinically relevant visual explanations, as validated by expert radiologists. The work demonstrates the practical potential of human-interpretable deep learning (DL) in medical imaging, improving transparency and trust in DL-assisted vertebral fracture grading, and points to future enhancements via human-in-the-loop prototype management and broader dataset coverage.
Abstract
Vertebral fracture grading classifies the severity of vertebral fractures, a challenging task in medical imaging that has recently been approached with Deep Learning (DL) models. Only a few works have attempted to make such models human-interpretable, despite the need for transparency and trustworthiness in critical use cases like DL-assisted medical diagnosis. Moreover, these works rely either on post-hoc methods or on additional annotations. In this work, we propose a novel interpretable-by-design method, ProtoVerse, to find relevant sub-parts of vertebral fractures (prototypes) that reliably explain the model's decision in a human-understandable way. Specifically, we introduce a novel diversity-promoting loss to mitigate prototype repetitions in small datasets with intricate semantics. We experimented on the VerSe'19 dataset and outperformed the existing prototype-based method. Further, our model provides superior interpretability compared with the post-hoc method. Importantly, expert radiologists validated the visual interpretability of our results, showing clinical applicability.
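
The class imbalance mentioned in the TL;DR is often handled by reweighting the cross-entropy loss per class. The sketch below shows one common scheme, median-frequency balancing, where each class weight is the median class count divided by that class's count. Whether ProtoVerse's Median-Weighted Cross-Entropy uses exactly this form is an assumption; the function names and the toy labels are hypothetical and are not taken from the paper or from VerSe'19.

```python
import math
from collections import Counter
from statistics import median


def median_frequency_weights(labels):
    """Return {class: median(count) / count} for the classes present in labels.

    Assumed weighting scheme (median-frequency balancing); rare classes
    receive weights above 1, frequent classes receive weights below 1.
    """
    counts = Counter(labels)
    med = median(counts.values())
    return {c: med / n for c, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights."""
    total = 0.0
    norm = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])
        norm += w
    return total / norm


# Hypothetical, imbalanced toy labels (illustrative only).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = median_frequency_weights(labels)

# Uniform predictions over three classes for every sample.
uniform_probs = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]
loss = weighted_cross_entropy(uniform_probs, labels, weights)
```

With these toy labels the class counts are 6, 3, and 1, so the rarest class is weighted six times more heavily than the most frequent one. Normalizing by the sum of weights keeps the loss scale comparable to unweighted cross-entropy.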
