Interpretable Aneurysm Classification via 3D Concept Bottleneck Models: Integrating Morphological and Hemodynamic Clinical Features

Toqa Khaled, Ahmad Al-Kabbany

TL;DR

An end-to-end 3D Concept Bottleneck framework that maps high-dimensional neuroimaging features to a discrete set of morphological and hemodynamic concepts for aneurysm identification proves that high predictive performance can be achieved without sacrificing interpretability.

Abstract

We address the challenge of reliably classifying and assessing intracranial aneurysms with deep learning without compromising clinical transparency. While traditional black-box models achieve high predictive accuracy, their lack of inherent interpretability remains a significant barrier to clinical adoption and regulatory approval. Explainability is paramount in medical modeling to ensure that AI-driven diagnoses align with established neurosurgical principles. Unlike traditional eXplainable AI (XAI) methods such as saliency maps, which often provide post-hoc, non-causal visual correlations, Concept Bottleneck Models (CBMs) constrain the model's internal logic to human-understandable clinical indices. In this article, we propose an end-to-end 3D Concept Bottleneck framework that maps high-dimensional neuroimaging features to a discrete set of morphological and hemodynamic concepts for aneurysm identification. We implemented this pipeline using a pre-trained 3D ResNet-34 backbone and a 3D DenseNet-121 to extract features from CTA volumes; the features were then passed through a soft bottleneck layer representing human-interpretable clinical concepts. The model was optimized with a joint loss that balances a diagnostic focal loss against a concept mean squared error (MSE), and it was validated via stratified five-fold cross-validation. Our results show a peak task classification accuracy of 93.33% +/- 4.5% for the ResNet-34 architecture and 91.43% +/- 5.8% for the DenseNet-121 model. Furthermore, 8-pass Test-Time Augmentation (TTA) yielded a mean accuracy of 88.31%, supporting diagnostic stability during inference. With an accuracy-generalization gap of less than 0.04, the framework demonstrates that high predictive performance can be achieved without sacrificing interpretability.
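The joint objective described in the abstract can be illustrated with a minimal sketch. It assumes a binary diagnosis (aneurysm vs. control), the commonly used focal-loss form with `gamma` and `alpha` hyperparameters, and a hypothetical `concept_weight` that trades off the two terms. None of these values or names come from the paper, whose exact formulation may differ.

```python
import math


def focal_loss(prob, label, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one sample.

    prob  : predicted probability of the positive class
    label : ground-truth class, 0 or 1
    gamma, alpha : assumed defaults, not taken from the paper
    """
    prob = min(max(prob, eps), 1.0 - eps)           # avoid log(0)
    p_t = prob if label == 1 else 1.0 - prob        # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def concept_mse(predicted, target):
    """Mean squared error between predicted and annotated concept values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def joint_loss(prob, label, concepts_pred, concepts_true, concept_weight=1.0):
    """Task focal loss plus weighted concept MSE (the weight is hypothetical)."""
    task_term = focal_loss(prob, label)
    concept_term = concept_mse(concepts_pred, concepts_true)
    return task_term + concept_weight * concept_term


# Toy example: one positive case with three made-up concept values.
loss = joint_loss(0.8, 1, [0.4, 1.1, 0.2], [0.5, 1.0, 0.0])
```

The focal term down-weights samples that are already classified confidently, while the MSE term keeps the bottleneck outputs close to the annotated clinical indices; how the paper weights the two terms is not stated in this excerpt.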

Paper Structure (18 sections, 1 equation, 6 figures, 1 table)

This paper contains 18 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Schematic of the 3D Soft Concept Bottleneck pipeline. Volumetric CTA data are processed through a 3D ResNet-34 or DenseNet-121 backbone to extract latent embeddings. A dedicated head predicts clinical concepts, which are concatenated with visual features to provide an interpretable, multimodal diagnostic output.
  • Figure 2: Multi-level 3D augmentation strategy. Panel A shows standard training transformations for real samples. Panel B illustrates high-magnitude regularization used exclusively for oversampled synthetic controls to prevent memorization. Panel C depicts the 8-pass test-time augmentation (TTA) ensemble used during inference to stabilize diagnostic probability.
  • Figure 3: Soft Concept Bottleneck Architecture. The model extracts a 512-dimensional latent embedding ($z$), which is processed by a Concept Head to predict clinical indices ($c$). The final diagnosis is derived from the concatenation ($z \oplus c$), utilizing both high-level visual features and interpretable clinical reasoning.
  • Figure 4: Aggregate 5-fold cross-validation learning curves. Panel (a) highlights the "unfreezing shock" and subsequent recovery in the fine-tuned ResNet-34 model. Panel (b) shows the more traditional convergence pattern of the DenseNet-121. Shaded areas represent the standard deviation across all folds.
  • Figure 5: Comprehensive diagnostic performance comparison. Top row (a--c) displays standard single-pass inference for the Overfit-Fix, Merged, and DenseNet-121 models. Bottom row (d--f) shows the corresponding 8-pass TTA results. Raw counts demonstrate that while standard inference optimizes sensitivity (up to $97.8\%$), TTA improves specificity and ensures diagnostic stability across geometric variations.
  • ...and 1 more figure
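
The data flow in the Figure 3 caption (embedding $z$, concept head producing $c$, diagnosis from $z \oplus c$) can be sketched as a toy forward pass. This is a plain-Python illustration under stated assumptions, not the authors' implementation: the 512-dimensional embedding comes from the caption, while `NUM_CONCEPTS`, the single-layer heads, the unbounded concept outputs, the binary sigmoid output, and all function names are hypothetical.

```python
import math
import random

EMBED_DIM = 512     # latent size stated in the Figure 3 caption
NUM_CONCEPTS = 6    # hypothetical: the concept count is not given in this excerpt


def init_linear(n_out, n_in, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Compute W x + b for a single input vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def soft_cbm_forward(z, concept_head, task_head):
    """Predict concepts c from z, then diagnose from the concatenation of z and c."""
    c = linear(z, concept_head)            # concept regression outputs (assumed unbounded)
    zc = z + c                             # list concatenation: length EMBED_DIM + NUM_CONCEPTS
    logit = linear(zc, task_head)[0]       # single logit for an assumed binary diagnosis
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return c, prob


rng = random.Random(0)
concept_head = init_linear(NUM_CONCEPTS, EMBED_DIM, rng)
task_head = init_linear(1, EMBED_DIM + NUM_CONCEPTS, rng)

# Stand-in for the embedding a 3D backbone would produce from a CTA volume.
z = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
concepts, aneurysm_prob = soft_cbm_forward(z, concept_head, task_head)
```

Because the task head sees both $z$ and $c$, the bottleneck is "soft": the concepts inform the diagnosis and can be inspected, but the visual features are not discarded as they would be in a strict concept bottleneck.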