SE3D: A Framework For Saliency Method Evaluation In 3D Imaging

Mariusz Wiśniewski, Loris Giulivi, Giacomo Boracchi

TL;DR

This work tackles the lack of quantitative benchmarks for explaining 3D CNNs in medical and real-world imaging. It introduces SE3D, a framework built on modified 3D datasets (ShapeNet, ScanNet, BraTS) and novel metrics for Weakly Supervised Object Localization (WSOL) and Weakly Supervised Semantic Segmentation (WSSS), used to evaluate both 3D-specific saliency methods and 2D methods extended to 3D. The empirical study finds that although 3D-specific saliency methods like Saliency Tubes and Respond-CAM often outperform 2D extensions, all methods show substantial localization gaps in 3D, with notable weaknesses on medical imaging data; this signals a clear need for new 3D explainability techniques. SE3D lays the groundwork for safer deployment of 3D CNNs by providing standardized evaluation and a path for improving 3D explanations and WSOL/WSSS solutions on volumetric data.

Abstract

For more than a decade, deep learning models have dominated a wide range of 2D imaging tasks. Their application is now extending to 3D imaging, where 3D Convolutional Neural Networks (3D CNNs) can process LIDAR, MRI, and CT scans, with significant implications for fields such as autonomous driving and medical imaging. In these critical settings, explaining the model's decisions is fundamental. Despite recent advances in Explainable Artificial Intelligence, however, little effort has been devoted to explaining 3D CNNs, and many works explain these models via inadequate extensions of 2D saliency methods. A fundamental limitation to the development of 3D saliency methods is the lack of a benchmark to assess them quantitatively on 3D data. To address this issue, we propose SE3D: a framework for Saliency method Evaluation in 3D imaging. We propose modifications to the ShapeNet, ScanNet, and BraTS datasets, along with evaluation metrics to assess saliency methods for 3D CNNs. We evaluate both state-of-the-art saliency methods designed for 3D data and extensions of popular 2D saliency methods to 3D. Our experiments show that 3D saliency methods do not provide explanations of sufficient quality, and that there is room for future improvements toward safer applications of 3D CNNs in critical fields.

Paper Structure

This paper contains 9 sections, 10 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Example saliency map for a tumor classification model. We highlight the 2D/3D bounding boxes (center) and pixel/voxel-wise segmentations (right) of the ground truth tumor and saliency map. The saliency map is a heatmap indicating the input regions that are most relevant to the model's output. For 2D data, evaluation is typically done by measuring WSOL and WSSS performance on localization and segmentation datasets (top). In this work, we propose a framework to extend the evaluation to 3D data and models (bottom).
  • Figure 2: Samples for the proposed ShapeNet and ScanNet variants. We display the sample/segmentation mask and the environment voxels.
  • Figure 3: Generation of a shapenet-pairs sample. The sample $\mathbf{x}$ is obtained by juxtaposing a ShapeNet sample $\mathbf{x}_C$ belonging to either $\lambda_1$ or $\lambda_2$ with a sample $\mathbf{x}_N$ belonging to $\Lambda \setminus \{\lambda_1, \lambda_2\}$.
  • Figure 4: Generation of brats-halves. All BraTS samples contain tumors (highlighted in white). The hemispheres, however, could be devoid of tumor after splitting, and are labelled tumor (T) or no tumor (NT) accordingly.
  • Figure 5: Extensions to 3D data of the three metrics proposed by [WSOLRight]. We visualize the computation of 2D and 3D metrics using an example 3D scan from BraTS taken as a whole volume (top) or as a series of slices (bottom). Both Max3DBoxAcc and Max3DBoxAccV2 compare the IoU between the bounding boxes of the ground truth and of the saliency map. However, Max3DBoxAcc only considers the largest connected component, while Max3DBoxAccV2 matches each bounding box from the prediction to a box in the ground truth. In panels (c) and (d), only the blue box matches the green ground-truth box; the blue-gray box does not. VxAP compares the ground truth (GT) and the saliency map in a voxel-wise fashion.
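
The box-level and voxel-level comparisons described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`bounding_box_3d`, `box_iou_3d`, `voxel_precision_recall`), the single fixed threshold, and the toy volume are assumptions made for illustration. The sketch also takes one box around all thresholded voxels, so it omits the largest-connected-component selection of Max3DBoxAcc, the per-box matching of Max3DBoxAccV2, and the sweep over thresholds that an average-precision metric such as VxAP would need.

```python
import numpy as np


def bounding_box_3d(mask):
    """Tightest axis-aligned box around the True voxels, as (lo, hi) with hi exclusive.

    Returns None when the mask is empty.
    """
    coords = np.argwhere(mask)
    if coords.size == 0:
        return None
    return coords.min(axis=0), coords.max(axis=0) + 1


def box_iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes given as (lo, hi)."""
    if box_a is None or box_b is None:
        return 0.0
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    # Negative extents mean no overlap along that axis, so clip them to zero.
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return float(inter / (vol_a + vol_b - inter))


def voxel_precision_recall(saliency, gt_mask, threshold):
    """Voxel-wise precision and recall of the thresholded saliency map.

    Convention (an assumption): an empty prediction or empty ground truth
    yields 1.0 for the corresponding ratio instead of dividing by zero.
    """
    pred = saliency >= threshold
    tp = np.logical_and(pred, gt_mask).sum()
    precision = tp / pred.sum() if pred.sum() > 0 else 1.0
    recall = tp / gt_mask.sum() if gt_mask.sum() > 0 else 1.0
    return float(precision), float(recall)


# Toy volume: the ground truth is a 4x4x4 cube, and the saliency map
# highlights the same cube shifted by one voxel along the first axis.
gt_mask = np.zeros((8, 8, 8), dtype=bool)
gt_mask[2:6, 2:6, 2:6] = True

saliency = np.zeros((8, 8, 8))
saliency[3:7, 2:6, 2:6] = 1.0

iou = box_iou_3d(bounding_box_3d(saliency >= 0.5), bounding_box_3d(gt_mask))
precision, recall = voxel_precision_recall(saliency, gt_mask, threshold=0.5)
print(iou, precision, recall)  # 0.6 0.75 0.75
```

In this toy case the two boxes overlap in 48 voxels out of a union of 80, giving an IoU of 0.6, while 48 of the 64 highlighted voxels fall inside the ground truth, giving a precision and recall of 0.75 each.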