Unmasking Biases and Reliability Concerns in Convolutional Neural Networks Analysis of Cancer Pathology Images

Michael Okonoda, Eder Martinez, Abhilekha Dalal, Lior Shamir

Abstract

Convolutional Neural Networks (CNNs) have shown promising effectiveness in identifying different types of cancer from radiographs. However, the opaque nature of CNNs makes it difficult to fully understand how they operate, limiting their assessment to empirical evaluation. Here we study the soundness of the standard practices by which CNNs are evaluated for the purpose of cancer pathology. Thirteen widely used cancer benchmark datasets were analyzed, using four common CNN architectures and different types of cancer, such as melanoma, carcinoma, colorectal cancer, and lung cancer. We compared the accuracy of each model on the original images with its accuracy on datasets made of segments cropped from the background of the original images, which do not contain clinically relevant content. Because the cropped datasets contain no clinical information, the null hypothesis is that the CNNs should provide only chance-level accuracy when classifying them. The results show that the CNN models provided high accuracy when using the cropped segments, sometimes as high as 93%, even though the segments lacked biomedical information. The results also show that some CNN architectures are more sensitive to bias than others. The analysis shows that the common practices of machine learning evaluation might lead to unreliable results when applied to cancer pathology. These biases are very difficult to identify, and might mislead researchers who use available benchmark datasets to test the efficacy of CNN methods.
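
The figure captions describe the cropped segments as 20 $\times$ 20 pixel sub-images taken from five positions in each original image: upper-left, upper-right, center, bottom-left, and bottom-right. The following is a minimal sketch of how such crops could be extracted; it is not the authors' code. The function names `five_crop_boxes` and `crop`, and the representation of an image as a list of pixel rows, are assumptions made for illustration.

```python
def five_crop_boxes(height, width, size=20):
    """Return (top, left, bottom, right) boxes for five size x size crops.

    Hypothetical helper: the positions follow the five sections named in
    the figure captions (four corners plus the center).
    """
    if height < size or width < size:
        raise ValueError("image is smaller than the requested crop size")
    center_top = (height - size) // 2
    center_left = (width - size) // 2
    origins = {
        "upper_left": (0, 0),
        "upper_right": (0, width - size),
        "center": (center_top, center_left),
        "bottom_left": (height - size, 0),
        "bottom_right": (height - size, width - size),
    }
    return {
        name: (top, left, top + size, left + size)
        for name, (top, left) in origins.items()
    }


def crop(image_rows, box):
    """Cut a box out of an image given as a sequence of pixel rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image_rows[top:bottom]]


if __name__ == "__main__":
    # Synthetic 28 x 28 image whose pixel value encodes its own position.
    image = [[r * 28 + c for c in range(28)] for r in range(28)]
    for name, box in five_crop_boxes(28, 28).items():
        patch = crop(image, box)
        print(name, box, len(patch), len(patch[0]))
```

Each set of crops would then be assembled into its own dataset, keeping the original class labels, and classified with the same CNN architectures as the full images. With an array library, the same boxes can be applied by slicing the array directly.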

Paper Structure (19 sections, 26 figures, 9 tables)

Figures (26)

  • Figure S1: Example of 20 $\times$ 20 cropped images from the ISIC 2017 skin cancer diagnosis dataset, taken from five sections of the original image: upper-left, upper-right, center, bottom-left, and bottom-right. These small sub-images contain no medical information, and therefore are not expected to be useful for identifying cancer at a level beyond mere chance. The empirical classification accuracy of each section is compared with the classification accuracy observed when using the original image.
  • Figure S2: Sample images from the different classes of the DermaMNIST dataset and their respective 20 $\times$ 20 cropped sections that have no meaningful medical content. Since they do not have any medically relevant information, these non-informative sub-images are not expected to be able to identify cancer.
  • Figure S3: Sample images across the MedMNIST datasets, showing the original images and their 20 $\times$ 20 cropped images. Mostly, the cropped images contain little or no medical content useful for correct diagnostics.
  • Figure S4: The classification accuracy of the BreastMNIST original image dataset alongside cropped 20 $\times$ 20 pixel sections. As expected, the CNN models can classify the original images. Surprisingly, all CNN models performed better than the random chance accuracy of 50% when applied to the cropped sections, highlighting that these models can classify images that lack lesion-specific features.
  • Figure S5: The classification accuracy, precision, recall, and F1 scores for the DermaMNIST dataset when each class contains 458 images.
  • ...and 21 more figures
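
The null hypothesis stated in the abstract is that a classifier trained on non-informative crops should reach only chance-level accuracy, which is 50% for a balanced two-class dataset as in Figure S4. The sketch below shows one way to check whether an observed accuracy exceeds that baseline, using a one-sided exact binomial test. The choice of test, the function names, and the example counts are assumptions for illustration; they are not taken from the paper.

```python
from math import exp, lgamma, log


def chance_accuracy(num_classes):
    """Chance-level accuracy, assuming the classes are balanced."""
    return 1.0 / num_classes


def binomial_p_value(correct, total, chance):
    """One-sided p-value P(X >= correct) for X ~ Binomial(total, chance).

    Computed in log space so that large test sets do not overflow.
    Requires 0 < chance < 1.
    """
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must be strictly between 0 and 1")
    log_p, log_q = log(chance), log(1.0 - chance)
    tail = 0.0
    for k in range(correct, total + 1):
        log_pmf = (
            lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
            + k * log_p + (total - k) * log_q
        )
        tail += exp(log_pmf)
    return min(tail, 1.0)


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    total_images, correctly_classified = 200, 130
    p = binomial_p_value(correctly_classified, total_images, chance_accuracy(2))
    print(f"accuracy={correctly_classified / total_images:.2f}, p={p:.3g}")
```

A small p-value on the cropped sections would indicate that the model is exploiting a signal unrelated to the clinical content, which is the kind of bias the paper sets out to expose. When the classes are not balanced, the majority-class frequency is a more appropriate baseline than `1 / num_classes`.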