How Does Pruning Impact Long-Tailed Multi-Label Medical Image Classifiers?

Gregory Holste, Ziyu Jiang, Ajay Jaiswal, Maria Hanna, Shlomo Minkowitz, Alan C. Legasto, Joanna G. Escalon, Sharon Steinberger, Mark Bittman, Thomas C. Shen, Ying Ding, Ronald M. Summers, George Shih, Yifan Peng, Zhangyang Wang

TL;DR

This study investigates how unstructured L1 pruning affects long-tailed, multi-label thorax disease classification on chest X-rays, a setting with meaningful clinical risk. By evaluating NIH-CXR-LT and MIMIC-CXR-LT across 30 random initializations and multiple sparsity levels, it analyzes overall performance with average precision, defines forgetting trajectories, and introduces pruning-identified exemplars (PIEs) to locate image-level vulnerabilities. The results show that rare diseases are disproportionately affected by pruning, with forgetting trajectories influenced by both disease frequency and co-occurrence, and PIEs clustering on challenging, multi-label CXRs; radiologists perceive PIEs as noisier and harder to diagnose. These findings provide foundational guidance for deploying pruned models in clinical imaging and point to future work on alternative pruning strategies, architectures, and training-time emphasis on PIEs to mitigate risks in deployment.
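
As a rough illustration of what unstructured L1 (magnitude) pruning at a given sparsity ratio means, the sketch below zeroes out the smallest-magnitude fraction of a flat list of weights. The function name `l1_prune` is hypothetical, and treating all weights as a single flat list is an assumption: this summary does not say whether pruning is applied per layer or globally, or whether any fine-tuning follows.

```python
def l1_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    # Unstructured: individual weights are removed, not whole filters or rows.
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


weights = [0.5, -0.1, 0.03, -0.8, 0.2]
sparse_weights = l1_prune(weights, sparsity=0.4)
```

With a sparsity ratio of 0.4 on five weights, the two entries closest to zero are removed and the remaining three are left unchanged.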

Abstract

Pruning has emerged as a powerful technique for compressing deep neural networks, reducing memory usage and inference time without significantly affecting overall performance. However, the nuanced ways in which pruning impacts model behavior are not well understood, particularly for long-tailed, multi-label datasets commonly found in clinical settings. This knowledge gap could have dangerous implications when deploying a pruned model for diagnosis, where unexpected model behavior could impact patient well-being. To fill this gap, we perform the first analysis of pruning's effect on neural networks trained to diagnose thorax diseases from chest X-rays (CXRs). On two large CXR datasets, we examine which diseases are most affected by pruning and characterize class "forgettability" based on disease frequency and co-occurrence behavior. Further, we identify individual CXRs where uncompressed and heavily pruned models disagree, known as pruning-identified exemplars (PIEs), and conduct a human reader study to evaluate their unifying qualities. We find that radiologists perceive PIEs as having more label noise, lower image quality, and higher diagnosis difficulty. This work represents a first step toward understanding the impact of pruning on model behavior in deep long-tailed, multi-label medical image classification. All code, model weights, and data access instructions can be found at https://github.com/VITA-Group/PruneCXR.
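
One way to make "images where uncompressed and heavily pruned models disagree" concrete is to score each image by how differently the two models rank the classes, then flag the lowest-scoring images. The sketch below does this under stated assumptions: the abstract does not specify the disagreement measure or the cutoff, so the Spearman rank correlation and the `fraction` parameter are illustrative choices, and `find_pies` is a hypothetical name.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant prediction vector carries no ranking
    return cov / (vx * vy) ** 0.5


def find_pies(dense_probs, pruned_probs, fraction=0.05):
    """Flag the `fraction` of images with the lowest agreement score.

    Each argument is a list of per-image class-probability vectors.
    The agreement measure and cutoff are assumptions for illustration.
    """
    scores = [spearman(d, p) for d, p in zip(dense_probs, pruned_probs)]
    n_pies = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_pies]), scores


dense = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.2, 0.7, 0.4]]
pruned = [[0.8, 0.2, 0.3], [0.1, 0.3, 0.9], [0.3, 0.6, 0.5]]
pie_indices, scores = find_pies(dense, pruned, fraction=0.34)
```

In this toy input, the pruned model reverses the class ranking only for the second image, so that image is the one flagged.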

Paper Structure (11 sections, 2 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overall effect of pruning on disease classification performance. Presented is the mean AP (median across 30 runs) for sparsity ratios $k \in \{0, \dots, 0.95\}$ (left) and log-scale histogram of model weight magnitudes (right).
  • Figure 2: "Forgettability curves" depicting relative change in AP (median across 30 runs at each sparsity ratio) upon L1 pruning for a subset of classes.
  • Figure 3: Relationship between class "forgettability" and frequency. We characterize which classes are forgotten first (left) and which are most forgotten (right).
  • Figure 4: Mutual relationship between pairs of diseases and their forgettability curves. For each pair of NIH-CXR-LT classes, FCD is plotted against the absolute difference in log frequency (left) and the IoU between the two classes (right).
  • Figure 5: Unique characteristics of PIEs. Presented is the ratio of class prevalence (left) and number of diseases per image (right) in PIEs relative to non-PIEs. The dotted line represents the 1:1 ratio (equally frequent in PIEs vs. non-PIEs).
  • ...and 2 more figures
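
The quantities named in the captions of Figures 2 and 4 can be sketched in a few lines. This is a minimal sketch under assumptions: the captions do not say whether the median is taken before or after computing the relative change (here it is taken first), the function names are hypothetical, and FCD is not defined in this excerpt, so it is not computed.

```python
import math
from statistics import median


def forgettability_curve(ap_runs_by_sparsity):
    """Relative change in median AP versus the unpruned model, for one class.

    `ap_runs_by_sparsity` maps a sparsity ratio to the list of per-run APs
    and must contain the key 0.0 (no pruning).
    """
    base = median(ap_runs_by_sparsity[0.0])
    return {
        k: (median(aps) - base) / base
        for k, aps in sorted(ap_runs_by_sparsity.items())
    }


def class_iou(images_a, images_b):
    """Intersection over union of the image sets labeled with two classes."""
    a, b = set(images_a), set(images_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def log_frequency_gap(count_a, count_b):
    """Absolute difference in log frequency between two classes."""
    return abs(math.log(count_a) - math.log(count_b))


curve = forgettability_curve({0.0: [0.40, 0.50, 0.60], 0.9: [0.20, 0.25, 0.30]})
iou = class_iou([1, 2, 3], [2, 3, 4])
gap = log_frequency_gap(1000, 10)
```

A curve value of -0.5 at sparsity 0.9 would mean the class lost half of its unpruned median AP at that sparsity level.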