Structured Model Pruning for Efficient Inference in Computational Pathology

Mohammed Adnan, Qinle Ba, Nazim Shaikh, Shivam Kalra, Satarupa Mukherjee, Auranuch Lorsakul

TL;DR

The paper addresses the challenge of deploying large AI models in digital and computational pathology by evaluating structured pruning as a way to reduce inference cost with minimal performance loss. It develops a pruning framework tailored to U-Net-style architectures and demonstrates it on nuclei instance segmentation and classification (HoverNet) and on CRC tissue classification, achieving substantial compression and latency reductions without a large drop in accuracy. By comparing pruning heuristics (L1/L2, Network Slimmer, Iterative Magnitude Pruning) and strategies (one-shot vs. iterative), and by carefully handling encoder-decoder skip connections, the study shows that significant speedups are achievable while preserving task performance. The findings support deploying pruned models at the edge in digital pathology workflows, with potential extensions to quantization and Vision Transformers to further improve efficiency in resource-constrained clinical settings.

Abstract

Recent years have seen significant efforts to adopt Artificial Intelligence (AI) in healthcare for various use cases, from computer-aided diagnosis to ICU triage. However, the size of AI models has been growing rapidly due to scaling laws and the success of foundation models, which makes it increasingly challenging to use advanced models in practical applications. It is thus imperative to develop efficient models, especially for deploying AI solutions under resource constraints or with time sensitivity. One potential solution is model compression, a set of techniques that remove less important model components or reduce parameter precision in order to reduce a model's computational demand. In this work, we demonstrate that model pruning, as a model compression technique, can effectively reduce inference cost for computational and digital pathology-based analysis with a negligible loss of analysis performance. To this end, we develop a methodology for pruning the U-Net-style architectures widely used in biomedical imaging, with which we evaluate multiple pruning heuristics on nuclei instance segmentation and classification, and empirically demonstrate that pruning can compress models by at least 70% with a negligible drop in performance.

Paper Structure (24 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of HoverNet and the proposed model pruning schema. Top: HoverNet design with three identical decoder branches. Middle: dependencies between the last convolution (conv) layer of each residual block and the skip-connected decoder layer. Horizontal red bars denote pruned conv filters (output dimension); the matching channel indices (vertical bars) are pruned in the interdependent layers. Bottom: a 2D view of pruning consecutive conv layers. Here, each grid represents a 2D kernel (e.g., 3x3). The rows denote the output dimension (number of filters) and the columns denote the input dimension, which equals the number of feature maps in the layer input. Pruning conv filters from one layer (i.e., the red filters in the first block) requires removing the corresponding kernels along the input dimension of the next conv layer (i.e., the equivalent red filters in the second block). Similarly, the green filters illustrate pruning of the next two conv layers.
  • Figure 2: The effects of filter importance heuristics (Left) and pruning strategies (Right) on mPQ.
  • Figure S1: Example intra-block pruning of ResNet18 up to the 3rd residual block. In each residual block, the output dimension (channels) of the 1st convolution layer, the subsequent batch normalization layer, and the input dimension (channels) of the 2nd convolution layer are pruned with the same ratio and indexing, determined by, for example, the filter importance ranking of the L1/L2 pruners. Grayed channel dimensions are pruned with pruning ratios of 1/a, 1/b, 1/c, etc.
  • Figure S2: Example inter-block pruning of ResNet18 up to the 3rd residual block. The yellow highlighted channel dimensions belong to an interconnected group of layers, so the same pruning ratio (1/i) and pruning indices should be applied to keep the channel dimensions matched after pruning. The bold blue highlighted channel dimension (e.g., the conv1 output channel dimension) was used in our study for ranking filter importance with the L1/L2 pruners. The green highlighted channel dimensions belong to another interconnected group, only part of which is shown in this illustration. Note that the 1x1 conv for downsampling makes the green highlighted group of layers an independent group. Intra-channel pruning is also illustrated with the grayed channel dimensions along with their pruning ratios.
  • Figure S3: Per-fold PQ for various pruning heuristics and sparsity ratios.
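
The dependency described in the Figure 1 caption (removing filters from one conv layer requires removing the matching input kernels in the next conv layer) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the nested-list weight layout `[out_channel][in_channel][row][col]`, and the toy weights are all assumptions made for illustration, and batch normalization layers and skip-connected groups are deliberately left out.

```python
from typing import List, Tuple

Kernel = List[List[float]]        # one 2D kernel, e.g. 3x3
ConvWeight = List[List[Kernel]]   # indexed [out_channel][in_channel]


def l1_filter_norms(weight: ConvWeight) -> List[float]:
    """L1 norm of each output filter (sum of absolute kernel values)."""
    return [
        sum(abs(v) for kernel in filt for row in kernel for v in row)
        for filt in weight
    ]


def select_filters_to_keep(weight: ConvWeight, prune_ratio: float) -> List[int]:
    """Indices of the filters that survive after dropping the lowest-L1 fraction."""
    norms = l1_filter_norms(weight)
    n_prune = int(len(norms) * prune_ratio)
    # Rank filters from least to most important; ties are broken by index.
    ranked = sorted(range(len(norms)), key=lambda i: (norms[i], i))
    return sorted(ranked[n_prune:])


def prune_consecutive(
    w1: ConvWeight, w2: ConvWeight, prune_ratio: float
) -> Tuple[ConvWeight, ConvWeight, List[int]]:
    """Prune filters of w1 and the matching input channels of the next layer w2."""
    keep = select_filters_to_keep(w1, prune_ratio)
    pruned_w1 = [w1[i] for i in keep]                     # drop rows (filters)
    pruned_w2 = [[filt[i] for i in keep] for filt in w2]  # drop matching columns
    return pruned_w1, pruned_w2, keep


def make_weight(out_ch: int, in_ch: int, scales: List[float]) -> ConvWeight:
    """Hypothetical toy weights: every value in filter o equals scales[o]."""
    return [
        [[[scales[o]] * 3 for _ in range(3)] for _ in range(in_ch)]
        for o in range(out_ch)
    ]


if __name__ == "__main__":
    w1 = make_weight(4, 2, [0.5, 0.1, 0.9, 0.2])  # 4 filters, 2 input channels
    w2 = make_weight(3, 4, [1.0, 1.0, 1.0])       # 3 filters, 4 input channels
    p1, p2, keep = prune_consecutive(w1, w2, prune_ratio=0.5)
    print("kept filter indices:", keep)
    print("layer 1 shape:", len(p1), "x", len(p1[0]))
    print("layer 2 shape:", len(p2), "x", len(p2[0]))
```

In this sketch the same `keep` indices drive both layers, which is the point of the caption: the output dimension of one layer and the input dimension of the next must be pruned together. For an interconnected group such as the skip-connected layers in Figures 1 and S2, the same idea would extend to applying one shared index list to every layer in the group.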