Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning

Jiuyang Dong, Junjun Jiang, Kui Jiang, Jiahan Li, Yongbing Zhang

TL;DR

This work tackles the high inference cost of multi-instance learning on gigapixel whole-slide images by introducing HDMIL, a hierarchical distillation MIL framework. It combines a dynamic multi-instance network (DMIN) operating on high-resolution WSIs to generate instance relevance masks and a lightweight instance pre-screening network (LIPN) operating on low-resolution WSIs to predict patch relevance, enabling efficient inference with minimal performance loss. A Chebyshev-polynomials-based Kolmogorov-Arnold (CKA) classifier enhances the aggregation of bag representations. Across Camelyon16, TCGA-NSCLC, and TCGA-BRCA, HDMIL surpasses state-of-the-art MIL methods in AUC and accuracy while substantially reducing inference time (e.g., up to 28.6% on Camelyon16). These results demonstrate a practical path to fast and accurate WSI classification by discarding irrelevant patches in a principled, distillation-driven manner.

Abstract

Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces high inference costs because it must process numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL consists of two key components: the dynamic multi-instance network (DMIN) and the lightweight instance pre-screening network (LIPN). DMIN operates on high-resolution WSIs, while LIPN operates on the corresponding low-resolution counterparts. During training, DMIN is trained for WSI classification while generating attention-score-based masks that indicate irrelevant patches. These masks then guide the training of LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first determines the useful regions within low-resolution WSIs, which indirectly enables us to eliminate irrelevant regions in high-resolution WSIs, thereby reducing inference time without causing performance degradation. In addition, we design the first Chebyshev-polynomials-based Kolmogorov-Arnold classifier in computational pathology, which enhances the performance of HDMIL through learnable activation layers. Extensive experiments on three public datasets demonstrate that HDMIL outperforms previous state-of-the-art methods, e.g., achieving improvements of 3.13% in AUC while reducing inference time by 28.6% on the Camelyon16 dataset.
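To make the idea of a Chebyshev-polynomials-based Kolmogorov-Arnold classifier concrete, here is a minimal sketch of one such layer. It is not the authors' implementation. It assumes that inputs are squashed into [-1, 1] with tanh, that the polynomial degree is 3, and that coefficients start from a small Gaussian initialization; the names `chebyshev_basis` and `ChebyshevKALayer` are hypothetical. The point it illustrates is that every input-output edge carries its own learnable activation, written as a weighted sum of Chebyshev polynomials.

```python
import math
import random


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using T_{k+1} = 2x*T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


class ChebyshevKALayer:
    """Illustrative Kolmogorov-Arnold layer with Chebyshev activations.

    Assumption: output_j = sum_i sum_k coeffs[i][j][k] * T_k(tanh(x_i)).
    The coefficients are the learnable parameters of each edge activation.
    """

    def __init__(self, in_dim, out_dim, degree=3, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / (in_dim * (degree + 1))  # assumed initialization scale
        self.degree = degree
        self.out_dim = out_dim
        self.coeffs = [
            [[rng.gauss(0.0, scale) for _ in range(degree + 1)]
             for _ in range(out_dim)]
            for _ in range(in_dim)
        ]

    def forward(self, x):
        out = [0.0] * self.out_dim
        for i, xi in enumerate(x):
            # tanh keeps the argument inside the Chebyshev domain [-1, 1].
            basis = chebyshev_basis(math.tanh(xi), self.degree)
            for j in range(self.out_dim):
                out[j] += sum(c * t for c, t in zip(self.coeffs[i][j], basis))
        return out


# Usage: map a 4-dimensional bag representation to 2 class logits.
layer = ChebyshevKALayer(in_dim=4, out_dim=2)
logits = layer.forward([0.3, -1.2, 0.0, 2.5])
```

In a real model the coefficients would be trained by gradient descent in a deep-learning framework; plain Python is used here only to keep the sketch self-contained.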

Paper Structure

This paper contains 13 sections, 14 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: What makes inference slow? (a) Time-consuming data pre-processing: Comparing the time required for data pre-processing (WSI cropping, feature extraction) with that of MIL network classification shows that data pre-processing is the main speed bottleneck. (b) Redundant irrelevant patches: For example, in a randomly selected WSI, numerous instances have extremely low ABMIL attention scores, indicating that they contribute little, if anything, to the bag-level classification.
  • Figure 2: Overview of our HDMIL framework. (a) During training, we start by utilizing the high-resolution WSI $X_{i,HR}$ for self-distillation of DMIN, enabling it to classify $X_{i,HR}$ and generate the per-instance mask $M_{i,HR}$, which indicates the relevance of each region to the bag-level classification. Afterwards, we freeze DMIN and employ the masks $M_{i,HR}$ to distill LIPN, which learns the contribution of each region using the low-resolution $X_{i,LR}$. (b) During inference, LIPN identifies which patches within $X_{i,HR}$ need to be used for classification by evaluating $X_{i,LR}$. (c) The self-distillation training of DMIN on the high-resolution $X_{i,HR}$.
  • Figure 3: Visualization analysis of two randomly selected WSIs. The pathologists marked the tumor areas in the input WSIs with red lines. The dual-branch attention maps in DMIN ("Attention1" and "Attention2") are shown, and the instances selected by LIPN are marked with blue masks ("Retained Region").
  • Figure 4: The impact of the preset instance retention rate $\mathbf{r}$ (hyper-parameter) on classification performance, actual learned instance retention ratio, and inference time (seconds).
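
The training and inference flow described for Figure 2, together with the retention rate $\mathbf{r}$ from Figure 4, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's method: the mask is built here by keeping the top fraction of instances by attention score, whereas the paper learns its masks; the LIPN is represented only by its per-patch relevance scores; and the function names and the 0.5 threshold are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_mask(attention, retention_rate):
    """Binary mask keeping the top `retention_rate` fraction of instances.

    Assumption: a plain top-k rule stands in for the learned mask M_{i,HR}.
    """
    keep = max(1, math.ceil(retention_rate * len(attention)))
    ranked = sorted(range(len(attention)),
                    key=lambda i: attention[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(len(attention))]


def prescreen(lr_scores, threshold=0.5):
    """Indices of patches that a (hypothetical) LIPN marks as relevant,
    given its relevance scores on the low-resolution patches."""
    return [i for i, s in enumerate(lr_scores) if s >= threshold]


def attention_pool(features, attention_logits):
    """ABMIL-style pooling: attention-weighted mean of instance features."""
    weights = softmax(attention_logits)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]


# Training side: DMIN attention yields a mask that supervises the LIPN.
dmin_attention = [0.01, 0.40, 0.02, 0.57]
mask = relevance_mask(dmin_attention, retention_rate=0.5)

# Inference side: only patches retained by the LIPN are cropped at high
# resolution, encoded, and pooled into a bag representation.
lipn_scores = [0.1, 0.9, 0.2, 0.8]
kept = prescreen(lipn_scores)
hr_features = {1: [1.0, 2.0], 3: [3.0, 4.0]}  # features for kept patches only
bag = attention_pool([hr_features[i] for i in kept], [0.0, 0.0])
```

The saving comes from the inference side: patches that the pre-screening step discards are never cropped or encoded at high resolution, which is the step Figure 1(a) identifies as the bottleneck.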