Attention to detail: inter-resolution knowledge distillation

Rocío del Amor, Julio Silva-Rodríguez, Adrián Colomer, Valery Naranjo

TL;DR

The paper tackles the high computational cost of classifying gigapixel whole-slide images by enabling accurate, low-resolution inference through inter-resolution knowledge distillation. It introduces an attention-constrained KD framework that transfers discriminative-region information from a high-resolution Teacher to a low-resolution Student via a novel $AT^{+}$ loss based on Grad-CAMs, combined with standard cross-entropy and feature-matching terms. On the SICAPv2 prostate histology dataset for patch-level Gleason grading, the method yields about a 3 percentage-point gain in accuracy at $2.5\times$ and $1.25\times$ magnifications and can reach Teacher-level performance with $8\times$ lower resolution, while reducing required augmentations. This approach offers a practical path to deploying accurate digital pathology models in resource-constrained settings, enabling faster, more scalable analysis.

Abstract

The development of computer vision solutions for gigapixel images in digital pathology is hampered by significant computational limitations due to the large size of whole slide images. In particular, digitizing biopsies at high resolutions is a time-consuming process, yet it is necessary because results worsen as image detail decreases. To alleviate this issue, recent literature has proposed using knowledge distillation to enhance the model performance at reduced image resolutions. In particular, soft labels and features extracted at the highest magnification level are distilled into a model that takes lower-magnification images as input. However, this approach fails to transfer knowledge about the most discriminative image regions in the classification process, which may be lost when the resolution is decreased. In this work, we propose to distill this information by incorporating attention maps during training. In particular, our formulation leverages saliency maps of the target class obtained via Grad-CAMs, and guides the lower-resolution Student model to match the Teacher distribution by minimizing the $\ell_2$ distance between them. Comprehensive experiments on prostate histology image grading demonstrate that the proposed approach substantially improves the model performance across different image resolutions compared to previous literature.
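
The attention term described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes that the Grad-CAM channel weights are the spatial mean of only the positive part of the target-class gradients, that both maps are L2-normalized before comparison, and that the Teacher and Student maps have already been resized to the same spatial size. All function names are hypothetical.

```python
import math


def positive_grad_cam(features, grads):
    """Grad-CAM-style map from K feature channels and their target-class gradients.

    Both arguments are lists of K channels, each an H x W nested list.
    Assumption: only positive gradients contribute to the channel weights.
    """
    height, width = len(features[0]), len(features[0][0])
    # One weight per channel: spatial mean of the positive gradients.
    weights = [
        sum(max(g, 0.0) for row in channel for g in row) / (height * width)
        for channel in grads
    ]
    # Weighted sum over channels, followed by a ReLU.
    return [
        [
            max(sum(w * ch[i][j] for w, ch in zip(weights, features)), 0.0)
            for j in range(width)
        ]
        for i in range(height)
    ]


def l2_normalize(cam):
    """Scale a map to unit L2 norm (an assumed normalization choice)."""
    norm = math.sqrt(sum(v * v for row in cam for v in row))
    if norm == 0.0:
        return [list(row) for row in cam]
    return [[v / norm for v in row] for row in cam]


def attention_matching_loss(teacher_cam, student_cam):
    """L2 distance between normalized Teacher and Student attention maps."""
    t = l2_normalize(teacher_cam)
    s = l2_normalize(student_cam)
    return math.sqrt(
        sum((a - b) ** 2 for rt, rs in zip(t, s) for a, b in zip(rt, rs))
    )
```

In this sketch the loss is zero when the Student attends to exactly the same regions as the Teacher, and it grows as the two maps diverge.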

Paper Structure (9 sections, 5 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Method overview. In the context of inter-resolution knowledge distillation, a Teacher model is trained using high-resolution images by optimizing Eq. \ref{eq:ce}. To enable the deployment of efficient models that can operate at low resolutions, we train a Student model by transferring the information from the frozen Teacher. We use the well-known feature-matching distillation (Eq. \ref{eq:fm}) and propose a novel attention-matching term, $AT^{+}$ (Eq. \ref{eq:fm}), which distills spatial information of relevant regions in the image by using strictly positive gradient weighting for attention generation (Eq. \ref{eq:am}). Both terms are combined with the standard cross-entropy loss (Eq. \ref{eq:all}) for the optimization of the Student model.
  • Figure 2: Ablation study of the effect of attention map normalization on the method performance. The Teacher model trained at the different resolutions is used as a baseline.
  • Figure 3: Qualitative assessment of the effect of the attention matching ($AT^{+}$) term. The top row presents original images at different resolution levels (augmentations). The second row shows the target-class output of a Student trained without any knowledge distillation, together with the attention map it produces. The last row shows the effect of distilling the knowledge from the Teacher model trained at $10\times$ magnification. Green probabilities indicate a correctly classified sample, while red indicates the opposite.
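
The Figure 1 caption states that the Student is optimized with a cross-entropy term combined with feature-matching and attention-matching terms. The sketch below shows one way such a combined objective could look. It is not the paper's formulation: the mean-squared form of the feature-matching term and the weights `lambda_fm` and `lambda_at` are assumptions, and the attention term is passed in as a precomputed number.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class given softmax outputs."""
    return -math.log(probs[target])


def feature_matching(teacher_feat, student_feat):
    """Mean squared difference between feature vectors (assumed form)."""
    return sum(
        (t - s) ** 2 for t, s in zip(teacher_feat, student_feat)
    ) / len(teacher_feat)


def student_objective(probs, target, teacher_feat, student_feat,
                      attention_term, lambda_fm=1.0, lambda_at=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (
        cross_entropy(probs, target)
        + lambda_fm * feature_matching(teacher_feat, student_feat)
        + lambda_at * attention_term
    )
```

The Teacher is frozen during this step, so its features and attention maps act as fixed targets and only the Student's parameters would be updated.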