Momentum Memory for Knowledge Distillation in Computational Pathology

Yongxin Guo; Hao Lu; Onur C. Koyun; Zhengjie Zhu; Muhammet Fatih Demir; Metin Nafi Gurcan

Abstract

Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology-genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance. To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time. Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.
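The momentum-updated memory mentioned in the abstract can be illustrated with a minimal sketch. It assumes the memory is a list of slot vectors, each updated by an exponential moving average toward the mean of the batch features nearest to it. The function names, the nearest-slot assignment rule, and the `momentum` value are illustrative assumptions, not details taken from the paper.

```python
def nearest_slot(memory, feature):
    """Index of the memory slot closest to `feature` (squared Euclidean distance)."""
    return min(
        range(len(memory)),
        key=lambda j: sum((m - f) ** 2 for m, f in zip(memory[j], feature)),
    )


def momentum_update(memory, batch, momentum=0.9):
    """Return a new memory in which each slot that received features moves
    toward their mean: slot <- momentum * slot + (1 - momentum) * mean.

    Hypothetical sketch: the assignment rule and momentum value are assumptions.
    """
    # Group the batch features by their nearest memory slot.
    assigned = {}
    for feature in batch:
        assigned.setdefault(nearest_slot(memory, feature), []).append(feature)

    # Copy the memory so slots without assigned features stay unchanged.
    updated = [list(slot) for slot in memory]
    for j, feats in assigned.items():
        dim = len(memory[j])
        mean = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        updated[j] = [
            momentum * memory[j][d] + (1.0 - momentum) * mean[d] for d in range(dim)
        ]
    return updated


memory = [[0.0, 0.0], [10.0, 10.0]]
batch = [[1.0, 1.0], [3.0, 3.0]]
new_memory = momentum_update(memory, batch, momentum=0.9)
```

Because each slot changes only slightly per batch, the memory carries information from earlier batches forward, which is the sense in which it enlarges the context available to a single mini-batch.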

Paper Structure (32 sections, 11 equations, 4 figures, 4 tables)

Introduction
Related Works
Application of MIL in WSIs
Multi-modal Knowledge Distillation for WSI analysis
Methodology
Overview
Dual-Branch Encoding
Graph-based WSI encoder.
Omics encoder.
Momentum Knowledge-distillation via Cross-Modal Alignment
Momentum Memory as Knowledge Mediator
Indirect Memory-based Distillation
Semantic Anchoring via Omics Alignment.
Knowledge Transfer via WSI Alignment.
Memory Evolution via Gradient Decoupling.
...and 17 more sections

Figures (4)

Figure 1: (a) The classical teacher-student knowledge distillation (KD) method. (b) The correlation-based KD method. (c) The proposed momentum memory KD method. Unlike (a) and (b), the proposed method distills knowledge through a momentum memory to solve the batch-local problem.
Figure 2: Overview of the proposed momentum memory knowledge distillation framework. (a) The multi-modal KD training process. (b) The proposed cross-modal alignment. Here the input is assumed to be a positive case, so the alignment loss pulls it closer to the positive memory set (blue triangle) and pushes it away from the negative memory set (yellow star). (c) The uni-modal inference stage and the visual interpretation of the memory.
Figure 3: Memory dynamics across biomarker prediction tasks. Each column corresponds to one task (HER2, PR, and ODX), and each row shows one statistic of the memory storage: (top) active memory ratio and (bottom) perplexity. The active memory ratio quantifies the proportion of memory components utilized during training and serves as a check that no dead memory components emerge. The corresponding perplexity represents the effective number of active memory components ($c_j^+$ and $c_j^-$). Across all tasks, the momentum memory maintains high active ratios ($>0.75$) and high perplexity, indicating stable and interpretable utilization of the memory storage.
Figure 4: Memory-based visualization with the mapped patches in the WSI. The upper panel shows a positive case and the lower panel a negative case, based on the Oncotype DX test. For each memory component, the top-3 mapped patches are shown.
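The two memory statistics described in the Figure 3 caption can be sketched as follows. The sketch assumes that "utilized" means a memory component received at least one hard assignment, and that perplexity is the exponential of the Shannon entropy of the assignment distribution (the standard definition of an effective count). The function name and the hard-assignment input are hypothetical and may differ from how the paper computes these quantities.

```python
import math
from collections import Counter


def memory_usage_stats(assignments, num_components):
    """Return (active_ratio, perplexity) for a list of memory-component indices.

    Assumptions (not from the paper): `assignments` holds one hard index per
    sample, and a component counts as active if it appears at least once.
    """
    total = len(assignments)
    if total == 0:
        return 0.0, 0.0

    counts = Counter(assignments)
    # Fraction of memory components that were used at least once.
    active_ratio = len(counts) / num_components

    # Perplexity = exp(entropy) of the empirical usage distribution.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    perplexity = math.exp(entropy)
    return active_ratio, perplexity


uniform_ratio, uniform_ppl = memory_usage_stats([0, 1, 2, 3], num_components=4)
collapsed_ratio, collapsed_ppl = memory_usage_stats([0, 0, 0, 0], num_components=8)
```

Under these assumptions, uniform usage of all components gives a perplexity equal to the number of components, while a collapse onto a single component gives a perplexity of 1.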
