Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation

Jing Jin, Xu Liu, Te Gao, Zhihong Shi, Yixiong Liang, Ruiqing Zheng, Hulin Kuang, Min Zeng, Shichao Kan

TL;DR

This work tackles the challenge of learning end-to-end representations for gigapixel whole slide images by introducing Dynamic Residual Encoding with Slide-Level Contrastive Learning (DRE-SLCL). It combines a memory bank of tile features, a fixed K-means VLAD codebook for residual encoding, and a slide-level cross-modal contrastive objective with pathology reports encoded by LLaMA2-7B, enabling end-to-end optimization across tile-level and slide-level tasks. The approach yields strong improvements in cancer subtyping, cancer recognition, and gene mutation prediction on TCGA and CPTAC lung datasets, while maintaining a compact model footprint (<27M parameters) and feasible training times on a single GPU. This integration of dynamic residual encoding with cross-modal supervision demonstrates practical potential for scalable, data-efficient computational pathology and could extend to broader cancer types and clinical settings.
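For readers unfamiliar with the two building blocks named above, the following is a sketch of their standard textbook forms. The symbols are assumptions introduced here for illustration ($x_i$ for tile features, $c_k$ for the $K$ codewords, $s_i$ and $t_i$ for the normalized slide and report embeddings of the $i$-th WSI in a batch of size $B$, $\tau$ for a temperature); the paper's own equations may differ in detail.

```latex
% VLAD residual encoding: sum the residuals of the tile features assigned to each codeword
v_k = \sum_{i \,:\, \arg\min_j \lVert x_i - c_j \rVert_2 = k} (x_i - c_k),
\qquad
V = \frac{[v_1; v_2; \dots; v_K]}{\lVert [v_1; v_2; \dots; v_K] \rVert_2}

% Symmetric slide-report contrastive (InfoNCE-style) loss over a mini-batch
\mathcal{L} = -\frac{1}{2B} \sum_{i=1}^{B} \left[
  \log \frac{\exp(s_i^\top t_i / \tau)}{\sum_{j=1}^{B} \exp(s_i^\top t_j / \tau)}
  + \log \frac{\exp(s_i^\top t_i / \tau)}{\sum_{j=1}^{B} \exp(s_j^\top t_i / \tau)}
\right]
```

With a fixed codebook, the slide vector $V$ depends on the tile features only through the residuals, which is what allows gradients from the slide-level loss to reach the tile encoder.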

Abstract

Whole Slide Image (WSI) representation is critical for cancer subtyping, cancer recognition and mutation prediction. Training an end-to-end WSI representation model poses significant challenges, as a standard gigapixel slide can contain tens of thousands of image tiles, making it difficult to compute gradients of all tiles in a single mini-batch due to current GPU limitations. To address this challenge, we propose a method of dynamic residual encoding with slide-level contrastive learning (DRE-SLCL) for end-to-end WSI representation. Our approach utilizes a memory bank to store the features of tiles across all WSIs in the dataset. During training, a mini-batch usually contains multiple WSIs. For each WSI in the batch, a subset of tiles is randomly sampled and their features are computed using a tile encoder. Then, additional tile features from the same WSI are selected from the memory bank. The representation of each individual WSI is generated using a residual encoding technique that incorporates both the sampled features and those retrieved from the memory bank. Finally, the slide-level contrastive loss is computed based on the representations and histopathology reports of the WSIs within the mini-batch. Experiments on cancer subtyping, cancer recognition, and mutation prediction tasks demonstrate the effectiveness of the proposed DRE-SLCL method.
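The training step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear `tile_encoder`, the random `codebook`, the projection `W_proj`, the random report embeddings, the hard nearest-codeword assignment, the in-place refresh of the memory bank, and all sizes are hypothetical stand-ins, and no gradients are computed.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlad_encode(features, codebook):
    """Aggregate (N, D) tile features into one L2-normalized (K*D,) vector."""
    # Squared distance from every feature to every codeword: shape (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)                 # hard assignment (an assumption)
    vlad = np.zeros_like(codebook)
    for k in range(codebook.shape[0]):
        members = features[assign == k]
        if len(members):
            vlad[k] = (members - codebook[k]).sum(axis=0)   # summed residuals
    vlad = vlad.reshape(-1)
    return vlad / (np.linalg.norm(vlad) + 1e-12)

def slide_vector(bank_feats, tiles, tile_encoder, codebook, n_sample, n_bank):
    """Combine freshly encoded tiles with cached features from the memory bank."""
    perm = rng.permutation(bank_feats.shape[0])
    sampled, cached = perm[:n_sample], perm[n_sample:n_sample + n_bank]
    fresh = tile_encoder(tiles[sampled])       # in training, gradients flow through these
    bank_feats[sampled] = fresh                # assumption: refresh the bank in place
    feats = np.concatenate([fresh, bank_feats[cached]], axis=0)
    return vlad_encode(feats, codebook)

def contrastive_loss(slide_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of each input describes the same WSI."""
    s = slide_emb / np.linalg.norm(slide_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))

# Toy setup: every "tile" is a 32-d vector; all weights are random stand-ins.
D, K, P = 8, 4, 16                             # feature dim, codewords, shared dim
W_enc = rng.normal(size=(32, D))
tile_encoder = lambda x: x @ W_enc             # hypothetical tile encoder
codebook = rng.normal(size=(K, D))             # would come from K-means in practice
W_proj = rng.normal(size=(K * D, P))           # hypothetical projection head

slides = [rng.normal(size=(n, 32)) for n in (50, 80, 65)]
bank = [tile_encoder(s) for s in slides]       # memory bank, one array per WSI
slide_emb = np.stack([
    slide_vector(b, s, tile_encoder, codebook, n_sample=10, n_bank=30)
    for b, s in zip(bank, slides)
]) @ W_proj
report_emb = rng.normal(size=(3, P))           # stand-in for report text embeddings
loss = contrastive_loss(slide_emb, report_emb)
print(f"toy slide-level contrastive loss: {loss:.3f}")
```

Only the `n_sample` freshly encoded tiles would need gradients in a real training loop, which is what keeps the per-batch memory cost bounded even when a slide has tens of thousands of tiles.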

Paper Structure

This paper contains 28 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: WSI representation learning. (a) The pipeline of the two-stage learning methods and (b) the proposed end-to-end learning method.
  • Figure 2: The end-to-end whole slide image representation framework with the proposed dynamic residual encoding with slide-level contrastive learning (DRE-SLCL).
  • Figure 3: Ablation results for TP53 gene mutation prediction.