Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology

Eric Zimmermann, Julian Viret, Michal Zelechowski, James Brian Hall, Neil Tenenholtz, Adam Casson, George Shaikovski, Eugene Vorontsov, Siqi Liu, Kristen A Severson

TL;DR

This work explores a design space for pretraining mixed-magnification region aggregators and evaluates the resulting models on transfer to biomarker prediction tasks across several cancer types, demonstrating cancer-dependent improvements in predictive performance.

Abstract

In recent years, a standard computational pathology workflow has emerged in which whole slide images are cropped into tiles, the tiles are processed by a foundation model, and task-specific models are built on the resulting representations. At least 15 different foundation models have been proposed, and the vast majority are trained exclusively on tiles at 20$\times$ magnification. However, it is well known that certain histologic features can only be discerned with larger context windows, requiring a pathologist to zoom in and out when analyzing a whole slide image. Furthermore, creating 224$\times$224 pixel crops at 20$\times$ yields a large number of tiles per slide, since a slide can be gigapixel in size. To more accurately capture multi-resolution features and investigate the possibility of reducing the number of representations per slide, we propose a region-level mixing encoder. Our approach jointly fuses image tile representations of a mixed-magnification foundation model using a masked embedding modeling pretraining step. We explore a design space for pretraining the proposed mixed-magnification region aggregators and evaluate our models on transfer to biomarker prediction tasks representing various cancer types. Results demonstrate cancer-dependent improvements in predictive performance, highlighting the importance of spatial context and understanding.
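The tile-count argument in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below counts non-overlapping 224$\times$224 tiles for a hypothetical slide; the slide dimensions (100,000 by 80,000 pixels at 20$\times$) and the helper name `tiles_per_slide` are assumptions made for illustration and do not come from the paper. Lower magnifications are modelled as plain downsampling, and background filtering, which would reduce the counts in practice, is ignored.

```python
import math

TILE_PX = 224  # tile side length in pixels, as stated in the abstract


def tiles_per_slide(width_px_20x, height_px_20x, magnification):
    """Count non-overlapping 224x224 tiles covering a slide.

    The slide size is given in pixels at 20x. Lower magnifications are
    modelled as downsampling by a factor of 20 / magnification.
    """
    scale = magnification / 20.0
    width = math.ceil(width_px_20x * scale)
    height = math.ceil(height_px_20x * scale)
    return math.ceil(width / TILE_PX) * math.ceil(height / TILE_PX)


# Hypothetical slide size (an assumption, not a figure from the paper).
counts = {mag: tiles_per_slide(100_000, 80_000, mag) for mag in (5, 10, 20)}
for mag, n in counts.items():
    print(f"{mag}x: {n} tiles")
```

Under these assumptions, each halving of the magnification cuts the tile count by roughly a factor of four, which is the motivation for asking whether fewer, region-level representations per slide could suffice.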

Paper Structure (17 sections, 9 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: (a) An example of a WSI. (b) A 1344$\times$1344 micron region of the WSI. (c), (d), (e) Successively zoomed-in 224$\times$224 pixel crops of panel (b) at 5$\times$, 10$\times$, and 20$\times$, respectively. These crops show the features visible at each magnification, ranging from tissue organization to individual cells.
  • Figure 2: An overview of the pretraining framework for the mixing encoder. The basic setting takes as input masked embeddings from the 5$\times$, 10$\times$, and 20$\times$ magnifications covering a 3$\times$3 region at 5$\times$ magnification. (a) An encoder-decoder architecture trained with a masked reconstruction loss on patch embeddings. (b) An encoder-projector architecture trained with a contrastive loss on CLS embeddings (denoted with *), using crops from a larger context as input augmentations. During experimentation, two aspects of the masked reconstruction base setup are investigated: the masking rate and the addition of the contrastive branch.
  • Figure 3: Summarized results of various aspects of the pretraining framework as measured by difference in AUROC (higher is better). "Pretrained" refers to MEM and CMEM results averaged over biomarker tasks and over removal ratio $r$ and source size $c$, respectively. "Random" refers to the setting with no pretraining, averaged over possible magnifications. Overall, pretraining improves results whether compared to AB-MIL or to a randomly initialized model. MEM generally outperforms CMEM, and the contextualized region (Patch) embeddings outperform the compressed (CLS) embedding.
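
The region geometry in the captions of Figures 1 and 2 can be checked with a short sketch. It assumes the common convention of roughly 2.0, 1.0, and 0.5 microns per pixel at 5$\times$, 10$\times$, and 20$\times$; these values are not stated in the captions. Under that assumption, a 224-pixel tile at 5$\times$ spans 448 microns, so a 3$\times$3 grid covers the 1344 micron region of Figure 1. The sketch also enumerates the tile positions at all three magnifications and hides a random subset, as the input side of a masked reconstruction setup might. The function names, the non-overlapping tiling, and the 0.5 mask ratio are hypothetical choices for illustration, not the authors' configuration.

```python
import random

TILE_PX = 224      # tile side length in pixels
REGION_UM = 1344   # region side length in microns, from Figure 1
# Assumed microns-per-pixel per magnification (convention, not from the paper).
MPP = {5: 2.0, 10: 1.0, 20: 0.5}


def grid_side(magnification):
    """Number of non-overlapping 224-pixel tiles along one side of the region."""
    tile_um = TILE_PX * MPP[magnification]
    return int(REGION_UM // tile_um)


def region_tokens():
    """List (magnification, row, col) for every tile position in the region."""
    tokens = []
    for mag in (5, 10, 20):
        side = grid_side(mag)
        tokens.extend((mag, r, c) for r in range(side) for c in range(side))
    return tokens


def sample_mask(num_tokens, mask_ratio, seed=0):
    """Choose which token indices to hide for masked reconstruction."""
    rng = random.Random(seed)
    num_masked = int(mask_ratio * num_tokens)  # rounds down
    return sorted(rng.sample(range(num_tokens), num_masked))


tokens = region_tokens()
masked = sample_mask(len(tokens), mask_ratio=0.5)  # hypothetical ratio
print(len(tokens), "tokens in the region,", len(masked), "masked")
```

With these assumed resolutions, one region holds 9 tiles at 5$\times$, 36 at 10$\times$, and 144 at 20$\times$, which gives a sense of how many tile embeddings a region-level encoder would need to fuse.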