Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings

Christian Grashei, Christian Brechenmacher, Rao Muhammad Umer, Jingsong Liu, Carsten Marr, Ewa Szczurek, Peter J. Schüffler

TL;DR

Pathryoshka tackles the deployment barriers of pathology foundation models by combining multi-teacher knowledge distillation with Matryoshka-style nested embeddings, producing compact ViT students that retain high performance. It uses a large unlabeled tile dataset and a cropping-based augmentation strategy to fuse signals from three strong pathology FMs (Virchow2, UNI2-h, H-optimus-1) into Pathryoshka-B (87M parameters) and Pathryoshka-S (22M parameters). The approach achieves an 86–92% parameter reduction with accuracy comparable or superior to that of the large teachers across ten benchmarks, and its adaptable embedding sizes enable efficient downstream use, including robust patch retrieval and k-NN classification. Limitations include training data from a single institution and residual teacher biases, suggesting future work on broader data sources and more diverse teacher ensembles to improve generalization and clinical readiness.

Abstract

Pathology foundation models (FMs) have driven significant progress in computational pathology. However, these high-performing models can easily exceed a billion parameters and produce high-dimensional embeddings, thus limiting their applicability for research or clinical use when computing resources are tight. Here, we introduce Pathryoshka, a multi-teacher distillation framework inspired by RADIO distillation and Matryoshka Representation Learning to reduce pathology FM sizes while allowing for adaptable embedding dimensions. We evaluate our framework with a distilled model on ten public pathology benchmarks with varying downstream tasks. Compared to its much larger teachers, Pathryoshka reduces the model size by 86–92% at on-par performance. It outperforms state-of-the-art single-teacher distillation models of comparable size by a median margin of 7.0 in accuracy. By enabling efficient local deployment without sacrificing accuracy or representational richness, Pathryoshka democratizes access to state-of-the-art pathology FMs for the broader research and clinical community.
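
To illustrate what adaptable embedding dimensions can mean in practice, here is a minimal sketch, not the authors' code. With Matryoshka-style nested embeddings, a shorter representation is typically obtained by keeping the leading coordinates of the full embedding and re-normalizing. The function names `truncate_embeddings` and `knn_predict`, the random stand-in data, and the use of cosine-similarity k-NN with majority voting are assumptions made for illustration.

```python
import numpy as np

def truncate_embeddings(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of each row and L2-normalize the result."""
    sub = emb[:, :dim]
    norms = np.linalg.norm(sub, axis=1, keepdims=True)
    return sub / np.clip(norms, 1e-12, None)

def knn_predict(train_emb, train_labels, query_emb, k=5):
    """Cosine-similarity k-NN with majority vote; inputs are assumed L2-normalized."""
    sims = query_emb @ train_emb.T                # (n_query, n_train) similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]      # indices of the k nearest neighbors
    votes = train_labels[top_k]                   # (n_query, k) neighbor labels
    return np.array([np.bincount(row).argmax() for row in votes])

# Random stand-in for 768-dimensional tile embeddings (hypothetical data).
rng = np.random.default_rng(0)
full = rng.normal(size=(100, 768))
labels = rng.integers(0, 3, size=100)

for dim in (768, 384, 192):
    sub = truncate_embeddings(full, dim)
    preds = knn_predict(sub[:80], labels[:80], sub[80:], k=5)
    print(dim, sub.shape, preds.shape)
```

Because the same stored vectors serve every embedding size, a downstream user could pick the dimension that fits their memory or latency budget without re-running the encoder.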

Paper Structure

This paper contains 24 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Although smaller in parameter count (see legend), our model outperforms or is on par with state-of-the-art pathology foundation models on various benchmarks. If a model family is weak in one area (here, H-optimus-1 in breast cancer histopathology image analysis), a single-teacher distillation model of the same family (H0-mini) inherits the same weakness.
  • Figure 2: Overview of the proposed Pathryoshka, a multi-teacher distillation framework. The student model and distillation heads are jointly trained while teacher models remain frozen. Input images are augmented and processed by both the student and all teachers. The CLS and spatial feature heads align student representations with those of the teachers. Each head includes multiple MLP projection layers that form nested embeddings by minimizing a similarity loss between student and teacher projections.
  • Figure 3: A random crop is obtained from the input image. This aligned crop is color-augmented individually for the student and for each teacher. We compute the CLS and patch losses on this aligned crop to ensure that the patch tokens are aligned across all models. Additionally, each model receives a second, individual random crop, which aims to improve magnification invariance. We compute only the CLS loss on this non-aligned crop.
  • Figure 4: Comparison of k-NN classification accuracy between Pathryoshka-B, the best-performing teacher model UNI2-h, and the baseline H0-mini, across three multiclass classification benchmarks. Each point corresponds to a subembedding of decreasing dimensionality (e.g., “$\frac{\text{dim}}{2}$” denotes using only half of the full embedding size). Colors indicate different datasets. Line types denote the respective models. As a result of the nested embedding structure, our model maintains high accuracy even when using smaller subembeddings. Note that UNI2-h uses 1536-dimensional embeddings, while our model and H0-mini use 768-dimensional embeddings.
  • Figure 5: Example patch embedding visualization. While both models create an interpretable embedding representation, our model retains the semantic representation throughout the truncation process. The different cells remain distinguishable at higher compression. The colors in the visualization correspond to the first three principal components obtained through PCA applied to the high-dimensional embeddings. Each RGB channel represents a principal axis capturing the most significant variance in the embedding space.
  • ...and 2 more figures
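
The nested distillation heads described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative forward pass only, under several assumptions that are not taken from the paper: each nested prefix of the student CLS embedding gets its own projection to the teacher dimension, the projection is a single linear map rather than an MLP, the similarity loss is cosine distance, and the nested losses are averaged with equal weight. The nested sizes (96, 192, 384, 768) and all names are hypothetical; only the 768-dimensional student and 1536-dimensional UNI2-h embeddings come from the captions above.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean (1 - cosine similarity) over matching rows of a and b."""
    a = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

def nested_distillation_loss(student_cls, teacher_cls, projections):
    """Average cosine distance over nested prefixes of the student embedding.

    projections maps a nested size d to a weight matrix of shape
    (d, teacher_dim); a linear map stands in for the MLP heads (assumption).
    """
    losses = []
    for d, w in projections.items():
        projected = student_cls[:, :d] @ w        # use only the first d coordinates
        losses.append(cosine_distance(projected, teacher_cls))
    return sum(losses) / len(losses)

# Random stand-ins: a batch of 4 student and teacher CLS embeddings.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 768))
teacher = rng.normal(size=(4, 1536))
heads = {d: rng.normal(size=(d, 1536)) * 0.02 for d in (96, 192, 384, 768)}

print(nested_distillation_loss(student, teacher, heads))
```

In a multi-teacher setup, one such set of heads per frozen teacher would be trained jointly with the student, and the per-teacher losses combined; how they are weighted is not specified in this excerpt.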