Pathryoshka: Compressing Pathology Foundation Models via Multi-Teacher Knowledge Distillation with Nested Embeddings

Christian Grashei; Christian Brechenmacher; Rao Muhammad Umer; Jingsong Liu; Carsten Marr; Ewa Szczurek; Peter J. Schüffler

TL;DR

Pathryoshka addresses the deployment barriers of pathology foundation models (FMs) by combining multi-teacher knowledge distillation with Matryoshka-style nested embeddings, producing compact ViT students that retain high performance. It uses a large unlabeled tile dataset and a cropping-based augmentation strategy to fuse signals from three strong pathology FMs (Virchow2, UNI2-h, H-optimus-1) into two students, Pathryoshka-B (87M parameters) and Pathryoshka-S (22M parameters). The approach reduces parameter counts by 86–92% while matching or exceeding the accuracy of its large teachers across ten benchmarks. Its adaptable embedding sizes also make downstream use more efficient, including robust patch retrieval and k-nearest-neighbor (KNN) tasks. Limitations include training data drawn from a single institution and residual teacher biases, which points to future work on broader data sources and more diverse teacher ensembles to improve generalization and clinical readiness.
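To make the idea of adaptable embedding sizes concrete, the following is a minimal sketch of how a Matryoshka-style embedding might be shortened at inference time for patch retrieval: keep only the first `dim` coordinates, renormalize, and rank by cosine similarity. The function names (`truncate_and_normalize`, `cosine_top_k`) and the 4-dimensional toy vectors are hypothetical illustrations, not part of the Pathryoshka code or its evaluation protocol.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit L2 norm."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("zero-norm prefix cannot be normalized")
    return [x / norm for x in prefix]


def cosine_top_k(query, gallery, dim, k=1):
    """Return indices of the k gallery vectors closest to the query,
    comparing only the first `dim` coordinates of every vector."""
    q = truncate_and_normalize(query, dim)
    scored = []
    for idx, item in enumerate(gallery):
        g = truncate_and_normalize(item, dim)
        similarity = sum(a * b for a, b in zip(q, g))
        scored.append((similarity, idx))
    # Highest similarity first; ties broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy data (assumption): three "patch" embeddings and one query.
gallery = [
    [1.0, 0.0, 0.2, 0.1],
    [0.0, 1.0, 0.1, 0.3],
    [0.7, 0.7, 0.0, 0.0],
]
query = [0.9, 0.1, 0.3, 0.0]

full_match = cosine_top_k(query, gallery, dim=4, k=1)   # full embedding
short_match = cosine_top_k(query, gallery, dim=2, k=1)  # half-size prefix
```

In this toy case the full and half-size embeddings retrieve the same nearest patch. A nested embedding is trained so that this kind of agreement holds as often as possible, which lets storage and search cost be traded against accuracy without retraining.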

Abstract

Pathology foundation models (FMs) have driven significant progress in computational pathology. However, these high-performing models can easily exceed a billion parameters and produce high-dimensional embeddings, thus limiting their applicability for research or clinical use when computing resources are tight. Here, we introduce Pathryoshka, a multi-teacher distillation framework inspired by RADIO distillation and Matryoshka Representation Learning to reduce pathology FM sizes while allowing for adaptable embedding dimensions. We evaluate our framework with a distilled model on ten public pathology benchmarks with varying downstream tasks. Compared to its much larger teachers, Pathryoshka reduces the model size by 86-92% at on-par performance. It outperforms state-of-the-art single-teacher distillation models of comparable size by a median margin of 7.0 in accuracy. By enabling efficient local deployment without sacrificing accuracy or representational richness, Pathryoshka democratizes access to state-of-the-art pathology FMs for the broader research and clinical community.
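The abstract does not spell out the training objective, so the following is only a generic sketch of how multi-teacher feature distillation can be combined with nested embedding prefixes, not the loss used in the paper. It assumes one linear head per teacher and per prefix size, and averages the cosine distance between the projected student prefix and each teacher embedding. All names (`multi_teacher_nested_loss`, `heads`, `teacher_a`, `teacher_b`) and the toy values are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def project(matrix, vector):
    """Apply a linear head; each row of `matrix` yields one output coordinate."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def multi_teacher_nested_loss(student_emb, teacher_embs, heads, nested_dims):
    """Mean cosine distance over every (teacher, prefix size) pair.

    heads[name][d] maps the first d student coordinates into the
    embedding space of teacher `name` (an assumed design, see lead-in).
    """
    total, count = 0.0, 0
    for name, target in teacher_embs.items():
        for d in nested_dims:
            prediction = project(heads[name][d], student_emb[:d])
            total += cosine_distance(prediction, target)
            count += 1
    return total / count


# Toy setup (assumption): a 4-d student, two teachers with 2-d and 3-d outputs.
student_emb = [1.0, 0.0, 0.5, 0.5]
nested_dims = (2, 4)
heads = {
    "teacher_a": {2: [[1, 0], [0, 1]],
                  4: [[1, 0, 0, 0], [0, 1, 0, 0]]},
    "teacher_b": {2: [[0, 0], [1, 0], [0, 0]],
                  4: [[0, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]},
}
teacher_embs = {"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 1.0, 0.0]}

loss = multi_teacher_nested_loss(student_emb, teacher_embs, heads, nested_dims)
```

Because every prefix size contributes to the loss, the leading coordinates of the student embedding are pushed to carry most of the information from all teachers. That property is what makes truncated embeddings usable downstream.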
