PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization
Honglin Li, Zhongyi Shui, Yunlong Zhang, Chenglu Zhu, Lin Yang
TL;DR
PathVQ tackles the information loss inherent in CLS-based tile representations for whole-slide image analysis by introducing vector quantization distillation of patch tokens. The framework combines patch-level VQ with a Multi-Scale VQ (MSVQ) that unifies patch- and tile-level features and provides an offline tokenizer for slide-level self-supervised learning, enabling stable pretraining and efficient WSI modeling. Through VQ pretraining and MSVQ-based SSL, PathVQ achieves state-of-the-art performance on multiple downstream tasks, including tile/ROI classification, WSI tumor classification, and survival prediction, while substantially reducing the token storage and computation required to use all patch tokens (token dimensionality drops from 1024 to 16). The approach demonstrates that preserving spatial information via quantized tokens, rather than discarding it with CLS, yields more discriminative WSI representations and practical gains for cancer diagnosis and prognosis.
Abstract
Computational pathology and whole-slide image (WSI) analysis are pivotal in cancer diagnosis and prognosis. However, the ultra-high resolution of WSIs presents significant modeling challenges. Recent advances in pathology foundation models have improved performance, yet most approaches use the [CLS] token representation of the tile-level ViT as the slide-level input (we refer to a 16x16-pixel region as a patch and a 224x224-pixel region as a tile). This discards critical spatial details carried by the patch tokens, limiting downstream WSI analysis tasks. We find that leveraging all spatial patch tokens benefits WSI analysis but incurs nearly 200x higher storage and training costs (e.g., 196 patch tokens per tile in ViT$_{224}$). To address this, we introduce vector quantized (VQ) distillation on patch features, which efficiently compresses spatial patch tokens using discrete indices and a decoder. Our method reduces token dimensionality from 1024 to 16, achieving a 64x compression rate while preserving reconstruction fidelity. Furthermore, we employ a multi-scale VQ (MSVQ) strategy, which not only enhances VQ reconstruction performance but also provides self-supervised learning (SSL) supervision targets for a seamless slide-level pretraining objective. Building on the quantized patch features and the tile-level supervision targets provided by MSVQ, we develop a progressive convolutional module and slide-level SSL to extract representations with rich spatial information for downstream WSI tasks. Extensive evaluations on multiple datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance in WSI analysis. Code will be available soon.
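
The compression step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random `encoder` and `decoder` matrices stand in for learned networks, `codebook_size = 512` is a hypothetical value not given in the abstract, and only the figures 196, 1024, and 16 come from the text. The sketch shows how each 1024-dimensional patch token could be mapped to a 16-dimensional latent, replaced by the index of its nearest codebook vector, and later decoded from that index.

```python
import numpy as np


def quantize(latents, codebook):
    """Return the index of the nearest codebook vector for each latent."""
    # Squared Euclidean distance between every latent and every code,
    # expanded as |x|^2 - 2 x.c + |c|^2 to avoid a 3-D intermediate array.
    distances = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return distances.argmin(axis=1)


rng = np.random.default_rng(0)

# Figures taken from the abstract: 196 patch tokens per 224x224 tile,
# token dimensionality reduced from 1024 to 16.
num_patches, token_dim, latent_dim = 196, 1024, 16
codebook_size = 512  # hypothetical; the abstract does not state a codebook size

patch_tokens = rng.standard_normal((num_patches, token_dim))

# Random linear maps stand in for a learned encoder and decoder (assumption).
encoder = rng.standard_normal((token_dim, latent_dim)) / np.sqrt(token_dim)
decoder = rng.standard_normal((latent_dim, token_dim)) / np.sqrt(latent_dim)
codebook = rng.standard_normal((codebook_size, latent_dim))

latents = patch_tokens @ encoder        # shape (196, 16)
indices = quantize(latents, codebook)   # shape (196,), one integer per patch
quantized = codebook[indices]           # shape (196, 16), looked up from indices
reconstructed = quantized @ decoder     # shape (196, 1024)

# Ratio of original to compressed token dimensionality.
compression = token_dim // latent_dim
```

In this sketch only `indices` would need to be stored per tile, since `quantized` is recoverable from the codebook; with untrained random weights the reconstruction is of course meaningless, and the point is only the data flow and the 1024-to-16 dimensionality ratio.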
