PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization

Honglin Li, Zhongyi Shui, Yunlong Zhang, Chenglu Zhu, Lin Yang

TL;DR

PathVQ tackles the information loss inherent in CLS-based tile representations for whole-slide image analysis by introducing vector quantization distillation of patch tokens. The framework combines patch-level VQ with a Multi-Scale VQ (MSVQ) that unifies patch- and tile-level features and provides an offline tokenizer for slide-level self-supervised learning, enabling stable pretraining and efficient WSI modeling. Through VQ pretraining and MSVQ-based SSL, PathVQ achieves state-of-the-art performance on multiple downstream tasks, including tile/ROI classification, WSI tumor classification, and survival prediction, while dramatically reducing token storage and computation. The approach demonstrates that preserving spatial information via quantized tokens, rather than discarding it with CLS, yields more discriminative WSI representations and practical gains for cancer diagnosis and prognosis.

Abstract

Computational pathology and whole-slide image (WSI) analysis are pivotal in cancer diagnosis and prognosis. However, the ultra-high resolution of WSIs presents significant modeling challenges. Recent advances in pathology foundation models have improved performance, yet most approaches rely on the [CLS] token representation of a tile-level ViT as the slide-level input (we refer to a 16x16-pixel region as a patch and a 224x224-pixel region as a tile). This discards critical spatial details from patch tokens, limiting downstream WSI analysis tasks. We find that leveraging all spatial patch tokens benefits WSI analysis but incurs nearly 200x higher storage and training costs (e.g., 196 tokens in ViT$_{224}$). To address this, we introduce vector quantized (VQ) distillation on patch features, which efficiently compresses spatial patch tokens using discrete indices and a decoder. Our method reduces token dimensionality from 1024 to 16, achieving a 64x compression rate while preserving reconstruction fidelity. Furthermore, we employ a multi-scale VQ (MSVQ) strategy, which not only enhances VQ reconstruction performance but also serves as self-supervised learning (SSL) supervision for a seamless slide-level pretraining objective. Built upon the quantized patch features and the tile-level supervision targets from MSVQ, we develop a progressive convolutional module and slide-level SSL to extract representations with rich spatial information for downstream WSI tasks. Extensive evaluations on multiple datasets demonstrate the effectiveness of our approach, achieving state-of-the-art performance in WSI analysis. Code will be available soon.
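To make the numbers in the abstract concrete, the following is a minimal sketch of the core vector quantization step: each token is replaced by the index of its nearest codebook entry, and a lookup recovers an approximation. It is an illustration only, not the authors' implementation. The learned encoder that projects 1024-dimensional tokens to 16 dimensions and the decoder that reconstructs them are omitted, and the codebook size, the random codebook, and the function names `quantize` and `dequantize` are assumptions made for this example.

```python
import random


def quantize(tokens, codebook):
    """Map each token to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for tok in tokens:
        dists = [sum((t - c) ** 2 for t, c in zip(tok, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Recover approximate tokens by looking the indices up in the codebook."""
    return [codebook[i] for i in indices]


rng = random.Random(0)
DIM = 16           # compressed token dimensionality, as stated in the abstract
NUM_TOKENS = 196   # patch tokens per 224x224 tile with 16x16 patches
CODEBOOK_SIZE = 8  # hypothetical; kept tiny for illustration

# Hypothetical codebook; in practice it would be learned during VQ pretraining.
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]

# Synthetic tokens: each one is a codebook entry plus a little noise.
tokens = [
    [c + rng.gauss(0, 0.01) for c in codebook[i % CODEBOOK_SIZE]]
    for i in range(NUM_TOKENS)
]

indices = quantize(tokens, codebook)   # one small integer per patch token
recon = dequantize(indices, codebook)  # approximate 16-dimensional tokens

# Arithmetic behind the figures quoted in the abstract.
tokens_per_tile = (224 // 16) ** 2     # 14 * 14 patch tokens per tile
dim_ratio = 1024 // 16                 # reduction in token dimensionality
print(tokens_per_tile, dim_ratio)
```

Storing `indices` instead of full feature vectors is what makes keeping all 196 patch tokens per tile affordable: only one integer per token plus a shared codebook needs to be kept.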

Paper Structure

This paper contains 28 sections, 7 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Evaluation on information loss via reconstruction training. Directly using the [CLS] token results in significant information loss, making it difficult to reconstruct all patch tokens, potentially discarding critical details for downstream tasks. In contrast, vector quantization retains more original information.
  • Figure 2: Overview of the proposed framework. (a) The pipeline for compressing spatial patch tokens using vector quantization (VQ) and multi-scale VQ (MSVQ). (b) Slide-level self-supervised learning (SSL) using MSVQ-generated tokenizers. (c) Downstream WSI task fine-tuning with compressed patch features.
  • Figure 3: Multi-scale Vector Quantization (MSVQ) visualization. With MSVQ, tile- and patch-level features can be quantized simultaneously, for slide-level pretraining and feature compression, respectively. The region data can be used to pretrain ABMIL via token index frequency matching.
  • Figure 4: ROI classification. Fine-tuning (FT) the last block of the UNI ViT further improves downstream results. PathVQ, by compressing and reconstructing the spatial patch tokens, achieves a comparable improvement. UNI-2, however, does not show a consistent improvement compared with FT and PathVQ.
  • Figure 5: PathVQ features with Convs need pretraining and alignment to the tile's level-0 feature to attain a good feature space and compression. RD: randomly initialized Convs. FT: fine-tuned during WSI analysis. PT: pretrained during VQ learning (aligned to the level-0 tile feature). FZ: frozen during WSI analysis.
  • ...and 4 more figures
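
The caption of Figure 3 mentions pretraining via token index frequency matching. One plausible reading, sketched below, is that the tile-level token indices within a region are summarized as a normalized histogram over the codebook, and a model is trained to predict that histogram. The captions do not specify the actual target construction or loss, so the function names `index_frequency` and `cross_entropy`, the codebook size, and the choice of cross-entropy as the matching loss are all assumptions made for illustration.

```python
import math
from collections import Counter


def index_frequency(indices, codebook_size):
    """Normalized histogram of token indices over the codebook."""
    counts = Counter(indices)
    total = len(indices)
    return [counts.get(k, 0) / total for k in range(codebook_size)]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution (assumed loss)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Hypothetical tile-level token indices for one region, with a codebook of size 4.
region_indices = [0, 2, 2, 3, 2, 0, 1, 2]
target = index_frequency(region_indices, codebook_size=4)
print(target)  # [0.25, 0.125, 0.5, 0.125]

# A prediction that matches the target scores a lower loss than a uniform guess.
uniform = [0.25, 0.25, 0.25, 0.25]
print(cross_entropy(target, target) < cross_entropy(target, uniform))  # True
```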