An efficient framework based on large foundation model for cervical cytopathology whole slide image screening
Jialong Huang, Gaojie Li, Shichao Kan, Jianfeng Liu, Yixiong Liang
TL;DR
The paper tackles cervical cytopathology WSI screening using only WSI-level labels, removing the need for lesion-level annotations. It introduces a two-stage framework: first, a mean-pooling-based patch filter uses a frozen foundation model to select the top-$k$ high-risk patches; second, a parameter-efficient fine-tuning (PEFT) step trains a linear adapter with contrastive learning to adapt the foundation model to cervical imagery. The resulting patch representations are fed into embedding-based MIL, achieving state-of-the-art results on the CSD dataset and strong performance on FNAC 2019, while significantly reducing training time and memory usage. This approach offers a scalable, detection-free pathway for WSI screening and can potentially extend to broader histopathology tasks, although challenges in interpretability and complexity remain.
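The first stage can be illustrated with a minimal sketch. It assumes one plausible reading of the patch filter: a linear head (hypothetical `weights` and `bias`) is fit on mean-pooled slide features and then applied to each patch to produce a risk score. The function names, the sigmoid scoring, and the toy features are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def mean_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Average patch features into one slide-level vector."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def risk_score(feature: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Sigmoid of a linear score; the linear head is an assumed stand-in."""
    logit = sum(w * x for w, x in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def select_top_k(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    k: int,
) -> List[int]:
    """Return indices of the k patches with the highest risk scores."""
    scores = [risk_score(f, weights, bias) for f in features]
    order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy frozen features for four patches (made-up values for illustration).
patches = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
top_patches = select_top_k(patches, weights=[1.0, 0.0], bias=0.0, k=2)
```

Because a linear head commutes with averaging, scoring individual patches with a head trained on mean-pooled features is a natural way to rank them, which is why this reading is used in the sketch.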
Abstract
Current cervical cytopathology whole slide image (WSI) screening primarily relies on detection-based approaches, which are limited in performance due to the expensive and time-consuming annotation process. Multiple Instance Learning (MIL), a weakly supervised approach that relies solely on bag-level labels, can effectively alleviate these challenges. Nonetheless, MIL commonly employs frozen pretrained models or self-supervised learning for feature extraction, which suffer from low efficacy or inefficiency, respectively. In this paper, we propose an efficient framework for cervical cytopathology WSI classification using only WSI-level labels through unsupervised and weakly supervised learning. Given the sparse and dispersed nature of abnormal cells within cytopathological WSIs, we propose a strategy that leverages the pretrained foundation model to filter the top-$k$ high-risk patches. Subsequently, we suggest parameter-efficient fine-tuning (PEFT) of a large foundation model using contrastive learning on the filtered patches to enhance its representation ability for task-specific signals. By training only the added linear adapters, we enhance the learning of patch-level features with substantially reduced time and memory consumption. Experiments conducted on the CSD and FNAC 2019 datasets demonstrate that the proposed method enhances the performance of various MIL methods and achieves state-of-the-art (SOTA) performance. The code and trained models are publicly available at https://github.com/CVIU-CSU/TCT-InfoNCE.
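The contrastive fine-tuning step can be sketched as follows. The abstract does not name the loss; the repository name suggests InfoNCE, so that is assumed here. The adapter form (a single linear map on frozen features), the temperature value, and the pairing of two views per patch are likewise illustrative assumptions rather than the released implementation.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def linear_adapter(
    feature: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Hypothetical adapter: the only trainable map, applied to frozen features."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]


def info_nce(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE loss: positives[i] is the positive for anchors[i];
    every other positives[j] in the batch serves as a negative."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Two views of two patches after the adapter (toy values for illustration).
identity = [[1.0, 0.0], [0.0, 1.0]]
views_a = [linear_adapter(f, identity, [0.0, 0.0]) for f in ([1.0, 0.0], [0.0, 1.0])]
aligned_loss = info_nce(views_a, views_a)
mismatched_loss = info_nce(views_a, list(reversed(views_a)))
```

In this toy setup the loss is small when each anchor matches its own positive and large when the pairs are swapped, which is the signal that would drive the adapter weights during training while the foundation model stays frozen.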
