ZeroSlide: Is Zero-Shot Classification Adequate for Lifelong Learning in Whole-Slide Image Analysis in the Era of Pathology Vision-Language Foundation Models?

Doanh C. Bui, Hoai Luan Pham, Vu Trung Duong Le, Tuan Hai Vu, Van Duy Tran, Yasuhiko Nakashima

TL;DR

This work studies lifelong learning for whole-slide image (WSI) analysis by comparing training-based continual-learning methods against ZeroSlide, a zero-shot approach that relies on a pathology vision-language foundation model and class prompts. WSIs are tiled, patch features are extracted with TITAN, and the features are aggregated by a learnable or pretrained slide encoder. The study evaluates both the CLASS-IL and TASK-IL settings and builds prototype-based class templates for zero-shot classification. Across six TCGA datasets, ZeroSlide is highly competitive with rehearsal-based methods and often matches or surpasses regularization-based approaches, while offering training-free inference with no online storage needs. ConSlide remains the strongest method in CLASS-IL, but ZeroSlide shows robust performance and is stable on the BWT and Forgetting metrics. The findings indicate that zero-shot lifelong learning is a viable, practical alternative in pathology, with room to improve prediction confidence and to combine class templates with continual-learning strategies for greater clinical utility.
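The summary above refers to BWT (backward transfer) and Forgetting without defining them. The sketch below assumes the definitions commonly used in the continual-learning literature, which may differ in detail from the ones the paper adopts. The accuracy matrices are made-up numbers for illustration only, not results from the paper.

```python
def backward_transfer(acc):
    """Mean change in accuracy on earlier tasks after the final step.

    acc[t][i] is the accuracy on task i measured after step t (0-indexed).
    Requires at least two tasks. Negative values indicate forgetting.
    """
    last = len(acc) - 1
    return sum(acc[last][i] - acc[i][i] for i in range(last)) / last


def forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    last = len(acc) - 1
    drops = [
        max(acc[t][i] for t in range(i, last)) - acc[last][i]
        for i in range(last)
    ]
    return sum(drops) / last


# Hypothetical accuracies for a model trained on three tasks in sequence.
trained = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.90],
]

# Hypothetical accuracies for a frozen model evaluated per task:
# once a task has been seen, its column no longer changes.
frozen = [
    [0.82, 0.00, 0.00],
    [0.82, 0.78, 0.00],
    [0.82, 0.78, 0.88],
]

print(backward_transfer(trained), forgetting(trained))
print(backward_transfer(frozen), forgetting(frozen))
```

Under these definitions, a model whose weights and class templates never change scores zero on both metrics when each task is evaluated over its own label set (the TASK-IL case). In CLASS-IL, adding classes enlarges the candidate set, so the same guarantee does not automatically hold.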

Abstract

Lifelong learning for whole slide images (WSIs) poses the challenge of training a unified model to perform multiple WSI-related tasks, such as cancer subtyping and tumor classification, in a distributed, continual fashion. This is a practical problem for clinics and hospitals because WSIs are large and demand substantial storage, processing, and transfer time, and training a new model whenever a new task is defined is time-consuming. Recent work has applied regularization- and rehearsal-based methods to this setting. However, the rise of vision-language foundation models that align diagnostic text with pathology images raises the question: are these models alone sufficient for lifelong WSI learning using zero-shot classification, or is further investigation into continual-learning strategies needed to improve performance? To our knowledge, this is the first study to compare conventional continual-learning approaches with vision-language zero-shot classification for WSIs. Our source code and experimental results will be available soon.

Paper Structure

This paper contains 13 sections, 3 figures, 1 table, and 1 algorithm.

Figures (3)

  • Figure 1: Regularization-based and rehearsal-based methods require retraining when adding tasks, while zero-shot classification with a pathology vision-language model only needs new class templates, making it training-free. This study compares the performance of lifelong learning with training-free zero-shot classification to training-based continual learning methods.
  • Figure 2: Distribution of the sequence of six TCGA datasets.
  • Figure 3: Confidence scores assigned to the target cancer subtype labels by ZeroSlide and all continual-learning-based models after training/inference on the final tasks.
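
The Figure 1 caption states that zero-shot classification only needs new class templates when a task is added. The toy sketch below illustrates that idea with cosine similarity between a slide embedding and per-class prototypes. It is not the authors' implementation: the `ZeroShotBank` class, the task and class names, and the 3-dimensional vectors are hypothetical stand-ins for the text and slide embeddings a pathology vision-language model would produce, and averaging normalized prompt embeddings into a prototype is an assumed, CLIP-style choice.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


class ZeroShotBank:
    """Hypothetical store of class prototypes, keyed by (task, class)."""

    def __init__(self):
        self.prototypes = {}

    def add_task(self, task, prompt_embeddings):
        # prompt_embeddings: {class_name: [embedding of each prompt, ...]}
        # Adding a task only writes new prototypes; nothing is retrained
        # and existing prototypes are left untouched.
        for name, embs in prompt_embeddings.items():
            embs = [l2_normalize(e) for e in embs]
            dim = len(embs[0])
            mean = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
            self.prototypes[(task, name)] = l2_normalize(mean)

    def predict(self, slide_embedding, task=None):
        # task=None: CLASS-IL, choose among every class seen so far.
        # task given: TASK-IL, choose only among that task's classes.
        candidates = {
            key: proto
            for key, proto in self.prototypes.items()
            if task is None or key[0] == task
        }
        return max(candidates, key=lambda k: cosine(slide_embedding, candidates[k]))


bank = ZeroShotBank()
bank.add_task("task1", {"A": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], "B": [[0.0, 1.0, 0.0]]})
bank.add_task("task2", {"C": [[0.0, 0.0, 1.0]], "D": [[0.7, 0.0, 0.7]]})

slide = [0.8, 0.0, 0.6]  # toy slide embedding
print(bank.predict(slide))                # CLASS-IL prediction
print(bank.predict(slide, task="task1"))  # TASK-IL prediction
```

In this toy setup the CLASS-IL prediction lands on a class from the second task, while restricting to the first task's label set picks class A. That shows why CLASS-IL is the harder setting: every class added so far competes for the same slide.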