MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Doanh C. Bui, Ba Hung Ngo, Hoai Luan Pham, Khang Nguyen, Maï K. Nguyen, Yasuhiko Nakashima

TL;DR

MergeSlide tackles lifelong learning on gigapixel WSIs by reframing continual learning as a model merging problem built on a vision-language pathology foundation model. Each new cancer task is defined with class-aware prompts, fine-tuned offline for a few epochs with an MLP-free backbone, and then merged into a single unified model via an orthogonal continual merging strategy that preserves prior knowledge while incorporating new information. A key contribution is Task-to-Class Prompt-aligned (TCP) inference, which handles the class-incremental (CLASS-IL) setting, where task identity is unknown, by first identifying the most relevant task through task-level prompts and then applying that task's class-aware prompts for classification. Evaluations on six TCGA cohorts show MergeSlide outperforming rehearsal-based and vision-language zero-shot baselines under both CLASS-IL and task-incremental (TASK-IL) settings, with robust performance under varying task orders and domain shifts. Because the approach neither stores raw data nor retrains models from scratch, it offers a data-efficient path to continually expanding WSI capabilities, which suits clinical deployment and cross-institution collaboration.

Abstract

Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.
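The two-stage TCP inference described in the abstract can be illustrated with a minimal sketch. It assumes that slide and prompt embeddings live in a shared space and are compared by cosine similarity, which is typical for vision-language models but is not confirmed here. The function names, the toy 3-dimensional embeddings, and the class labels are hypothetical placeholders, not the authors' code or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tcp_predict(slide_emb, task_prompts, class_prompts):
    """Two-stage prediction: pick a task first, then a class within it.

    task_prompts:  {task_name: task-level prompt embedding}
    class_prompts: {task_name: {class_name: class-aware prompt embedding}}
    (Both layouts are assumptions made for this sketch.)
    """
    # Stage 1: choose the task whose task-level prompt best matches the slide.
    task = max(task_prompts, key=lambda t: cosine(slide_emb, task_prompts[t]))
    # Stage 2: score the slide only against that task's class-aware prompts.
    candidates = class_prompts[task]
    label = max(candidates, key=lambda c: cosine(slide_emb, candidates[c]))
    return task, label


# Toy embeddings, invented purely for illustration.
task_prompts = {
    "TCGA-BRCA": [1.0, 0.0, 0.0],
    "TCGA-RCC": [0.0, 1.0, 0.0],
}
class_prompts = {
    "TCGA-BRCA": {"brca_class_0": [0.9, 0.0, 0.4], "brca_class_1": [0.9, 0.0, -0.4]},
    "TCGA-RCC": {"rcc_class_0": [0.0, 0.9, 0.4], "rcc_class_1": [0.0, 0.9, -0.4]},
}
slide_emb = [0.1, 0.8, 0.5]

prediction = tcp_predict(slide_emb, task_prompts, class_prompts)
print(prediction)
```

Restricting the second stage to one task's prompts means that classes from different tasks never compete directly. In this sketch, that is how a class label is obtained without knowing the task identity in advance.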

Paper Structure

This paper contains 11 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) Rehearsal-based methods; (b) Task-specific fine-tuning and merging in MergeSlide. MergeSlide first trains an MLP-free model offline using frozen class-aware prompt embeddings; the resulting weights are then merged task by task.
  • Figure 2: Performance comparison of MergeSlide with other continual learning methods. (a) Accuracy as tasks accumulate. (b) Accuracy vs. Forgetting. (c) Accuracy vs. Backward Transfer, both after the final task.
  • Figure 3: Illustration of MergeSlide. For each new task $t$, class-aware prompt embeddings $E^{\mathcal{C},t}$ are created. An MLP-free fine-tuning step (Sec. \ref{sec:mlp-free}) initializes the slide aggregator $f_{\mathcal{A}}$ from base weights $\theta_{base}$, yielding task-specific weights $\theta_t$. These are then fused via online continual model merging (Sec. \ref{sec:cmm}) into unified weights $\tilde{\theta}_{1:t}$ capable of handling all tasks up to $t$. Finally, Task-to-Class Prompt-aligned inference (Sec. \ref{sec:pai}) identifies the relevant task using task-level prompts and predicts based on class-aware embeddings.
  • Figure 4: Performance drop comparisons across different methods when new tasks are added on TCGA-BRCA and TCGA-RCC.
  • Figure 5: t-SNE (van der Maaten and Hinton, 2008) visualization of slide embeddings from MergeSlide and three comparative methods, organized by task and class spaces. The dashed green box highlights MergeSlide’s superior clustering, compared to the regions marked by dashed red boxes in other methods.
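The merging step in the Figure 3 caption, where task-specific weights $\theta_t$ are fused into unified weights $\tilde{\theta}_{1:t}$, can be illustrated with a rough sketch. The exact orthogonal merging rule is not given in this summary, so the code below substitutes a simple Gram-Schmidt-style projection on flattened task vectors ($\theta_t - \theta_{base}$). That rule, the function names, and the toy weights are all assumptions, and the paper's actual rule may differ, for example by operating per layer on weight matrices.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def merge_orthogonal(theta_base, theta_merged, theta_new):
    """Fold a new task's weights into the running merged weights.

    All arguments are flat lists of floats of equal length.
    ASSUMPTION: plain vector projection, not the authors' exact rule.
    """
    # Task vectors: displacement of each model from the shared base weights.
    tau_merged = [m - b for m, b in zip(theta_merged, theta_base)]
    tau_new = [n - b for n, b in zip(theta_new, theta_base)]
    # Remove from the new task vector its component along the merged one,
    # so the update does not push along directions already in use.
    denom = dot(tau_merged, tau_merged)
    coef = dot(tau_new, tau_merged) / denom if denom > 0 else 0.0
    tau_orth = [n - coef * m for n, m in zip(tau_new, tau_merged)]
    # Unified weights: base + previous knowledge + orthogonalized update.
    return [b + m + o for b, m, o in zip(theta_base, tau_merged, tau_orth)]


# Toy weights, invented purely for illustration.
theta_base = [0.0, 0.0, 0.0]
theta_1 = [1.0, 0.0, 0.0]   # weights after fine-tuning on task 1
theta_2 = [1.0, 2.0, 0.0]   # weights after fine-tuning on task 2

merged = theta_1                                   # after the first task
merged = merge_orthogonal(theta_base, merged, theta_2)
print(merged)
```

In this toy case the part of the second task vector that overlaps with the first is discarded and only the orthogonal remainder is added. This is one simple way to limit interference with earlier tasks.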