MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images

Doanh C. Bui; Ba Hung Ngo; Hoai Luan Pham; Khang Nguyen; Maï K. Nguyen; Yasuhiko Nakashima

TL;DR

MergeSlide tackles lifelong learning on gigapixel WSIs by reframing continual learning as offline merging of task-specific models built on a vision-language pathology foundation model. Each new cancer task is fine-tuned with an MLP-free backbone on class-aware prompts and then merged into a single unified model via an orthogonal projection strategy that preserves prior knowledge while incorporating new information. A key contribution is Task-to-Class Prompt-aligned (TCP) inference, which handles the class-incremental (CLASS-IL) setting by first identifying the most relevant task through task-level prompts and then applying that task's class-level prompts for classification. Evaluations on six TCGA datasets show MergeSlide outperforming rehearsal-based and vision-language zero-shot baselines under both CLASS-IL and TASK-IL, with robust performance under varying task orders and domain shifts. Because the approach extends a WSI model to new tasks without storing raw data or retraining from scratch, it offers a scalable, privacy-preserving route to lifelong learning that suits clinical deployment and cross-institution collaboration.
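The merging step can be illustrated with a minimal sketch. The summary above does not spell out the projection, so everything here is an assumption made for illustration: weights are treated as flat lists of floats, the function name `merge_orthogonal` is hypothetical, and the rule used (remove from the new task's update its component along the update already accumulated in the merged model) is one simple reading of "orthogonal projection", not the authors' exact procedure.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def merge_orthogonal(base, merged, finetuned):
    """Fold a newly fine-tuned model into the running merged model.

    base      : weights of the shared foundation model (flattened)
    merged    : current unified weights after earlier tasks
    finetuned : weights after fine-tuning on the new task

    Assumption (illustrative only): the new task's update is projected
    onto the orthogonal complement of the accumulated update, so adding
    it does not move the model along the direction earlier tasks used.
    """
    prev_delta = [m - b for m, b in zip(merged, base)]
    new_delta = [f - b for f, b in zip(finetuned, base)]

    denom = dot(prev_delta, prev_delta)
    if denom > 0.0:
        # Subtract the component of the new update that lies along
        # the previously accumulated update.
        coeff = dot(new_delta, prev_delta) / denom
        new_delta = [n - coeff * p for n, p in zip(new_delta, prev_delta)]

    return [m + n for m, n in zip(merged, new_delta)]


# Toy example with three "parameters".
base = [0.0, 0.0, 0.0]
merged = [1.0, 0.0, 0.0]      # after an earlier task
finetuned = [0.5, 2.0, 0.0]   # new task, overlaps with the earlier update
result = merge_orthogonal(base, merged, finetuned)
```

In this toy case the overlapping part of the new update is discarded and only the orthogonal part is added. When `merged` equals `base` (the first task), the function simply returns the fine-tuned weights.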

Abstract

Lifelong learning on Whole Slide Images (WSIs) aims to train or fine-tune a unified model sequentially on cancer-related tasks, reducing the resources and effort required for data transfer and processing, especially given the gigabyte-scale size of WSIs. In this paper, we introduce MergeSlide, a simple yet effective framework that treats lifelong learning as a model merging problem by leveraging a vision-language pathology foundation model. When a new task arrives, it is: 1) defined with class-aware prompts, 2) fine-tuned for a few epochs using an MLP-free backbone, and 3) merged into a unified model using an orthogonal continual merging strategy that preserves performance and mitigates catastrophic forgetting. For inference under the class-incremental learning (CLASS-IL) setting, where task identity is unknown, we introduce Task-to-Class Prompt-aligned (TCP) inference. Specifically, TCP first identifies the most relevant task using task-level prompts and then applies the corresponding class-aware prompts to generate predictions. To evaluate MergeSlide, we conduct experiments on a stream of six TCGA datasets. The results show that MergeSlide outperforms both rehearsal-based continual learning and vision-language zero-shot baselines. Code and data are available at https://github.com/caodoanh2001/MergeSlide.
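The two-stage TCP inference described above can be sketched as follows. This is a hedged illustration, not the released implementation: it assumes the slide and every prompt have already been embedded into a shared space, scores them by cosine similarity, and uses hypothetical names throughout (`tcp_predict`, `task_a`, `class_a1`, and the toy vectors).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tcp_predict(slide_emb, task_prompts, class_prompts):
    """Two-stage prediction when the task identity is unknown.

    slide_emb     : embedding of the slide (assumed precomputed)
    task_prompts  : {task name: embedding of its task-level prompt}
    class_prompts : {task name: {class name: class-prompt embedding}}
    """
    # Stage 1: pick the task whose task-level prompt best matches the slide.
    task = max(task_prompts, key=lambda t: cosine(slide_emb, task_prompts[t]))

    # Stage 2: classify using only that task's class-aware prompts.
    candidates = class_prompts[task]
    label = max(candidates, key=lambda c: cosine(slide_emb, candidates[c]))
    return task, label


# Toy embeddings (hypothetical values, three dimensions for readability).
task_prompts = {
    "task_a": [1.0, 0.0, 0.0],
    "task_b": [0.0, 1.0, 0.0],
}
class_prompts = {
    "task_a": {"class_a1": [0.9, 0.0, 0.4], "class_a2": [0.9, 0.0, -0.4]},
    "task_b": {"class_b1": [0.0, 0.9, 0.4], "class_b2": [0.0, 0.9, -0.4]},
}
slide = [0.1, 0.8, -0.3]
prediction = tcp_predict(slide, task_prompts, class_prompts)
```

Restricting the second stage to one task's prompts keeps the class decision within a single task's label set, which is the point of routing through task-level prompts first.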
