MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning

Min Zeng, Shuang Zhou, Zaifu Zhan, Rui Zhang

Abstract

Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.

Paper Structure (7 sections, 3 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of MedCL-Bench. (a) Biomedical knowledge and datasets evolve continuously (e.g., new literature and drug-disease relations), creating realistic sequential update streams, both across institutions (where data cannot be pooled) and within an institution over time. (b) Sequential fine-tuning can overwrite previously acquired capabilities (catastrophic forgetting), whereas continual learning (CL) aims to retain prior knowledge while learning new tasks. (c) MedCL-Bench comprises ten biomedical NLP datasets grouped into five task families (QA, fact checking, relation extraction, document classification, and multi-label topic classification). (d) Benchmark workflow: a pretrained backbone is updated sequentially on a task stream under multiple task orders and evaluated on all previously seen tasks after each stage. (e) CL metrics reported in this work: overall task performance (AP), backward transfer (BWT), and forward transfer (FWT). (f) Key questions addressed: forgetting severity, method comparison, order sensitivity, compute efficiency, and scaling/backbone dependence. Icons are sourced from Flaticon.com (full attributions in Supplementary Note \ref{supply:icons}).
  • Figure 2: Order robustness and statistical reliability on MedCL-Bench (T5-base). (a) Mean final AP across eight randomized task orders with 95% bootstrap confidence intervals for the mean obtained by resampling task orders (n=8). (b) Order sensitivity measured as the standard deviation (s.d.) of final AP across orders (lower indicates stronger robustness). Together, these panels summarize both average performance and sensitivity to task-order permutations.
  • Figure 3: Order-dependent forgetting dynamics in MedCL-Bench. (a) Forgetting curves. (b) Transition shock heatmaps.
  • Figure 4: Task-family forgetting distributions by method. For each task, forgetting is measured as $\Delta = a_{t}^{\text{post}} - a_{t}^{\text{end}}$ (percentage points), i.e., the accuracy difference from immediately after learning the task to the end of the 10-task stream. Tasks are grouped into five families and distributions are aggregated across task orders. Boxes summarize the distribution over (order, task) instances and points show individual observations. We plot the clipped forgetting $\max(\Delta,0)$ to isolate performance loss without allowing gains (e.g., backward transfer) to offset forgetting.
  • Figure 5: Backbone scaling reshapes performance and efficiency trade-offs (Order 1). (a) Final AP after the 10-task stream on T5-base, Qwen-0.6B and Qwen-4B. (b) Absolute training cost of VANILLA in GPU-hours for each backbone. (c-e) AP versus relative training cost for T5-base, Qwen-0.6B and Qwen-4B, where cost is normalized by the corresponding VANILLA run on the same backbone.
  • ...and 1 more figure
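
The quantities described in the captions of Figures 2 and 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the percentile variant of the bootstrap, the use of the sample standard deviation, and all numeric values are assumptions made for illustration; the captions specify only that forgetting is $\Delta = a_{t}^{\text{post}} - a_{t}^{\text{end}}$ clipped at zero, and that the confidence interval comes from resampling the eight task orders.

```python
import random
import statistics


def clipped_forgetting(acc_post, acc_end):
    """Clipped forgetting max(a_post - a_end, 0), in percentage points.

    acc_post: score right after the task was learned.
    acc_end:  score on the same task at the end of the stream.
    Gains over the stream are clipped to zero, so they cannot offset losses.
    """
    return max(acc_post - acc_end, 0.0)


def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean (assumed variant).

    Each resample draws len(values) items with replacement, mirroring
    "resampling task orders" when values holds one final AP per order.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical final AP values, one per task order (n=8); not results from the paper.
final_ap = [61.2, 59.8, 62.5, 60.1, 58.9, 61.7, 60.4, 59.5]

mean_ap = statistics.fmean(final_ap)       # Figure 2(a): mean final AP
order_sd = statistics.stdev(final_ap)      # Figure 2(b): s.d. across orders (sample s.d. assumed)
ci_low, ci_high = bootstrap_mean_ci(final_ap)

print(f"mean AP = {mean_ap:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}], s.d. = {order_sd:.2f}")
print(clipped_forgetting(80.0, 72.5))  # a task that lost 7.5 points
print(clipped_forgetting(70.0, 75.0))  # a task that improved; clipped to 0.0
```

With only eight orders, a bootstrap interval is coarse, which is one reason to report it together with the across-order standard deviation, as Figure 2 does.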