Embracing Massive Medical Data

Yu-Cheng Chou, Zongwei Zhou, Alan Yuille

TL;DR

This work tackles the challenge of training AI on massive, streaming, partially labeled medical data without revisiting old data while avoiding catastrophic forgetting. It introduces an online learning framework with three memory mechanisms: Linear Memory, which stores recent samples; Dynamic Memory, which deduplicates the stored samples; and Selective Memory, which prioritizes high-uncertainty samples using entropy-based weighting. Empirical results on a single-site abdominal CT dataset and in a sequential, multi-dataset setting demonstrate data efficiency comparable to multi-pass baselines, notable mitigation of forgetting, and gains of several Dice points, especially for small structures. The approach offers a practical path toward continual, resource-efficient learning in clinical environments and sets the stage for future work on annotation-free and fine-grained continual learning for medical imaging.

Abstract

As massive medical data become available with an increasing number of scans, expanding classes, and varying sources, prevalent training paradigms -- where AI is trained with multiple passes over fixed, finite datasets -- face significant challenges. First, training AI all at once on such massive data is impractical as new scans/sources/classes continuously arrive. Second, training AI continuously on new scans/sources/classes can lead to catastrophic forgetting, where AI forgets old data as it learns new data, and vice versa. To address these two challenges, we propose an online learning method that enables training AI from massive medical data. Instead of repeatedly training AI on randomly selected data samples, our method identifies the most significant samples for the current AI model based on their data uniqueness and prediction uncertainty, then trains the AI on these selected samples. Compared with prevalent training paradigms, our method not only improves data efficiency by enabling training on continual data streams, but also mitigates catastrophic forgetting by selectively training AI on significant data samples that might otherwise be forgotten, outperforming them by 15% in Dice score for multi-organ and tumor segmentation. The code is available at https://github.com/MrGiovanni/OnlineLearning
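
The selection idea in the abstract (keep unique samples, then train on the most uncertain ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the sample layout (`feature`, `probs`), the cosine-similarity test for uniqueness, the `max_similarity` threshold, the drop-oldest eviction rule, and the top-k selection are all assumptions made for illustration; the released code may differ on each point.

```python
import math

def prediction_entropy(probs):
    """Mean Shannon entropy (in nats) over per-voxel class-probability vectors."""
    per_voxel = [-sum(p * math.log(p) for p in voxel if p > 0.0) for voxel in probs]
    return sum(per_voxel) / len(per_voxel)

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def add_if_unique(memory, sample, capacity, max_similarity=0.95):
    """Hypothetical uniqueness rule: skip near-duplicates of stored samples.

    When the memory is full, the oldest stored sample is dropped (an assumed
    eviction policy, chosen only to keep the sketch short).
    """
    if any(cosine(sample["feature"], m["feature"]) > max_similarity for m in memory):
        return False
    if len(memory) >= capacity:
        memory.pop(0)
    memory.append(sample)
    return True

def select_uncertain(memory, k):
    """Hypothetical selection rule: the k stored samples with highest entropy."""
    return sorted(memory, key=lambda m: prediction_entropy(m["probs"]), reverse=True)[:k]

# Toy stream: two distinct samples and one near-duplicate of the first.
memory = []
a = {"id": "a", "feature": [1.0, 0.0], "probs": [[0.5, 0.5]]}
a_dup = {"id": "a2", "feature": [0.999, 0.01], "probs": [[0.5, 0.5]]}
b = {"id": "b", "feature": [0.0, 1.0], "probs": [[0.99, 0.01]]}
stored = [add_if_unique(memory, s, capacity=2) for s in (a, a_dup, b)]
chosen = [m["id"] for m in select_uncertain(memory, k=1)]
```

In this toy stream the near-duplicate is rejected, and the sample whose prediction is closest to uniform is the one picked for training, which is the behaviour the abstract attributes to uniqueness- and uncertainty-based selection.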

Paper Structure (13 sections, 2 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Different Training Methods. Linear memory stores only a few recent samples, causing significant forgetting. Dynamic memory adapts to varying data distributions by retaining unique samples, while selective memory further identifies and selects challenging samples, including those that might be duplicated, ensuring they are not missed by dynamic memory.
  • Figure 2: Catastrophic Forgetting. To evaluate forgetting, we calculate the relative Dice drop after training on the incoming sub-datasets. Both dynamic memory (DM) and selective memory (SM) store samples from previous sub-datasets, thereby alleviating the forgetting observed with linear memory (LM).
  • Figure 3: Diverse Memory. We visualize the memory to demonstrate the diversity of stored samples from previous sub-datasets $D_d$. Both DM and SM can retain samples from previous sub-datasets. SM can further identify samples with higher uncertainty.
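
The forgetting measure named in the Figure 2 caption can be sketched as follows. The caption does not give the formula, so the definition used here, (Dice before - Dice after) / Dice before, is an assumption, as are the function names and the flat binary-mask representation.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 2 * intersection / total if total else 1.0

def relative_dice_drop(dice_before, dice_after):
    """Assumed definition: fraction of the earlier Dice lost after further training."""
    if dice_before <= 0:
        raise ValueError("dice_before must be positive")
    return (dice_before - dice_after) / dice_before

# Example: Dice on an earlier sub-dataset falls from 0.8 to 0.6
# after the model trains on later sub-datasets.
drop = relative_dice_drop(0.8, 0.6)
```

Under this definition a drop of 0 means nothing was forgotten, and larger values mean more of the earlier performance was lost; a negative value would indicate that performance on the earlier sub-dataset improved.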