A Medical Data-Effective Learning Benchmark for Highly Efficient Pre-training of Foundation Models

Wenxuan Yang, Weimin Tan, Yuqi Sun, Bo Yan

TL;DR

The paper tackles whether larger pre-training data is always better for medical foundation models and argues that data quality can achieve similar or better downstream performance with far less data. It introduces DataDEL, a million-scale, multi-center medical dataset; MedDEL, a data-efficient baseline; and NormDEL, a combined metric for data effectiveness to evaluate the trade-off between performance and data retention. MedDEL filters data via ViT-based embeddings and clustering to remove disruptive and redundant samples, achieving competitive downstream performance with only 5% of the pre-training data. NormDEL formalizes data efficiency as $\mathrm{DEL} = \mathrm{mIoU} \cdot e^{-\alpha R}$ and $\mathrm{NormDEL} = 1 / (1 + e^{-\mathrm{DEL}})$, providing a single objective to compare data-use efficiency across tasks.
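As a rough illustration of the two formulas above, the following Python sketch evaluates DEL and NormDEL for a given mIoU and retention value. It assumes that $R$ is the fraction of pre-training data retained (between 0 and 1) and that mIoU is expressed on a 0 to 1 scale; neither convention is stated in this summary. The default `alpha=1.0` and the function names are placeholders chosen for the example, not values taken from the paper.

```python
import math


def del_score(miou: float, retention: float, alpha: float = 1.0) -> float:
    """DEL = mIoU * exp(-alpha * R).

    `retention` is assumed to be the retained fraction of pre-training
    data in [0, 1]; `alpha` is an assumed placeholder weight.
    """
    return miou * math.exp(-alpha * retention)


def norm_del(miou: float, retention: float, alpha: float = 1.0) -> float:
    """NormDEL = 1 / (1 + exp(-DEL)), a sigmoid applied to DEL."""
    return 1.0 / (1.0 + math.exp(-del_score(miou, retention, alpha)))


if __name__ == "__main__":
    # Hypothetical numbers: the same mIoU reached with 5% vs 100% of the data.
    for r in (0.05, 1.0):
        print(f"R={r:.2f}  DEL={del_score(0.8, r):.4f}  NormDEL={norm_del(0.8, r):.4f}")
```

Under these assumptions, two runs with the same mIoU are ranked by how little data they retain, because the exponential factor shrinks as $R$ grows. For non-negative mIoU, the sigmoid maps the score into the range from 0.5 to just under 1.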

Abstract

Foundation models, pre-trained on massive datasets, have achieved unprecedented generalizability. However, is it truly necessary to involve such vast amounts of data in pre-training, consuming extensive computational resources? This paper introduces data-effective learning, aiming to use data in the most impactful way to pre-train foundation models. This involves strategies that focus on data quality rather than quantity, ensuring the data used for training has high informational value. Data-effective learning plays a profound role in accelerating foundation model training, reducing computational costs, and saving data storage, which is very important as the volume of medical data in recent years has grown beyond many people's expectations. However, due to the lack of standards and comprehensive benchmarks, medical data-effective learning remains poorly studied. To address this gap, our paper introduces a comprehensive benchmark specifically for evaluating data-effective learning in the medical field. This benchmark includes a dataset with millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively measure data-effective learning performance. Our extensive experimental results show that the baseline MedDEL, using only 5% of the data, can achieve performance comparable to that obtained with the original large dataset. Establishing such an open data-effective learning benchmark is crucial for the medical foundation model research community because it facilitates efficient data use, promotes collaborative breakthroughs, and fosters the development of cost-effective, scalable, and impactful healthcare solutions.

Paper Structure (14 sections, 5 equations, 16 figures, 4 tables, 1 algorithm)

Figures (16)

  • Figure 1: Data-Effective Learning (DEL) enables more efficient pre-training of foundation models. (a) Data-effective learning aims to obtain a compact dataset from a large-scale pre-training dataset such that the two datasets have similar effects on foundation model pre-training. (b) Demonstration of our comprehensive benchmark for DEL. The benchmark includes a dataset of millions of data samples from 31 medical centers (DataDEL), a baseline method for comparison (MedDEL), and a new evaluation metric (NormDEL).
  • Figure 2: Pipeline of the baseline method (MedDEL) for data-effective learning in our benchmark. It illustrates effective removal of disruptive and invalid data from the dataset, aiming to save storage space and computational resources while enhancing model efficiency.
  • Figure 3: Demonstration of the feasibility of data-effective learning. We compare performance when using only 5% of the pre-training data (in red) and when using 100% of the data (in black) on 8 datasets. The results indicate that using only 5% of the pre-training data can achieve results comparable to using 100% of the pre-training data, which supports the validity of the MedDEL method.
  • Figure 5: Demonstration of images deleted by MedDEL. This figure shows MedDEL deleting semantically similar images, which appear to have no significant differences between them from a perceptual perspective.
  • Figure 6: Model generalization experiments across different datasets with different data volumes. The experiments include four distinct datasets: Kvasir-SEG, ImageCLEFmed, CVC-ClinicDB, and CVC-ColonDB. The results show model performance at different volumes of pre-training data (5%, 20%, 50%, and 100%) and compare randomly selected data with data selected using the MedDEL method.
  • ...and 11 more figures
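
The captions for Figures 2 and 5 describe MedDEL as removing redundant, semantically similar images based on their embeddings. The sketch below illustrates only that general idea and is not the authors' pipeline: it swaps the ViT-plus-clustering procedure for a simple greedy pass that drops any sample whose cosine similarity to an already kept sample exceeds a threshold. The function names, the `threshold=0.95` default, and the toy 2-D vectors (standing in for real ViT embeddings) are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_near_duplicates(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Greedy stand-in for redundancy removal (not the paper's algorithm).

    Walks the samples in order and keeps a sample only if it is less
    similar than `threshold` to every sample kept so far.
    Returns the indices of the kept samples.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy embeddings: two pairs of near-duplicates plus one distinct vector.
    toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99], [1.0, 1.0]]
    print(prune_near_duplicates(toy))  # prints [0, 2, 4]
```

The greedy pass compares each sample against every kept sample, so its cost grows quadratically in the worst case. At the million-image scale described for DataDEL, a clustering or approximate nearest-neighbor step would be needed in practice.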