How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks

Dongang Wang, Peilin Liu, Hengrui Wang, Heidi Beadnall, Kain Kyle, Linda Ly, Mariano Cabezas, Geng Zhan, Ryan Sullivan, Weidong Cai, Wanli Ouyang, Fernando Calamante, Michael Barnett, Chenyu Wang

TL;DR

This work tackles the data-need problem in patch-based brain MRI segmentation by introducing MinBAT, a Markov-process-based method to set task-specific DSC targets, and REPS, an ROI-driven patch sampling strategy that standardizes case contributions. Together, they enable an explicit, data-driven estimate of how many cases and ROIs are needed to reach acceptable segmentation performance, demonstrated across brain extraction, tumor segmentation, and MS lesion segmentation. The framework reveals that DSC targets correlate with the ROI surface-to-volume ratio $C = S/V$ and that REPS improves data efficiency, reducing the required data relative to baseline random patch selection. The proposed approach offers practical guidance for planning data collection and federated learning in medical imaging, with potential applicability to other 3D segmentation domains by providing task-specific, evidence-based data size estimates before model development.
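
The two quantities named above can be made concrete with a small sketch. It computes the DSC between two binary masks and a voxel-based surface-to-volume ratio $S/V$. Counting boundary voxels as the surface $S$ is an assumption made here for illustration (the summary does not say how the paper measures the surface), and the helper names `dice`, `erode`, and `surface_to_volume` are hypothetical.

```python
import numpy as np


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice Similarity Coefficient 2|A∩B| / (|A| + |B|) for binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / total)


def erode(mask: np.ndarray) -> np.ndarray:
    """One binary erosion step with face connectivity (6 neighbours in 3D).

    Voxels outside the array are treated as background.
    """
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    out = mask.copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            # A voxel survives only if this neighbour is also foreground.
            out &= np.roll(padded, shift, axis=axis)[core]
    return out


def surface_to_volume(mask: np.ndarray) -> float:
    """Ratio of boundary voxels (assumed surface S) to all ROI voxels (volume V)."""
    mask = mask.astype(bool)
    volume = mask.sum()
    surface = (mask & ~erode(mask)).sum()
    return float(surface / volume)


# Toy ROI: a 10x10x10 cube inside a 12x12x12 volume.
cube = np.zeros((12, 12, 12), dtype=bool)
cube[1:11, 1:11, 1:11] = True

# The same cube shifted by one voxel along the first axis.
shifted = np.zeros_like(cube)
shifted[2:12, 1:11, 1:11] = True

print(surface_to_volume(cube))
print(dice(cube, shifted))
```

Under this voxel-counting definition, small or thin ROIs have a large $S/V$, so the same one-voxel boundary disagreement costs them more DSC than it costs a large compact ROI.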

Abstract

Training deep neural networks reliably requires access to large-scale datasets. However, obtaining such datasets can be challenging, especially in the context of neuroimaging analysis tasks, where the cost associated with image acquisition and annotation can be prohibitive. To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial. This paper focuses on an early phase of deep learning research, prior to model development, and proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks. This framework includes the establishment of performance expectations using a novel Minor Boundary Adjustment for Threshold (MinBAT) method, and the standardization of patch selection through the ROI-based Expanded Patch Selection (REPS) method. Our experiments demonstrate that the Dice Similarity Coefficient (DSC) score that counts as acceptable may vary across tasks whose regions of interest (ROIs) differ in size or shape. By setting an acceptable DSC as the target, the required amount of training data can be estimated and even predicted as data accumulates. This approach could assist researchers and engineers in estimating the cost associated with data collection and annotation when defining a new segmentation task based on deep neural networks, ultimately contributing to their efficient translation to real-world applications.
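
As a rough illustration of the last step, reading off the required amount of data once a DSC target is fixed, the sketch below linearly interpolates a learning curve to find where it first reaches the target. Linear interpolation is a stand-in chosen here for simplicity, not necessarily the estimation procedure used in the paper. The function name `cases_needed` is hypothetical, and the numbers are made-up toy values, not results from the paper.

```python
def cases_needed(n_cases, dsc_scores, target):
    """Interpolated number of training cases at which DSC first reaches `target`.

    `n_cases` must be increasing and aligned with `dsc_scores`.
    Returns None if the curve never reaches the target.
    """
    points = list(zip(n_cases, dsc_scores))
    if not points:
        return None
    if points[0][1] >= target:
        return float(points[0][0])  # target already met with the smallest set
    for (n0, d0), (n1, d1) in zip(points, points[1:]):
        if d0 < target <= d1:
            # Linear interpolation between the two bracketing measurements.
            return n0 + (target - d0) * (n1 - n0) / (d1 - d0)
    return None


# Toy learning curve (illustrative values only).
toy_n = [10, 20, 40, 80]
toy_dsc = [0.70, 0.80, 0.86, 0.88]

print(cases_needed(toy_n, toy_dsc, target=0.83))  # crossing lies between 20 and 40 cases
print(cases_needed(toy_n, toy_dsc, target=0.95))  # None: target not reached yet
```

A `None` result means more data (or a different target) is needed before a crossing point can be read off.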

Paper Structure (18 sections, 12 equations, 9 figures, 1 table)


Figures (9)

  • Figure 1: Visualization of the expected DSC with respect to the ratio $S/V$. The proposed DSC thresholds of the tasks are displayed as stars. The blue curve represents Equation \ref{eq4:integral} under the assumption of minimal mask variation near the boundary, and red stars represent the estimation results with these standard boundary changes. The orange line represents a specific scenario with increased boundary changes, exemplified by the tumor (depicted by the green star). More details and discussions can be found in Section \ref{sec4:exp-thresh} and Section \ref{sec4:hyper}.
  • Figure 2: Example of the MinBAT process, which generates random changes at the boundary of an ROI. Numbers 1 and 3 represent two steps of random dilation, and numbers 2 and 4 represent two steps of random erosion. The black dashed square marks the original mask. In our settings, the random probabilities for morphological changes were set to 0.5. In our experiments on real masks, this process was conducted on 3D boundaries.
  • Figure 3: The pipeline of our ROI-based Expanded Patch Selection (REPS) and its application in selecting the expected number of cases. Three examples are shown to represent brain extraction (top), tumor extraction (middle), and lesion segmentation (bottom). Red patches are used in model training: in each epoch, patches of the same size as the blue ones are randomly selected from within each red patch.
  • Figure 4: The performance of three tasks (brain extraction, tumor segmentation, and lesion segmentation) for different amounts of training data. The black dashed line marks the expected DSC score determined by MinBAT, and the estimated amount of data required is shown at the crossing point.
  • Figure 5: The performance of three tasks (brain extraction, tumor segmentation, and lesion segmentation) for different numbers of training ROIs. The black dashed line marks the expected DSC score determined by MinBAT, and the estimated number of ROIs required is shown at the crossing point.
  • ...and 4 more figures
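
The boundary perturbation described in the Figure 2 caption can be sketched as follows. This is one reading of the caption, not the authors' implementation: steps 1 and 3 add each background voxel touching the ROI with probability 0.5, steps 2 and 4 remove each ROI voxel touching the background with probability 0.5, and the DSC between the original and perturbed masks is averaged over several trials. The function names, the face-connectivity choice, and the number of trials are all assumptions.

```python
import numpy as np


def dice(a, b):
    """Dice Similarity Coefficient for two binary masks."""
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else float(2.0 * (a & b).sum() / total)


def has_neighbour(mask):
    """True where at least one face-adjacent voxel is set in `mask`.

    Voxels outside the array are ignored (treated as not set).
    """
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    out = np.zeros_like(mask)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            out |= np.roll(padded, shift, axis=axis)[core]
    return out


def random_dilate(mask, p, rng):
    """Add each background voxel that touches the ROI with probability p."""
    candidates = has_neighbour(mask) & ~mask
    return mask | (candidates & (rng.random(mask.shape) < p))


def random_erode(mask, p, rng):
    """Remove each ROI voxel that touches the background with probability p."""
    candidates = mask & has_neighbour(~mask)
    return mask & ~(candidates & (rng.random(mask.shape) < p))


def perturbed_dsc(mask, p=0.5, trials=20, seed=0):
    """Mean DSC between `mask` and randomly boundary-perturbed copies of it."""
    rng = np.random.default_rng(seed)
    mask = mask.astype(bool)
    scores = []
    for _ in range(trials):
        m = mask
        for step in range(4):  # dilate, erode, dilate, erode (steps 1 to 4)
            m = random_dilate(m, p, rng) if step % 2 == 0 else random_erode(m, p, rng)
        scores.append(dice(mask, m))
    return float(np.mean(scores))


# Toy ROIs: a large and a small cube in the same 24x24x24 volume.
big = np.zeros((24, 24, 24), dtype=bool)
big[4:20, 4:20, 4:20] = True
small = np.zeros((24, 24, 24), dtype=bool)
small[10:14, 10:14, 10:14] = True

print(perturbed_dsc(big), perturbed_dsc(small))
```

In this toy setting the small cube, which has the larger surface-to-volume ratio, loses more DSC under the same boundary perturbation than the large one. That matches the qualitative relationship the Figure 1 caption describes between the expected DSC and $S/V$.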