Foundation X: Integrating Classification, Localization, and Segmentation through Lock-Release Pretraining Strategy for Chest X-ray Analysis

Nahid Ul Islam, DongAo Ma, Jiaxuan Pang, Shivasakthi Senthil Velan, Michael Gotway, Jianming Liang

TL;DR

Foundation X addresses the challenge of heterogeneous, multi-task annotation in chest X-ray analysis by training a shared-backbone multitask model across 11 public datasets for classification, localization, and segmentation. It introduces Cyclic Pretraining, Lock-Release Pretraining, and a Student-Teacher framework with EMA to balance learning, prevent forgetting, and encourage consistency across tasks. The approach achieves gains over task-specific baselines, demonstrates strong cross-dataset and cross-task generalization, and improves organ localization/segmentation through dedicated dataset-specific heads and decoders. This framework maximizes annotation utilization, reduces labeling costs, and enables robust, adaptable chest X-ray analysis suitable for real-world clinical deployment.

Abstract

Developing robust and versatile deep-learning models is essential for enhancing diagnostic accuracy and guiding clinical interventions in medical imaging, but it requires a large amount of annotated data. The advancement of deep learning has facilitated the creation of numerous medical datasets with diverse expert-level annotations. Aggregating these datasets can maximize data utilization and address the inadequacy of labeled data. However, the heterogeneity of expert-level annotations across tasks such as classification, localization, and segmentation presents a significant challenge for learning from these datasets. To this end, we introduce Foundation X, an end-to-end framework that utilizes diverse expert-level annotations from numerous public datasets to train a foundation model capable of multiple tasks including classification, localization, and segmentation. To address the challenges of annotation and task heterogeneity, we propose a Lock-Release pretraining strategy to enhance the cyclic learning from multiple datasets, combined with the student-teacher learning paradigm, ensuring the model retains general knowledge for all tasks while preventing overfitting to any single task. To demonstrate the effectiveness of Foundation X, we trained a model using 11 chest X-ray datasets, covering annotations for classification, localization, and segmentation tasks. Our experimental results show that Foundation X achieves notable performance gains through extensive annotation utilization, excels in cross-dataset and cross-task learning, and further enhances performance in organ localization and segmentation tasks. All code and pretrained models are publicly accessible at https://github.com/jlianglab/Foundation_X.

Paper Structure

This paper contains 13 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The proposed Foundation X model (detailed in the Methods section) can utilize multiple datasets (Dataset #1 to #11) for pretraining and can also incorporate additional datasets (Dataset #N) dynamically into the pretraining process. The model is trained cyclically, processing each dataset sequentially. Each dataset may include one, two, or all three tasks: classification, localization, and segmentation. The figure illustrates the process with a dataset (e.g., 8. CANDID-PTX [feng2022automated]) containing all three types of ground truths. The process begins with the student model (Swin-B) extracting relevant features from the input dataset, which are then directed sequentially to the appropriate branch. First, for classification, features are processed through the classification head ($C_8$). Second, for localization, features pass through the localization encoder and corresponding localization decoder ($L_3$). Third, for segmentation, features are handled by the segmentation decoder and segmentation head ($S_1$). The model undergoes two-phase training for each task: lock mode with most layers frozen, followed by release mode with all layers trainable. Additionally, the model uses a student-teacher learning paradigm. The teacher model, an identical copy of the student model, is updated after each epoch using an exponential moving average (EMA). We calculate the consistency loss ($L_{\text{const}}$) in three areas: extracted features from the backbone, features from the localization encoder, and features from the segmentation decoder. If a dataset contains only one or two types of ground truths, the model will skip the branch without the corresponding ground truth. The Foundation X model uses the Cyclic and Lock-Release pretraining strategies to enhance performance across tasks while preventing forgetting and avoiding task-specific overfitting.
  • Figure 2: Cross-Dataset & Cross-Task Learning Analysis. The figure demonstrates the performance trends of Foundation X across multiple datasets for both focused and unfocused training scenarios. Focused training refers to scenarios where the model is explicitly trained on the specific dataset being evaluated. In contrast, unfocused training refers to scenarios where the model is trained on other datasets and not directly on the one being evaluated. The green, orange, and blue lines represent classification, localization, and segmentation tasks, respectively. Dark-colored lines indicate focused training results, while light-colored lines show unfocused training results. Dashed lines represent the best testing outcomes from focused training. In some cases, unfocused training surpasses focused training, highlighting the benefits of cross-task and cross-dataset learning in enhancing Foundation X's capabilities. The model efficiently generalizes, retains knowledge of previous tasks, and avoids overfitting during pretraining.
  • Figure 3: Full finetuning of Foundation X outperforms both head-only finetuning of Foundation X and the baseline Swin-B + DINO model whose backbone is initialized with Ark-6 [ma2023foundation] weights. All three settings use the same hyperparameters, which are listed in the supplementary material.
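
The training flow described in the Figure 1 caption (cycle through the datasets, skip branches that lack ground truth, train each task in lock mode and then release mode, and refresh the teacher by EMA) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the dataset names, the `DATASET_TASKS` registry, the function names, and the momentum value are all hypothetical, and parameters are stood in for by plain floats rather than network weights.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical registry: which annotation types each dataset provides.
DATASET_TASKS: Dict[str, List[str]] = {
    "dataset_a": ["classification"],
    "dataset_b": ["classification", "localization"],
    "dataset_c": ["classification", "localization", "segmentation"],
}

# Branch order taken from the caption: classification, localization, segmentation.
TASK_ORDER = ("classification", "localization", "segmentation")


def cyclic_lock_release_schedule(
    dataset_tasks: Dict[str, List[str]], num_cycles: int
) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (cycle, dataset, task, mode) training steps.

    Each available task is visited twice: first in "lock" mode
    (most layers frozen), then in "release" mode (all layers trainable).
    """
    for cycle in range(num_cycles):
        for dataset, tasks in dataset_tasks.items():
            for task in TASK_ORDER:
                if task not in tasks:
                    continue  # skip branches without matching ground truth
                yield cycle, dataset, task, "lock"
                yield cycle, dataset, task, "release"


def ema_update(
    teacher: Dict[str, float], student: Dict[str, float], momentum: float = 0.9
) -> Dict[str, float]:
    """Return teacher <- momentum * teacher + (1 - momentum) * student.

    The momentum value is an assumed placeholder, not a value from the paper.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


if __name__ == "__main__":
    for step in cyclic_lock_release_schedule(DATASET_TASKS, num_cycles=1):
        print(step)
    print(ema_update({"w": 1.0}, {"w": 0.0}, momentum=0.5))
```

In a real training loop, the "lock" and "release" labels would select which parameters receive gradients, and `ema_update` would be applied to the teacher's weights after each epoch, as the caption states.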