Pre-training on High Definition X-ray Images: An Experimental Study

Xiao Wang, Yuehang Li, Wentao Wu, Jiandong Jin, Yao Rong, Bo Jiang, Chuanfu Li, Jin Tang

TL;DR

This work tackles the limitations of prior X-ray pre-training by introducing a high-definition ($1280 \times 1280$) masked auto-encoder trained on over 1 million X-ray images. It integrates a context-aware masking strategy that prioritizes the chest region using chest contours, and uses a ViT-L encoder-decoder to reconstruct masked patches, yielding a robust foundation for downstream tasks. The pre-trained backbone is validated on two radiology tasks: medical report generation (English and Chinese) and disease recognition, with notable improvements over baselines; ablations confirm the effectiveness of context-aware masking and high-resolution inputs. The results demonstrate competitive, sometimes state-of-the-art performance on downstream benchmarks, underscoring the practical potential of large-scale, high-resolution X-ray pre-training for medical AI applications, while also noting computational costs and opportunities for multi-modal enhancements in future work.

Abstract

Existing X-ray pre-trained vision models are usually trained on relatively small-scale datasets (fewer than 500k samples) at limited resolution (e.g., $224 \times 224$). However, the success of self-supervised pre-training of large models depends on massive training data, and preserving high resolution in X-ray images is essential for handling difficult and complex diseases. In this paper, we address these issues by proposing the first high-definition ($1280 \times 1280$) X-ray pre-trained foundation vision model, trained on our newly collected large-scale dataset of more than 1 million X-ray images. Our model follows the masked auto-encoder framework: the tokens that remain after masking (at a high ratio) are used as input, and the masked image patches are reconstructed by a Transformer encoder-decoder network. More importantly, we introduce a novel context-aware masking strategy that uses the chest contour as a boundary for adaptive masking operations. We validate the effectiveness of our model on two downstream tasks: X-ray report generation and disease recognition. Extensive experiments demonstrate that our pre-trained medical foundation vision model achieves comparable or even new state-of-the-art performance on downstream benchmark datasets. The source code and pre-trained models of this paper will be released on https://github.com/Event-AHU/Medical_Image_Analysis.
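
To make the idea of contour-bounded masking concrete, the following is a minimal sketch of one possible reading of it, not the authors' implementation. It assumes a binary chest-region map is already available, a patch size of 16, and a mask ratio of 0.75 applied only to patches that touch the chest region; the function name `context_aware_patch_mask`, its parameters, and all of these values are hypothetical and are not taken from the paper.

```python
import numpy as np

def context_aware_patch_mask(chest_region, patch_size=16, mask_ratio=0.75, seed=0):
    """Return a boolean (rows, cols) grid; True marks a patch hidden from the encoder.

    chest_region: boolean (H, W) array, True inside the chest contour.
    Assumption: masking is sampled only among patches that touch the chest region.
    """
    h, w = chest_region.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    rows, cols = h // patch_size, w // patch_size

    # A patch counts as "chest" if any of its pixels lies inside the contour.
    inside = chest_region.reshape(rows, patch_size, cols, patch_size).any(axis=(1, 3))
    inside_idx = np.flatnonzero(inside.ravel())

    # Sample the masked patches uniformly, without replacement, inside the boundary.
    n_mask = int(round(mask_ratio * inside_idx.size))
    rng = np.random.default_rng(seed)
    chosen = rng.choice(inside_idx, size=n_mask, replace=False)

    mask = np.zeros(rows * cols, dtype=bool)
    mask[chosen] = True
    return mask.reshape(rows, cols)

# Toy example: a square "chest" region in the middle of a 1280 x 1280 image.
region = np.zeros((1280, 1280), dtype=bool)
region[320:960, 320:960] = True
mask = context_aware_patch_mask(region)
print(mask.shape, int(mask.sum()))  # (80, 80) 1200
```

Under these assumptions a $1280 \times 1280$ image yields an $80 \times 80$ grid of 6400 patches, which illustrates why high-resolution pre-training is computationally expensive compared with $224 \times 224$ inputs (196 patches at the same patch size).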

Paper Structure

This paper contains 30 sections, 1 equation, 11 figures, 6 tables.

Figures (11)

  • Figure 1: (a) An illustration of our proposed high-definition X-ray image based pre-training framework using masked auto-encoder. (b, c) are two downstream tasks used for the validation of our pre-training framework.
  • Figure 2: The detailed architecture of the Transformer, from Vaswani et al. (2017).
  • Figure 3: Some representative samples of our collected PCC-Xray dataset.
  • Figure 4: The word cloud of our newly collected PCC-Xray dataset for Chinese medical report generation.
  • Figure 5: Variation of the accuracy on the IU-Xray testing dataset in the training phase.
  • ...and 6 more figures