Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis

Alexandre Englebert; Anne-Sophie Collin; Olivier Cornu; Christophe De Vleeschouwer

TL;DR

This work tackles the scarcity and language limitations of large radiography datasets by performing vision-language pretraining (VLP) on bone X-rays paired with French radiology reports from a single hospital, with a strong anonymization pipeline. It demonstrates that multilingual text encoders and UMLS-based self-alignment (Sap) can enhance image-text embedding spaces, yielding competitive downstream performance on fracture detection, bone age estimation, and osteoarthritis quantification, even with limited supervision. The study systematically evaluates in-hospital and cross-hospital tasks, zero-shot classification and retrieval, and provides extensive latent-space analyses to understand task transfer and data efficiency. The practical impact lies in enabling effective deployment of vision models in healthcare under privacy constraints and language-specific settings, supported by open-source code and dataset processing workflows.

Abstract

This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports to address downstream tasks of interest in bone radiography. A practical pipeline is introduced to anonymize and process French medical reports. Pretraining then consists of the self-supervised alignment of the visual and textual embedding spaces derived from deep model encoders. The resulting image encoder is then used to handle various downstream tasks, including quantification of osteoarthritis, estimation of bone age on pediatric wrists, and detection of bone fractures and anomalies. Our approach demonstrates competitive performance on downstream tasks compared to alternatives requiring a significantly larger amount of human expert annotation. Our work stands as the first study to integrate French reports to shape the embedding space devoted to bone X-ray representations, capitalizing on the large quantity of paired image and report data available in a hospital. By relying on generic vision-language deep models in a language-specific scenario, it contributes to the deployment of vision models for wider healthcare applications.
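The alignment of visual and textual embedding spaces mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which objective is used, so the symmetric contrastive (CLIP-style InfoNCE) loss below is an assumption, as are the function names and the temperature value of 0.07. It only shows the general idea: matched image-report pairs are pulled together and mismatched pairs pushed apart.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum_exp for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical CLIP-style loss: row k of each input is assumed to be a matched pair.

    The temperature is an illustrative assumption, not a value from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and report j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each image should select its own report.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each report should select its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


# Toy two-dimensional embeddings, made up for illustration.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each image embedding coincides with its own report embedding, and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to have.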

Paper Structure (35 sections, 8 figures, 9 tables)


Introduction
Related Work
Self-Supervised Learning
Medical applications of self-supervised Vision-Language Pretraining
Methodology
Learning joint representations for clinical reports and X-ray images
Vision-language pretraining (VLP)
Text encoder and self-alignment pretraining (Sap)
Image Encoder
Downstream tasks
Tasks trained on images captured in the same hospital as the VLP ones
Tasks trained on images captured outside the hospital considered by VLP
Zero-Shot Tasks
Experimental validation
Vision-language pretraining on Bone X-Rays and French Reports
...and 20 more sections

Figures (8)

Figure 1: General overview: Vision-Language Pretraining (VLP) consists of aligning the embeddings of X-rays and French reports. Once pretrained, the encoders can be adapted to different downstream tasks.
Figure 2: Classification AUROC achieved when training a linear projection of a frozen vision encoder on varying numbers of images obtained from the same hospital as the dataset used for vision-language pretraining (VLP). The shaded areas around the lines represent the 95% confidence intervals, calculated from 8 training sessions, with different seeds used for sampling these images. The VLP models achieved better performance than a model trained on ImageNet, even when using an order of magnitude fewer images during training. Only two VLP models are represented; the others are shown in Appendix \ref{app:osrx_vlp_all}.
Figure 3: t-SNE visualizations of the embeddings of MURA images with and without VLP pretraining. The VLP models show better clustering than the ImageNet model for both anatomical locations and labels. No training on MURA was conducted for any of the models. Clusters tend to form predominantly based on anatomical location. However, within a given anatomical site, several clusters frequently emerge, the most notable being clusters with osteosynthesis material (visible in the second row of the figure as clusters composed only of abnormal images). Figures for all models can be found in Appendix \ref{app:latent_additional}.
Figure 4: LDA visualizations of the embeddings of MURA images with and without VLP pretraining. The VLP models show better separability than the ImageNet model, with more shifted peaks. No training on MURA was conducted for any of the models. Figures for all models can be found in Appendix \ref{app:latent_additional}.
Figure 5: Classification results achieved using a linear projection of a frozen vision encoder on varying numbers of images obtained from the same hospital as the pretraining dataset. The shaded areas around the lines represent the 95% confidence intervals, calculated from 8 training sessions for each specified number of training images, with different seeds used for sampling these images. This additional figure shows the performance of all the VLP models. The ImageNet and random models were removed for readability. See Figure \ref{fig:osrx_plot} for a comparison with the ImageNet and random models.
...and 3 more figures
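
The captions of Figures 2 and 5 report 95% confidence intervals computed from 8 training sessions per training-set size. The captions do not say how the intervals are derived, so the sketch below assumes a standard Student t interval on the mean of the 8 scores. The function name and the AUROC values are hypothetical and are not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t distribution with 7 degrees
# of freedom (8 runs minus 1), taken from standard t tables.
T_CRIT_95_DF7 = 2.365


def mean_ci95_from_8_runs(scores):
    """Return (mean, lower, upper) for exactly 8 scores, one per random seed."""
    if len(scores) != 8:
        raise ValueError("this sketch assumes exactly 8 training sessions")
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_95_DF7 * sem
    return mean, mean - half_width, mean + half_width


# Made-up AUROC values for one training-set size, for illustration only.
example_scores = [0.81, 0.83, 0.80, 0.82, 0.84, 0.79, 0.82, 0.83]
mean, lower, upper = mean_ci95_from_8_runs(example_scores)
```

With only 8 runs, the t critical value is noticeably larger than the normal-approximation value of 1.96, so a t interval gives a wider and more conservative shaded band.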
