OTCXR: Rethinking Self-supervised Alignment using Optimal Transport for Chest X-ray Analysis

Vandan Gorade, Azad Singh, Deepak Mishra

TL;DR

OTCXR targets two weaknesses of conventional chest X-ray SSL, weak semantic alignment and loss of subtle detail, by reframing semantic alignment as an optimal-transport problem over dense feature maps and enriching it with a Cross-Viewpoint Semantics Infusion Module (CV-SIM). The framework optimizes a transport-based alignment loss $L_{OT}$ combined with variance and covariance regularizations, yielding the final objective $L_{MT}=\alpha L_{OT}+\beta[\mathrm{var}(q_s)+\mathrm{var}(q_t)]+\eta[\mathrm{cov}(q_s)+\mathrm{cov}(q_t)]$. Empirical results on NIH Chest X-ray14, VinBig-CXR, and RSNA show that OTCXR outperforms state-of-the-art SSL methods under both fine-tuning and linear evaluation, and that its representations transfer well to segmentation tasks. The results indicate that dense semantic invariance and cross-view contextualization improve both localization and diagnostic accuracy, particularly in limited-label regimes. These findings highlight the practical potential of OT-based SSL for data-efficient medical image analysis and broader chest X-ray applications.
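
As a rough illustration of how the final objective composes, the sketch below assumes VICReg-style definitions for the two regularizers: a hinge on the per-dimension standard deviation for the variance term, and a penalty on off-diagonal covariance entries for the covariance term. The summary does not spell these definitions out, so the function names, the `gamma` and `eps` parameters, and the default weights are all hypothetical. The OT loss is passed in as a precomputed scalar.

```python
import numpy as np


def variance_term(q, gamma=1.0, eps=1e-4):
    """Hinge penalty that pushes each embedding dimension's std above gamma.

    Assumed VICReg-style form; q has shape (batch, dim).
    """
    std = np.sqrt(q.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, gamma - std)))


def covariance_term(q):
    """Sum of squared off-diagonal covariance entries, scaled by 1/dim.

    Assumed VICReg-style form; q has shape (batch, dim).
    """
    n, d = q.shape
    centered = q - q.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float((off_diag ** 2).sum() / d)


def total_objective(l_ot, q_s, q_t, alpha=1.0, beta=1.0, eta=1.0):
    """L_MT = alpha * L_OT + beta * [var(q_s) + var(q_t)] + eta * [cov(q_s) + cov(q_t)]."""
    var = variance_term(q_s) + variance_term(q_t)
    cov = covariance_term(q_s) + covariance_term(q_t)
    return alpha * l_ot + beta * var + eta * cov


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_s = rng.normal(size=(32, 16))  # stand-in for expanded embeddings q_s
    q_t = rng.normal(size=(32, 16))  # stand-in for expanded embeddings q_t
    print(total_objective(l_ot=0.3, q_s=q_s, q_t=q_t))
```

Under these assumed definitions, a collapsed batch (every row identical) incurs the maximum variance penalty and zero covariance penalty, which is the failure mode the variance term is meant to discourage.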

Abstract

Self-supervised learning (SSL) has emerged as a promising technique for analyzing medical modalities such as X-rays due to its ability to learn without annotations. However, conventional SSL methods face challenges in achieving semantic alignment and capturing subtle details, which limits their ability to accurately represent the underlying anatomical structures and pathological features. To address these limitations, we propose OTCXR, a novel SSL framework that leverages optimal transport (OT) to learn dense semantic invariance. By integrating OT with our innovative Cross-Viewpoint Semantics Infusion Module (CV-SIM), OTCXR enhances the model's ability to capture not only local spatial features but also global contextual dependencies across different viewpoints. This approach enriches the effectiveness of SSL in the context of chest radiographs. Furthermore, OTCXR incorporates variance and covariance regularizations within the OT framework to prioritize clinically relevant information while suppressing less informative features. This ensures that the learned representations are comprehensive and discriminative, particularly beneficial for tasks such as thoracic disease diagnosis. We validate OTCXR's efficacy through comprehensive experiments on three publicly available chest X-ray datasets. Our empirical results demonstrate the superiority of OTCXR over state-of-the-art methods across all evaluated tasks, confirming its capability to learn semantically rich representations.

Paper Structure (11 sections, 3 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Architecture of the OTCXR framework. $I_s$ and $I_t$, two augmented versions of $I$, pass through backbone encoders $f_s$ and $f_t$ to obtain $z_s$ and $z_t$, which the OT solver then uses together with the cost matrix $M$. $g_s$ and $g_t$ are the feature vectors after the global average pooling (GAP) layer; they are expanded to $q_s$ and $q_t$ to increase representational variability.
  • Figure 2: Diagnostic heat maps for chest X-ray images, generated by OTCXR and the SSL baseline methods after fine-tuning with 1% of the training samples from the NIH dataset.
  • Figure 3: Segmentation results on the SIIM-ACR Pneumothorax dataset, obtained by fine-tuning the representations learned during NIH pre-training.
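
The OT step described in the Figure 1 caption can be sketched as follows. This is only an illustration under several assumptions that the captions do not confirm: the dense feature maps $z_s$ and $z_t$ are flattened to arrays of shape (N, C), the cost matrix $M$ is the cosine distance between locations, both marginals are uniform, and the solver is entropic Sinkhorn. The function names and the `reg` and `n_iters` parameters are hypothetical.

```python
import numpy as np


def cosine_cost(z_s, z_t, eps=1e-8):
    """Cost matrix M[i, j] = 1 - cosine similarity of z_s[i] and z_t[j]."""
    z_s = z_s / (np.linalg.norm(z_s, axis=1, keepdims=True) + eps)
    z_t = z_t / (np.linalg.norm(z_t, axis=1, keepdims=True) + eps)
    return 1.0 - z_s @ z_t.T


def sinkhorn(M, reg=0.1, n_iters=200):
    """Entropic-regularized transport plan with uniform marginals (assumed)."""
    n, m = M.shape
    a = np.full(n, 1.0 / n)  # source marginal over spatial locations
    b = np.full(m, 1.0 / m)  # target marginal over spatial locations
    K = np.exp(-M / reg)     # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)    # rescale to match column marginals
        u = a / (K @ v)      # rescale to match row marginals
    return u[:, None] * K * v[None, :]


def ot_alignment_loss(z_s, z_t, reg=0.1):
    """Transport cost <T, M> between two sets of dense features."""
    M = cosine_cost(z_s, z_t)
    T = sinkhorn(M, reg=reg)
    return float((T * M).sum()), T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_s = rng.normal(size=(49, 32))  # e.g. a 7x7 feature map with 32 channels
    z_t = rng.normal(size=(49, 32))
    loss, plan = ot_alignment_loss(z_s, z_t)
    print(loss, plan.shape)
```

In a training setup the same computation would need to run in an autodiff framework so that gradients flow through the plan; NumPy is used here only to keep the sketch self-contained.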