OTCXR: Rethinking Self-supervised Alignment using Optimal Transport for Chest X-ray Analysis
Vandan Gorade, Azad Singh, Deepak Mishra
TL;DR
OTCXR addresses the weak semantic alignment of conventional chest X-ray SSL by reframing alignment as an optimal-transport problem over dense feature maps and enriching it with a Cross-Viewpoint Semantics Infusion Module (CV-SIM). The framework optimizes a transport-based alignment loss $L_{OT}$ combined with variance and covariance regularizations, yielding the final objective $L_{MT}=\alpha L_{OT}+\beta[\mathrm{var}(q_s)+\mathrm{var}(q_t)]+\eta[\mathrm{cov}(q_s)+\mathrm{cov}(q_t)]$. Empirical results on NIH Chest X-ray14, VinBig-CXR, and RSNA show that OTCXR outperforms state-of-the-art SSL methods under both fine-tuning and linear evaluation, and that its representations transfer well to segmentation tasks. The approach demonstrates that dense semantic invariance and cross-view contextualization improve both localization and diagnostic accuracy, particularly in limited-label regimes. These findings highlight the practical potential of OT-based SSL for data-efficient chest X-ray analysis and for medical image analysis more broadly.
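To make the structure of $L_{MT}$ concrete, the following is a minimal NumPy sketch of how such an objective could be assembled. It is not the authors' implementation. Several choices are assumptions made only for illustration: $q_s$ and $q_t$ are treated as `(locations, channels)` arrays of dense features from two views, $L_{OT}$ is computed with entropic (Sinkhorn) optimal transport under a cosine cost and uniform marginals, and the variance and covariance terms follow the hinge and off-diagonal penalties familiar from VICReg-style regularizers. All function names and default weights are hypothetical.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals (standard Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # assumed uniform mass over view-s locations
    b = np.full(m, 1.0 / m)          # assumed uniform mass over view-t locations
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_alignment_loss(q_s, q_t, eps=0.1):
    """Transport cost between two dense feature sets (cosine cost is an assumption)."""
    qs = q_s / np.linalg.norm(q_s, axis=1, keepdims=True)
    qt = q_t / np.linalg.norm(q_t, axis=1, keepdims=True)
    cost = 1.0 - qs @ qt.T
    plan = sinkhorn_plan(cost, eps=eps)
    return float(np.sum(plan * cost))

def variance_term(q, gamma=1.0, tiny=1e-4):
    """Hinge on the per-channel standard deviation (VICReg-style, assumed form)."""
    std = np.sqrt(q.var(axis=0) + tiny)
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_term(q):
    """Sum of squared off-diagonal covariance entries, scaled by channel count."""
    n, d = q.shape
    qc = q - q.mean(axis=0)
    cov = qc.T @ qc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

def total_loss(q_s, q_t, alpha=1.0, beta=1.0, eta=1.0):
    """L_MT = alpha * L_OT + beta * [var(q_s) + var(q_t)] + eta * [cov(q_s) + cov(q_t)]."""
    l_ot = ot_alignment_loss(q_s, q_t)
    l_var = variance_term(q_s) + variance_term(q_t)
    l_cov = covariance_term(q_s) + covariance_term(q_t)
    return alpha * l_ot + beta * l_var + eta * l_cov

# Toy usage: 16 spatial locations with 8 channels per view.
rng = np.random.default_rng(0)
q_s = rng.normal(size=(16, 8))
q_t = rng.normal(size=(16, 8))
loss = total_loss(q_s, q_t, alpha=1.0, beta=1.0, eta=0.04)
```

In this sketch the weights $\alpha$, $\beta$, and $\eta$ simply rescale the three terms; the values shown are placeholders rather than settings reported by the paper.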
Abstract
Self-supervised learning (SSL) has emerged as a promising technique for analyzing medical modalities such as X-rays because it can learn without annotations. However, conventional SSL methods struggle to achieve semantic alignment and to capture subtle details, which limits their ability to accurately represent the underlying anatomical structures and pathological features. To address these limitations, we propose OTCXR, a novel SSL framework that leverages optimal transport (OT) to learn dense semantic invariance. By integrating OT with our Cross-Viewpoint Semantics Infusion Module (CV-SIM), OTCXR enhances the model's ability to capture not only local spatial features but also global contextual dependencies across different viewpoints, making SSL more effective on chest radiographs. Furthermore, OTCXR incorporates variance and covariance regularizations within the OT framework to prioritize clinically relevant information while suppressing less informative features. This ensures that the learned representations are comprehensive and discriminative, which is particularly beneficial for tasks such as thoracic disease diagnosis. We validate OTCXR's efficacy through comprehensive experiments on three publicly available chest X-ray datasets. Our empirical results show that OTCXR outperforms state-of-the-art methods across all evaluated tasks, confirming its capability to learn semantically rich representations.
