Masks and Manuscripts: Advancing Medical Pre-training with End-to-End Masking and Narrative Structuring
Shreyank N Gowda, David A. Clifton
TL;DR
This work tackles semantic and morphological variability in medical contrastive learning caused by heterogeneous radiology reports. It introduces a two-step framework: first standardizing text into triplets and converting them into binary observations and verdicts, and second applying Meijering-based masking for image pre-training within a multimodal contrastive setup. The approach unifies masked image modeling with text-to-triplet manuscript generation under a joint objective that combines $\mathcal{L}_{MVLM}$, $\mathcal{L}_{ITC}$, and $\mathcal{L}_{ITM}$ to achieve robust cross-modal representations. Empirical results across classification, grading, segmentation, and zero-shot tasks on multiple datasets demonstrate state-of-the-art performance and significant gains in data-scarce regimes, highlighting practical impact for medical image analysis and report generation. The method shows promise for generalization to other imaging modalities such as MRI, enhancing reliability and clinical utility of automated medical understanding.
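As an illustration only, one common way to combine three such terms is a weighted sum. The weights $\lambda_1, \lambda_2, \lambda_3$ below are hypothetical symbols introduced for this sketch; the summary above states only that the three losses are combined, not how they are weighted.

$$
\mathcal{L} = \lambda_1 \mathcal{L}_{MVLM} + \lambda_2 \mathcal{L}_{ITC} + \lambda_3 \mathcal{L}_{ITM}
$$

Setting $\lambda_1 = \lambda_2 = \lambda_3 = 1$ reduces this to a plain unweighted sum of the three terms.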
Abstract
Contemporary medical contrastive learning is hampered by inconsistent semantics and sample-pair morphology, which lead to dispersed and converging semantic shifts. Text reports are written by multiple authors, and the resulting variability makes semantic consistency harder to maintain. To tackle these issues, we propose a two-step approach. First, text reports are converted into a standardized triplet format, which lays the groundwork for our novel concepts of ``observations'' and ``verdicts''. Each {Entity, Position, Exist} triplet is refined into binary questions that lead to a clear ``verdict''. Second, we introduce Meijering-based masking for visual pre-training, which focuses on features representative of the local context of medical images. By integrating this masking with our text conversion method, our model advances cross-modal representation within a multimodal contrastive learning framework, setting new benchmarks in medical image analysis.
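The text-side step can be pictured with a minimal Python sketch that turns {Entity, Position, Exist} triplets into binary question-and-answer "observations" followed by a "verdict". The class `Triplet`, the functions `to_observation` and `to_manuscript`, the question template, and the rule that the verdict lists the entities marked as present are all assumptions made for illustration; the abstract does not specify the exact wording or the rule the authors use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One {Entity, Position, Exist} record (hypothetical container)."""
    entity: str
    position: str
    exist: bool


def to_observation(t: Triplet) -> tuple[str, str]:
    """Turn a triplet into a binary question and its yes/no answer.

    The question template is an assumption, not the authors' wording.
    """
    question = f"Is there {t.entity} in the {t.position}?"
    answer = "yes" if t.exist else "no"
    return question, answer


def to_manuscript(triplets: list[Triplet]) -> str:
    """Render observations line by line, then a closing verdict.

    Assumed rule: the verdict lists every entity marked as present,
    or "no finding" when none is present.
    """
    lines = []
    for t in triplets:
        question, answer = to_observation(t)
        lines.append(f"Observation: {question} {answer}.")
    present = [t.entity for t in triplets if t.exist]
    verdict = ", ".join(present) if present else "no finding"
    lines.append(f"Verdict: {verdict}.")
    return "\n".join(lines)


if __name__ == "__main__":
    report = [
        Triplet("pleural effusion", "left lower lobe", True),
        Triplet("pneumothorax", "right apex", False),
    ]
    print(to_manuscript(report))
```

Because every report is reduced to the same template, two reports that describe the same findings in different words map to identical text, which is the kind of standardization the abstract motivates.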
