Masks and Manuscripts: Advancing Medical Pre-training with End-to-End Masking and Narrative Structuring

Shreyank N Gowda, David A. Clifton

TL;DR

This work tackles semantic and morphological variability in medical contrastive learning caused by heterogeneous radiology reports. It introduces a two-step framework: first standardizing text into triplets and converting them into binary observations and verdicts, and second applying Meijering-based masking for image pre-training within a multimodal contrastive setup. The approach unifies masked image modeling with text-to-triplet manuscript generation under a joint objective that combines $\mathcal{L}_{MVLM}$, $\mathcal{L}_{ITC}$, and $\mathcal{L}_{ITM}$ to achieve robust cross-modal representations. Empirical results across classification, grading, segmentation, and zero-shot tasks on multiple datasets demonstrate state-of-the-art performance and significant gains in data-scarce regimes, highlighting practical impact for medical image analysis and report generation. The method shows promise for generalization to other imaging modalities such as MRI, enhancing reliability and clinical utility of automated medical understanding.
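As a rough illustration of how the three loss terms might be combined, the joint objective can be written as a weighted sum. The weights $\lambda_{ITC}$ and $\lambda_{ITM}$ are assumptions introduced here for illustration only; this summary does not state whether the terms are weighted or simply added (which would correspond to setting both weights to 1).

$$
\mathcal{L} = \mathcal{L}_{MVLM} + \lambda_{ITC}\,\mathcal{L}_{ITC} + \lambda_{ITM}\,\mathcal{L}_{ITM}
$$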

Abstract

Contemporary medical contrastive learning faces challenges from inconsistent semantics and sample pair morphology, leading to dispersed and converging semantic shifts. The variability in text reports, due to multiple authors, complicates semantic consistency. To tackle these issues, we propose a two-step approach. Initially, text reports are converted into a standardized triplet format, laying the groundwork for our novel concept of "observations" and "verdicts". This approach refines the {Entity, Position, Exist} triplet into binary questions, guiding towards a clear "verdict". We also innovate in visual pre-training with Meijering-based masking, focusing on features representative of medical images' local context. By integrating this with our text conversion method, our model advances cross-modal representation in a multimodal contrastive learning framework, setting new benchmarks in medical image analysis.
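The conversion from {Entity, Position, Exist} triplets to binary observations and a verdict can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Triplet` class, the question template, and the rule that the verdict is positive when any observation is positive are all assumptions made here for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One {Entity, Position, Exist} triplet extracted from a report."""
    entity: str
    position: str
    exist: bool


def to_observation(t: Triplet) -> Tuple[str, int]:
    """Turn a triplet into a binary question and its 0/1 label.

    The question template is hypothetical, chosen only for illustration.
    """
    question = f"Is there {t.entity} in the {t.position}?"
    return question, int(t.exist)


def to_manuscript(triplets: List[Triplet]) -> Tuple[List[Tuple[str, int]], int]:
    """Return per-observation (question, label) pairs plus a verdict.

    Assumption: the verdict is 1 when any observation is positive.
    The paper may define the verdict differently.
    """
    observations = [to_observation(t) for t in triplets]
    verdict = int(any(label for _, label in observations))
    return observations, verdict


# Made-up example triplets, not taken from any dataset.
triplets = [
    Triplet("pleural effusion", "left lung base", True),
    Triplet("pneumothorax", "right lung", False),
]
observations, verdict = to_manuscript(triplets)
for question, label in observations:
    print(label, question)
print("verdict:", verdict)
```

Each observation carries its own binary label, so reports written by different authors reduce to the same question-and-answer form regardless of their original wording.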

Paper Structure

This paper contains 19 sections, 4 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: An overview of the architecture for integrated modeling of masked medical visual and linguistic data. The blue and green pathways represent the flow of information for the reconstruction of images and text, respectively. The dashed lines show the intermodal contribution of exposed signals for the creation of attention.
  • Figure 2: Comparing masking strategies: (a) Masking random patches for reconstruction often results in blurry outputs lacking in detail. (b) Filtering the image before reconstruction preserves fine-grained details, leading to higher resolution outcomes.
  • Figure 3: Our report generation process starts with the triplet extraction method from MedKLIP, as outlined in part (a). Instead of adopting the Knowledge-enhanced Triplet Encoding, we transform these into a textual report, embedding it sentence by sentence. This approach facilitates masked pre-training by providing binary labels for each observation and the verdict.
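For reference, the random patch masking baseline described in Figure 2(a) can be sketched as below. This is a simplified illustration on nested lists, and the function names, patch size, and mask ratio are assumptions made here. The captions do not specify how the filtered variant in Figure 2(b) combines the filter with masking; one plausible starting point would be to pass the image through a ridge filter such as `skimage.filters.meijering` from scikit-image first, which is not shown here.

```python
import random
from typing import List


def random_patch_mask(grid_h: int, grid_w: int, mask_ratio: float,
                      seed: int = 0) -> List[List[int]]:
    """Return a grid_h x grid_w grid: 1 marks a masked patch, 0 a visible one."""
    n = grid_h * grid_w
    n_masked = int(n * mask_ratio)  # floor, so the count never exceeds the ratio
    rng = random.Random(seed)
    masked = set(rng.sample(range(n), n_masked))
    return [[1 if r * grid_w + c in masked else 0 for c in range(grid_w)]
            for r in range(grid_h)]


def apply_mask(image: List[List[float]], mask: List[List[int]],
               patch: int) -> List[List[float]]:
    """Zero out every pixel that falls inside a masked patch."""
    return [[0.0 if mask[y // patch][x // patch] else px
             for x, px in enumerate(row)]
            for y, row in enumerate(image)]


# Toy 8x8 image of ones, split into a 4x4 grid of 2x2 patches.
image = [[1.0] * 8 for _ in range(8)]
mask = random_patch_mask(grid_h=4, grid_w=4, mask_ratio=0.75)
masked_image = apply_mask(image, mask, patch=2)
visible_pixels = sum(sum(row) for row in masked_image)
print("masked patches:", sum(sum(row) for row in mask))
print("visible pixels:", visible_pixels)
```

With a 0.75 ratio on a 4x4 grid, 12 of the 16 patches are hidden, so only 16 of the 64 pixels remain visible to the encoder.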