Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

Hanqi Jiang, Xixuan Hao, Yuzhou Huang, Chong Ma, Jiaxun Zhang, Yi Pan, Ruimao Zhang

TL;DR

HybridMED addresses data scarcity and the semantic hierarchy of radiology reports by aligning global-level image representations with the impression (loss $\mathcal{L}_{CG}$) and token-level image representations with the findings (loss $\mathcal{L}_{CL}$). Two parallel generative decoders learn to produce the same impression, conditioned on either images (captioning branch, $\mathcal{L}_{Cap}$) or findings (summarization branch, $\mathcal{L}_{Sum}$), and a knowledge-distillation loss $\mathcal{L}_{Dis}$ transfers guidance from the summarization branch to the captioning branch. The pre-training objective is $\mathcal{L} = \lambda_{CG}\mathcal{L}_{CG} + \lambda_{CL}\mathcal{L}_{CL} + \lambda_{Sum}\mathcal{L}_{Sum} + \lambda_{Cap}\mathcal{L}_{Cap} + \lambda_{Dis}\mathcal{L}_{Dis}$, trained with a two-stage schedule (stage 1: summarization only; stage 2: joint cross-modal alignment with a frozen summarization teacher). With pre-training on MIMIC-CXR, the authors report state-of-the-art zero-shot and fine-tuned performance on RSNA Pneumonia, CheXpert, VQA-RAD, and Med-VQA2019, supported by qualitative attention visualizations and by ablations on multi-level alignment and distillation. Overall, HybridMED advances Med-VLP by exploiting hierarchical radiology semantics to produce richer radiograph representations with efficient parameter usage and broad downstream utility.
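To make the weighted objective and the two-stage schedule concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the helper names `total_loss` and `stage_weights` are hypothetical, the weight values are placeholders (the actual $\lambda$ values are not given here), and setting $\lambda_{Sum}$ to zero in stage 2 is an assumption drawn from the summarization teacher being frozen.

```python
from typing import Dict

# Term names mirror the subscripts of the objective: CG, CL, Sum, Cap, Dis.
TERMS = ("CG", "CL", "Sum", "Cap", "Dis")


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the five pre-training terms."""
    return sum(weights[term] * losses[term] for term in TERMS)


def stage_weights(stage: int) -> Dict[str, float]:
    """Placeholder lambda values per stage (illustrative, not from the paper)."""
    if stage == 1:
        # Stage 1: only the summarization branch is trained.
        return {"CG": 0.0, "CL": 0.0, "Sum": 1.0, "Cap": 0.0, "Dis": 0.0}
    if stage == 2:
        # Stage 2: joint alignment, captioning, and distillation.
        # Assumption: the frozen summarization teacher adds no loss term.
        return {"CG": 1.0, "CL": 1.0, "Sum": 0.0, "Cap": 1.0, "Dis": 1.0}
    raise ValueError(f"unknown stage: {stage}")


# Made-up per-term loss values for one batch.
example_losses = {"CG": 0.5, "CL": 0.25, "Sum": 2.0, "Cap": 4.0, "Dis": 1.0}
stage1_total = total_loss(example_losses, stage_weights(1))
stage2_total = total_loss(example_losses, stage_weights(2))
```

In a real training loop each entry of `example_losses` would be a differentiable tensor rather than a float, but the combination rule is the same weighted sum.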

Abstract

This paper introduces an innovative approach to Medical Vision-Language Pre-training (Med-VLP) in the specialized context of radiograph representation learning. While conventional methods frequently merge textual annotations into unified reports, we acknowledge the intrinsic hierarchical relationship between the findings and impression sections in radiograph datasets. To establish a targeted correspondence between images and texts, we propose a novel HybridMED framework that aligns global-level visual representations with the impression and token-level visual representations with the findings. Moreover, our framework incorporates a generation decoder with two proxy tasks, each responsible for generating the impression: (1) from images, via a captioning branch, and (2) from findings, via a summarization branch. Additionally, knowledge distillation is leveraged to facilitate the training process. Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing the parameter count, because the two branches share their self-attention and feed-forward layers.
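The distillation from the summarization branch (teacher) to the captioning branch (student) can be illustrated with a small sketch. The exact form of the distillation loss is not stated here, so the code below assumes a common choice: a temperature-softened KL divergence between the two branches' per-token vocabulary distributions, averaged over tokens. The function names and the temperature value are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
    temperature: float = 2.0,
) -> float:
    """Mean per-token KL(teacher || student) on temperature-softened outputs.

    Assumed form only: teacher = summarization branch (frozen),
    student = captioning branch.
    """
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Two tokens over a toy vocabulary of three entries.
teacher = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
student = [[0.0, 2.0, -1.0], [0.5, 1.5, 0.0]]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(teacher, student)
```

Because the teacher is frozen in the second stage, only the student's logits would receive gradients from this term in an actual training setup.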

Paper Structure

This paper contains 19 sections, 15 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: An example of the semantic hierarchy between the "findings" and "impression" sections of a radiograph report from the MIMIC-CXR dataset.
  • Figure 2: Captioning loss and summarization loss on the MIMIC-CXR validation set. The summarization loss converges better, indicating that, under equivalent generative objectives, captioning is the more challenging task.
  • Figure 3: The HybridMED framework, presented in three parts. (a) shows the overall framework, which comprises multi-modal alignment across multi-level semantic hierarchies and parallel generative distillation decoders. (b) details the two parallel generative decoders: the self-attention and feed-forward layers share weights across the two branches, while the cross-attention layers differ because they are conditioned on different modalities. The summarization branch also distills its outputs to guide the captioning branch. (c) illustrates the downstream tasks.
  • Figure 4: Visualization of cross-modality attention maps. The prompts are (a) Atelectasis, (b) Consolidation, and (c) Pleural Effusion.
  • Figure 5: t-SNE visualizations on the CheXpert 5x200 dataset for CLIP and HybridMED. Points are colored by ground-truth disease type and shown with their corresponding cluster assignments.