IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training

Che Liu; Sibo Cheng; Miaojing Shi; Anand Shah; Wenjia Bai; Rossella Arcucci

TL;DR

A novel clinical prior guided VLP framework named IMITATE is proposed to learn the structure information from medical reports with hierarchical vision-language alignment, which outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks.

Abstract

In the field of medical Vision-Language Pre-training (VLP), significant efforts have been devoted to deriving text and image features from both clinical reports and associated medical images. However, most existing methods may have overlooked the opportunity to leverage the inherent hierarchical structure of clinical reports, which are generally split into 'findings' for descriptive content and 'impressions' for conclusive observations. Instead of utilizing this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge when formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports for vision-language alignment. The code related to this paper is available at https://github.com/cheliu-computation/IMITATE-TMI2024.
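To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "hierarchical alignment" can be illustrated by summing two image-text contrastive terms (multi-level visual features against 'findings' embeddings, high-level visual features against 'impressions' embeddings), and that "clinical prior knowledge" can be illustrated by softening the one-hot contrastive targets with report-to-report similarity. The function names, the blending weight `lam`, the temperature value, and the smoothing scheme are all hypothetical choices made for illustration; the paper's actual loss may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def soft_targets(report_emb, lam):
    """Hypothetical prior: blend one-hot targets with report-report similarity.

    lam = 0 recovers the classic one-hot targets; larger lam gives more
    weight to pairs whose reports look alike. This is an assumption, not
    the paper's definition of the clinical prior.
    """
    r = l2_normalize(report_emb)
    sim = np.clip(r @ r.T, 0.0, None)          # keep non-negative similarities
    sim = sim / sim.sum(axis=1, keepdims=True)  # each row sums to 1
    return (1.0 - lam) * np.eye(sim.shape[0]) + lam * sim

def contrastive_loss(img_emb, txt_emb, targets=None, temperature=0.07):
    """Symmetric image-text cross-entropy; one-hot targets give classic InfoNCE."""
    v, t = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = v @ t.T / temperature
    if targets is None:
        targets = np.eye(v.shape[0])
    i2t = -(targets * log_softmax(logits)).sum(axis=1).mean()
    t2i = -(targets * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (i2t + t2i)

def hierarchical_loss(z_multi, z_high, z_findings, z_impressions, lam=0.5):
    """Assumed pairing: multi-level features vs. findings, high-level vs. impressions."""
    loss_f = contrastive_loss(z_multi, z_findings, soft_targets(z_findings, lam))
    loss_i = contrastive_loss(z_high, z_impressions, soft_targets(z_impressions, lam))
    return loss_f + loss_i
```

With `lam = 0` the targets reduce to the identity matrix, so the sketch falls back to the classic contrastive loss that treats every non-matching image-report pair in a batch as an equally hard negative.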

Paper Structure (24 sections, 7 equations, 5 figures, 13 tables)

Introduction
Related Work
General Vision-Language Pre-training
Medical Vision-Language Pre-training
Method
Overview
Semantic Difference in Hierarchical Medical Report
Hierarchical Vision-Language Alignment
Clinical-Informed Contrastive loss
Total Loss
Experiments and Analysis
Vision-Language Pre-training Configuration
Downstream Tasks
Results
Medical Image Classification
...and 9 more sections

Figures (5)

Figure 1: Architecture comparison between conventional VLP methods and the proposed method, IMITATE. (a) Conventional VLP approaches [convirt, clip, huang2021gloria, mgca] align the high-level visual feature with the entire medical report via a classic contrastive loss ($\mathcal{L}^{CL}$). (b) IMITATE leverages clinical prior knowledge to perform hierarchical alignment between multi-level visual features from medical images and descriptive and conclusive textual features from medical reports. Moreover, it utilizes a clinically-informed contrastive loss ($\mathcal{L}^{CICL}$), which takes into account clinical correlations among different image-report pairs. $E_v$ and $E_t$ denote the vision and text encoders, respectively; $E_t$ is frozen. $\mathcal{P}(\cdot)$ indicates the hierarchical aggregation block.
Figure 2: Overview of the proposed framework. (a) Each image is augmented into two different views ($x_{v}^{1}, x_{v}^{2}$), which are provided as input to a vision-to-language (V-L) alignment branch and a vision-to-vision (V-V) alignment branch. (b) The V-V branch aligns the visual features of the two augmented views. MHSA indicates the multi-head self-attention mechanism. $p_v$ denotes a non-linear projector for visual features. [CLS] indicates the special token used to aggregate multi-level visual features. The dashed black line indicates the feature channel dropping mechanism. (c) The V-L branch aligns different levels of visual features to the text features from the Findings and Impressions sections of the report. $p_t$ denotes a non-linear projector for textual features. The [CLS] token serves to aggregate multi-level visual features into $z_{v,m}$ and to facilitate hierarchical alignment between visual features and $z_{t, F}$. $E_t$ denotes a frozen pre-trained language model.
Figure 3: Left: 2D t-SNE visualization of text embeddings from 'Findings' and 'Impression'. We also plot the decision boundary between the two parts obtained with KMeans. Right: Two report samples, from the overlap and non-overlap areas respectively. Rephrased sentences are highlighted with the same color.
Figure 4: Performance of supervised image linear classification on CheXpert [irvin2019chexpert], semantic segmentation on SIIM [siim], and object detection on RSNA [rsna] with 1% labeled data fine-tuning, while varying $\lambda$ from $0.1$ to $1.0$.
Figure 5: Comparison between the saliency map generated by the vision encoder after VLP and the region of interest identified by radiologists.
