DeViDe: Faceted medical knowledge for improved medical vision-language pre-training

Haozhe Luo, Ziyu Zhou, Corentin Royer, Anjany Sekuboyina, Bjoern Menze

TL;DR

DeViDe tackles the limited integration of medical knowledge in vision-language pre-training for chest X-rays by aggregating knowledge from three sources (radiographic descriptions from Radiopaedia, medical definitions, and preprocessed radiology reports) and aligning it with images at multiple granularities. The framework encodes images with ViT-B and textual knowledge with Med-KEBERT, and uses a knowledge retrieval module and cross-view fusion. It is optimized with the $\mathcal{L}_{itc}$, $\mathcal{L}_{tnc}$, and $\mathcal{L}_{pta}$ losses, plus a $\mathcal{L}_{bce}$ term when applicable: $\mathcal{L} = \mathcal{L}_{bce} + \mathcal{L}_{itc} + \mathcal{L}_{tnc} + \alpha\mathcal{L}_{pta}$. The model is pretrained on MIMIC-CXRv2 and evaluated in zero-shot and fine-tuning regimes, achieving state-of-the-art results on several large-scale datasets and strong segmentation performance across diverse distributions, with notable gains on detailed radiographic findings. Qualitative analyses show precise visual grounding and sentence-level attention that reflect the correspondence between descriptions and image regions, which supports interpretability. Overall, DeViDe demonstrates that leveraging multi-granularity medical knowledge during pre-training substantially improves open-world disease detection and data efficiency in downstream tasks.
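The combined objective stated in the TL;DR can be illustrated with a minimal Python sketch. It only shows how the four terms are summed and how the $\mathcal{L}_{bce}$ term is skipped when it does not apply. The function name `total_loss` and its argument names are hypothetical, plain floats stand in for the tensors a training loop would use, and the value of $\alpha$ is not given in this summary, so the caller must supply it.

```python
from typing import Optional


def total_loss(
    l_itc: float,
    l_tnc: float,
    l_pta: float,
    *,
    alpha: float,
    l_bce: Optional[float] = None,
) -> float:
    """Combine the loss terms as L = L_bce + L_itc + L_tnc + alpha * L_pta.

    `alpha` weights only the PTA term. `l_bce` is optional because the
    summary says the BCE term is added only "when applicable"; passing
    None leaves it out of the sum.
    """
    total = l_itc + l_tnc + alpha * l_pta
    if l_bce is not None:
        total += l_bce
    return total


if __name__ == "__main__":
    # Placeholder values chosen only to make the arithmetic easy to follow.
    print(total_loss(1.0, 2.0, 4.0, alpha=0.5))              # no BCE term
    print(total_loss(1.0, 2.0, 4.0, alpha=0.5, l_bce=0.25))  # with BCE term
```

Making `alpha` a required keyword argument is a deliberate choice in this sketch: it avoids suggesting a default weight that the source does not state.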

Abstract

Vision-language pre-training for chest X-rays has made significant strides, primarily by utilizing paired radiographs and radiology reports. However, existing approaches often face challenges in encoding medical knowledge effectively. While radiology reports provide insights into the current disease manifestation, medical definitions (as used by contemporary methods) tend to be overly abstract, creating a gap in knowledge. To address this, we propose DeViDe, a novel transformer-based method that leverages radiographic descriptions from the open web. These descriptions outline general visual characteristics of diseases in radiographs, and when combined with abstract definitions and radiology reports, provide a holistic snapshot of knowledge. DeViDe incorporates three key features for knowledge-augmented vision language alignment: First, a large-language model-based augmentation is employed to homogenise medical knowledge from diverse sources. Second, this knowledge is aligned with image information at various levels of granularity. Third, a novel projection layer is proposed to handle the complexity of aligning each image with multiple descriptions arising in a multi-label setting. In zero-shot settings, DeViDe performs comparably to fully supervised models on external datasets and achieves state-of-the-art results on three large-scale datasets. Additionally, fine-tuning DeViDe on four downstream tasks and six segmentation tasks showcases its superior performance across data from diverse distributions.

Paper Structure (32 sections, 9 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Our knowledge processing pipeline involves preprocessing reports into entities and observations. Entities are queried in the Radiopaedia database for visual descriptions; if none exist, we generate a synthetic description using a Large Language Model (LLM) based on their definitions in a few-shot manner.
  • Figure 2: The proposed DeViDe framework includes: encoding the image using a visual encoder (Sec. \ref{sec:image_enc}), entity knowledge and visual descriptors with a knowledge encoder (Sec. \ref{sec:knowledge}), capturing image-level entities-to-image correspondence using ITC loss, and employing Transformer-based Fusion layers for fine-grained alignment between image patches and visual descriptor tokens using the TNC and PTA losses (Sec. \ref{sec:losses}).
  • Figure 3: The radar chart showcases our method's AUC scores for seven PadChest diseases, demonstrating superior performance compared to previous state-of-the-art. Notably, our approach surpasses CheXNet's fully-supervised performance despite being evaluated in a zero-shot setting.
  • Figure 4: In the PadChest dataset, we evaluated radiographic findings deemed high-importance by a radiologist, each with a sample size exceeding 50, presenting their mean AUC and 95% confidence intervals (CI). To ensure generalization, we externally validated our model on a subset of 39,053 human-annotated chest X-rays from PadChest, with no labeled samples from this set used during training. DeViDe achieved an AUC of at least 0.900 for eight findings and at least 0.700 for 43 out of 57 findings.
  • Figure 5: Evaluation of segmentation ability across organs and few-shot segmentation under full fine-tuning setting. (a) The bar chart compares Dice coefficients of different training approaches for lung, heart, and clavicle segmentation using the JSRT dataset. Our method significantly outperforms others, with the clavicle segmentation task gaining the most. (b) The line graph assesses data efficiency in few-shot learning, focusing on the JSRT clavicle dataset. DeViDe achieves a Dice coefficient close to full dataset performance with just 17 training samples, highlighting its few-shot learning capability.
  • ...and 3 more figures
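
The lookup-with-fallback step described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the dictionary-based stand-in for the Radiopaedia database, the prompt layout, and the `generate` callable (any function that maps a prompt string to generated text) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (entity, definition, radiographic description) triples used as few-shot examples.
FewShotExample = Tuple[str, str, str]


def build_few_shot_prompt(
    entity: str, definition: str, examples: List[FewShotExample]
) -> str:
    """Assemble a few-shot prompt; the layout here is an assumption."""
    parts = []
    for ex_entity, ex_definition, ex_description in examples:
        parts.append(
            f"Entity: {ex_entity}\n"
            f"Definition: {ex_definition}\n"
            f"Radiographic description: {ex_description}\n"
        )
    # The final block is left open so the model completes the description.
    parts.append(
        f"Entity: {entity}\n"
        f"Definition: {definition}\n"
        f"Radiographic description:"
    )
    return "\n".join(parts)


def describe_entity(
    entity: str,
    descriptions: Dict[str, str],
    definitions: Dict[str, str],
    examples: List[FewShotExample],
    generate: Callable[[str], str],
) -> str:
    """Return a stored description if one exists, else a generated one."""
    found = descriptions.get(entity)
    if found:
        return found
    # No stored description: fall back to generation from the definition.
    prompt = build_few_shot_prompt(entity, definitions[entity], examples)
    return generate(prompt)


if __name__ == "__main__":
    # Placeholder data; the strings carry no medical content.
    stored = {"entity_a": "<stored description of entity_a>"}
    defs = {"entity_a": "<definition a>", "entity_b": "<definition b>"}
    shots = [("entity_c", "<definition c>", "<description c>")]
    echo = lambda prompt: "<generated description>"
    print(describe_entity("entity_a", stored, defs, shots, echo))
    print(describe_entity("entity_b", stored, defs, shots, echo))
```

Passing the generator in as a callable keeps the sketch independent of any particular LLM client, which the summary does not name.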