
RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability

Jonggwon Park, Byungmu Yoon, Soobum Kim, Kyoyun Choi

TL;DR

RadZero tackles the need for interpretable, zero-shot vision-language alignment in chest X-ray by introducing VL-CABS, a cosine-similarity-based cross-attention that aligns text descriptions with local image patches and yields pixel-level VL similarity maps for grounding and open-vocabulary segmentation. The framework leverages multi-positive contrastive learning and a frozen high-resolution vision encoder with a small trainable Transformer head, while using an LLM to extract concise finding-sentences from radiology reports to create multiple positive pairs. Empirical results on public CXR benchmarks show state-of-the-art zero-shot classification, grounding, and segmentation, with VL similarity maps providing transparent, spatial explanations of model decisions. The work highlights the potential of similarity-based VL reasoning to improve explainability and open-vocabulary medical image understanding, and publicly releases code for reproducibility.

Abstract

Recent advancements in multimodal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in chest X-ray with zero-shot multi-task capability. A key component of our approach is VL-CABS (Vision-Language Cross-Attention Based on Similarity), which aligns text embeddings with local image features for interpretable, fine-grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi-positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, VL-CABS enables zero-shot inference with similarity probability for classification, and pixel-level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL-CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging. Code is available at https://github.com/deepnoid-ai/RadZero.
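To make the core idea concrete, the following is a minimal sketch of how a similarity map between one text embedding and a grid of local patch features could be computed with cosine similarity. It is not the released RadZero code: the function names `cosine` and `similarity_map`, the 2x2 grid, and the 3-dimensional feature values are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_map(patch_grid, text_embedding):
    """Return one cosine similarity per image patch, keeping the grid layout."""
    return [[cosine(patch, text_embedding) for patch in row] for row in patch_grid]

# Toy 2x2 grid of patch features (made-up numbers, not real encoder outputs).
patch_grid = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]],
]
text_embedding = [2.0, 0.0, 0.0]  # hypothetical text embedding

sim_map = similarity_map(patch_grid, text_embedding)
for row in sim_map:
    print(row)
```

Because cosine similarity ignores vector length, every entry lies in [-1, 1] regardless of the prompt, which is what gives the map a fixed scale. In a real pipeline the patch-level map would presumably be upsampled to image resolution to obtain pixel-level values; that step is omitted here.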

Paper Structure

This paper contains 47 sections, 2 equations, 10 figures, 20 tables.

Figures (10)

  • Figure 1: Comparison of attention maps and the proposed VL similarity map for visualizing VL alignment. (a) While traditional attention maps inevitably exhibit high values at certain points due to softmax activation, the proposed VL similarity maps yield low values for unrelated image-text pairs. (b) Their fixed scale, originating from cosine similarity, enables open-vocabulary semantic segmentation through simple thresholding.
  • Figure 2: Zero-shot multi-task performance. Each score is averaged over multiple datasets per task.
  • Figure 3: The overall framework of RadZero. (a) Finding-sentence extraction using an LLM. (b) Computation of the similarity logit, $l_{ij}^n$, between image $I_{i}$ and finding-sentence $S_{j}^n$. W-sum and cos-sim denote weighted sum and cosine similarity, respectively. (c) Computation of MP-NCE loss ($\mathcal{L}_I$) and InfoNCE loss ($\mathcal{L}_T$) from the similarity logit matrix. (d) Zero-shot inference pipeline.
  • Figure 4: VL similarity maps of CXR images from CXD10, representing (a) normal, (b) fibrosis, and (c) effusion in the right lung. The value at the top-right corner represents the similarity probability $\hat{l}$ between each CXR image and the text prompt (bottom-right corner).
  • Figure 5: Open-vocabulary semantic segmentation results: (a), (b) for findings and (c) for anatomical regions. The CXR images and bounding box labels are from CXD10. The segmentation thresholds were set to 0.7 for (a) and (b), and 0.4 for (c).
  • ...and 5 more figures
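
The contrast described in the Figure 1 caption can be illustrated with a small numeric sketch. It assumes hypothetical per-patch similarity values and an arbitrary scaling factor of 100 for the attention logits; neither comes from the paper, and the 0.5 threshold is likewise illustrative rather than one of the thresholds reported for Figure 5.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the outputs always sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_mask(similarities, threshold):
    """Binary mask: 1 where the similarity reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in similarities]

# Hypothetical per-patch cosine similarities for two text prompts.
related = [0.9, 0.8, 0.1, 0.0]        # prompt matches the first two patches
unrelated = [0.05, 0.02, 0.01, 0.03]  # prompt matches nothing in the image

# Softmax over scaled scores (scale of 100 is an arbitrary assumption).
attn_unrelated = softmax([s * 100 for s in unrelated])

print(max(attn_unrelated))             # one patch still receives most of the attention
print(max(unrelated))                  # raw similarities stay uniformly low
print(threshold_mask(related, 0.5))    # [1, 1, 0, 0]
print(threshold_mask(unrelated, 0.5))  # [0, 0, 0, 0]
```

Softmax normalizes across patches, so some patch must receive a large share of the attention even when the text is unrelated to the image. The raw similarities carry no such constraint, so a single fixed threshold can yield an empty mask for an unrelated prompt.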