K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging

Jiajun Zeng, Shadi Albarqouni

TL;DR

K-MaT (Knowledge-Anchored Manifold Transport) is a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images, and it mitigates the catastrophic forgetting seen in standard methods like CoOp.

Abstract

Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp's 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
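
To make the headline metric concrete, here is a minimal sketch of a harmonic mean over two accuracies. It assumes the "harmonic mean of accuracy" combines high-end and low-end accuracy for a task, which is the usual convention in prompt-learning benchmarks but is not spelled out in the abstract. The function name and the input values are hypothetical and are not results from the paper.

```python
def harmonic_mean(acc_high: float, acc_low: float) -> float:
    """Harmonic mean of two accuracies given in percent.

    Assumption: the two inputs are the high-end and low-end accuracies
    of one cross-modal task. Returns 0.0 if either accuracy is not positive.
    """
    if acc_high <= 0.0 or acc_low <= 0.0:
        return 0.0
    return 2.0 * acc_high * acc_low / (acc_high + acc_low)


# Hypothetical, illustrative values only (not taken from the paper).
balanced = harmonic_mean(60.0, 60.0)   # both modalities perform equally
collapsed = harmonic_mean(90.0, 30.0)  # strong high-end, weak low-end

print(f"balanced:  {balanced:.1f}")
print(f"collapsed: {collapsed:.1f}")
```

Both hypothetical pairs have the same arithmetic mean, but the harmonic mean penalizes the pair with a weak low-end score. This is why the metric suits a setting where a model can score well on the high-end modality while forgetting the low-end one.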

Paper Structure

This paper contains 9 sections, 5 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Overall architecture of the proposed K-MaT framework. During training, high-end textual prompts are optimized using high-end imaging data ($\mathcal{L}_{ce}$), while prompts from both modalities are anchored to LLM-generated clinical descriptions ($\mathcal{L}_{anc}$). The low-end prompt manifold is strictly aligned to the visually grounded high-end space via $\mathcal{L}_{fgw}$. During inference, the frozen visual encoder extracts features from unseen low-end images, and predictions are computed via visual-textual similarity with the learned low-end embeddings, bypassing low-end visual training entirely.
  • Figure 2: (a) Visualization of textual embeddings with and without $\mathcal{L}_{fgw}$ on the Chest cross-modal generalization task. (b) Per-class breakdown of accuracy.
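
The inference step described in the Figure 1 caption (frozen visual features compared against learned low-end text embeddings) can be sketched as a cosine-similarity classifier. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `predict`, the `temperature` value, and the toy feature arrays are hypothetical, and plain NumPy arrays stand in for the outputs of the frozen visual encoder and the learned prompts.

```python
import numpy as np


def predict(image_features, text_embeddings, temperature=0.01):
    """Classify images by cosine similarity to per-class text embeddings.

    image_features:  (n_images, d) array, assumed to come from a frozen encoder.
    text_embeddings: (n_classes, d) array, assumed to be learned low-end prompts.
    temperature:     hypothetical softmax temperature (an assumption).
    Returns (predicted class indices, class probabilities).
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)

    logits = img @ txt.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs


# Toy example: three classes in a 3-dimensional embedding space.
toy_text = np.eye(3)
toy_images = np.array([[0.9, 0.1, 0.0],
                       [0.0, 0.2, 0.8]])
preds, probs = predict(toy_images, toy_text)
print(preds)
```

No low-end images are involved in producing `text_embeddings` in this sketch, which mirrors the caption's point that prediction only needs the frozen visual encoder and the learned low-end text side.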