vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long

TL;DR

This work tackles semantic misalignment when prompting biomedical vision-language models by aligning LLM-derived priors with CLIP-based embeddings on a unified hyperspherical manifold. It introduces vMFCoOp, which inversely estimates von Mises–Fisher (vMF) distributions to create Unified Semantic Anchors and employs three complementary losses (Semantic Anchor Loss, Spherical Contrastive Loss, and Symmetric Cross-Entropy Loss) to steer prompts and image features toward multimodal equilibrium. Across 14 biomedical datasets, 12 imaging modalities, and 13 anatomical regions, vMFCoOp consistently improves few-shot accuracy and base-to-novel generalization, and its saliency maps offer improved interpretability. The method is model-agnostic and designed to scale to evolving foundation models; the reported results and ablations suggest potential for clinical deployment and downstream domain adaptation. Future work will extend the framework to more tasks and to natural images, deepen the theoretical analysis, and broaden clinical applicability.

Abstract

Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
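To make the idea of fitting a vMF distribution to embeddings concrete, the following is a minimal sketch of estimating a mean direction and a concentration parameter from a set of vectors. It is not the paper's inverse estimation procedure, which the abstract does not detail; it assumes the widely used closed-form approximation for the concentration, kappa ≈ r(d − r²)/(1 − r²), where r is the length of the mean of the unit-normalized vectors. The function names `normalize` and `estimate_vmf` are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere S^{d-1}."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def estimate_vmf(embeddings):
    """Estimate a vMF mean direction (mu) and concentration (kappa).

    Assumption: kappa uses the standard closed-form approximation
    r * (d - r^2) / (1 - r^2); this is an illustrative choice, not
    necessarily the estimator used by vMFCoOp. Requires r > 0, i.e.
    the unit vectors must not cancel out exactly.
    """
    unit = [normalize(v) for v in embeddings]
    n, d = len(unit), len(unit[0])
    # Average of the unit vectors; its length r lies in [0, 1].
    mean = [sum(u[i] for u in unit) / n for i in range(d)]
    r = math.sqrt(sum(m * m for m in mean))
    mu = [m / r for m in mean]  # mean direction on the sphere
    # Small epsilon guards against division by zero when r is close to 1.
    kappa = r * (d - r * r) / (1.0 - r * r + 1e-12)
    return mu, kappa


# Tightly clustered directions yield a larger kappa than spread-out ones.
tight = [[1.0, 0.01, 0.0], [1.0, -0.01, 0.0], [1.0, 0.0, 0.01]]
loose = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mu_tight, kappa_tight = estimate_vmf(tight)
mu_loose, kappa_loose = estimate_vmf(loose)
```

In this sketch, the mean direction plays the role a semantic anchor might play on the hypersphere, and kappa measures how tightly the embeddings concentrate around it.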

Paper Structure

This paper contains 32 sections, 10 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The vMFCoOp framework combines structured prompts from an arbitrary LLM, learnable context tokens, and any biomedical CLIP variant to model complex medical images under few-shot settings. vMF inverse mapping estimation enables unified semantic alignment on the manifold, guided by three constraints. A. Unified semantic anchor construction and prompt optimization on the hyperspherical manifold. B. Low-dimensional overview of the aligned framework for intuitive understanding: vMFCoOp seeks equilibrium on a unified hyperspherical manifold in two ways. It calibrates priors from heterogeneous, evolving VLM/LLM families through estimation and anchoring, which stabilizes directional semantics, and it performs unified optimization within the non-Euclidean hyperspherical space. Together these reconcile cross-model semantic biases, accommodate fine-grained biomedical variability, and stabilize few-shot clinical adaptation. C. Details of the $\mathcal{L}_{sc}$ loss: decision boundaries evolve from initial equal-angle partitions to large-margin separations as the temperature $\tau$ is annealed, and within-class representations are refined from a broad angular spread into compact clusters. For visualization, vectors are projected onto their $S^1$ subspace (angles preserved), while optimization is performed on the full hypersphere $S^{d-1}$.
  • Figure 2: Effect of prompting variations on saliency maps, where (a)–(e) illustrate different strategies (zoom in for details). Note that the fourth row depicts a rare cardiac cine MRI case in which the patient has a posterior mediastinal tumor, simulating a few-shot fine-tuning scenario with limited data and reflecting real clinical settings. vMFCoOp (ours) successfully localizes the approximate lesion region under such challenging conditions, while other methods, such as BiomedCoOp, tend to focus their attention on the cardiac area. This may occur because they fail to capture the concept of the posterior mediastinum when classifying “Malignant neoplasm of heart, mediastinum, and pleura”, or because their attention is overly biased toward the heart region due to semantic inductive bias, leading to overfitting or semantic misalignment.
  • Figure 3: Few-shot performance comparison under different backbone configurations using 50 LLM-derived prompts. Black dashed line and triangle denote BiomedCoOp.
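
The effect of temperature annealing described in the Figure 1 caption (panel C) can be illustrated with a small sketch. The exact form of $\mathcal{L}_{sc}$ is not given in this summary, so the code below assumes a generic temperature-scaled softmax cross-entropy over cosine similarities between a feature and per-class anchors. The function names and the geometric schedule in `anneal_tau` (including its default endpoints) are hypothetical choices for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity, i.e. the dot product after projecting onto the sphere."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def spherical_contrastive_loss(feature, anchors, label, tau):
    """Cross-entropy over cosine similarities scaled by temperature tau.

    Assumption: this is a generic angular contrastive objective, not the
    paper's exact loss definition.
    """
    logits = [cosine(feature, a) / tau for a in anchors]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def anneal_tau(step, total_steps, tau_start=1.0, tau_end=0.05):
    """Hypothetical geometric schedule from tau_start down to tau_end."""
    frac = step / max(total_steps - 1, 1)
    return tau_start * (tau_end / tau_start) ** frac


# A feature close to class 0's anchor: lowering tau sharpens the softmax,
# so the loss for the correct class shrinks as the temperature is annealed.
feature = [1.0, 0.1]
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_warm = spherical_contrastive_loss(feature, anchors, 0, anneal_tau(0, 10))
loss_cold = spherical_contrastive_loss(feature, anchors, 0, anneal_tau(9, 10))
```

Because only angles enter the loss, a smaller temperature amplifies small angular differences between a feature and competing anchors. Under the assumptions above, this is consistent with the caption's description of boundaries moving toward large-margin separations.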