vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs
Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long
TL;DR
This work tackles semantic misalignment when prompting biomedical vision-language models by aligning LLM-derived priors with CLIP-based embeddings on a unified hyperspherical manifold. It introduces vMFCoOp, which inversely estimates von Mises-Fisher (vMF) distributions to create Unified Semantic Anchors and employs three complementary losses (Semantic Anchor Loss, Spherical Contrastive Loss, and Symmetric Cross-Entropy Loss) to steer prompts and image features toward multimodal equilibrium. Across 14 biomedical datasets, 12 imaging modalities, and 13 anatomical regions, vMFCoOp consistently improves few-shot accuracy and base-to-novel generalization, and its saliency maps offer improved interpretability. The method is model-agnostic and scales to evolving foundation models; its robust performance and ablation results suggest potential for clinical deployment and downstream domain adaptation. Future work will extend the framework to more tasks and to natural images, deepen the theoretical analysis, and broaden clinical applicability.
Abstract
Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. We plan to extend this work to more downstream applications, and the corresponding resources are intended to be shared at https://github.com/VinyehShaw/UniEqui.
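To make the idea of a vMF distribution on the hypersphere concrete, the following is a minimal sketch of fitting a mean direction and concentration to a set of embeddings. It uses the standard closed-form approximation for the concentration, kappa ≈ r(d - r^2)/(1 - r^2), where r is the mean resultant length and d the embedding dimension. This is an illustration only and is not claimed to be the estimator used by vMFCoOp; the function names `fit_vmf` and `anchor_similarity` and the array shapes are assumptions made for this example.

```python
from typing import Tuple

import numpy as np


def fit_vmf(embeddings: np.ndarray) -> Tuple[np.ndarray, float]:
    """Fit a von Mises-Fisher distribution to row-vector embeddings.

    Hypothetical helper for illustration; returns (mu, kappa).
    """
    # Project every embedding onto the unit hypersphere.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    d = x.shape[1]

    # The mean resultant vector gives both the direction and the spread.
    resultant = x.mean(axis=0)
    r_bar = float(np.linalg.norm(resultant))
    mu = resultant / r_bar

    # Closed-form approximation of the concentration parameter.
    # The guard avoids division by zero when all points coincide.
    kappa = r_bar * (d - r_bar**2) / max(1.0 - r_bar**2, 1e-12)
    return mu, kappa


def anchor_similarity(features: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Cosine similarity between each feature row and a mean direction."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return f @ mu
```

In this sketch a tightly clustered set of embeddings yields a large kappa and a widely scattered set yields a small one, so the fitted direction `mu` could serve as a stand-in for a class-level anchor against which features are compared.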
