Enhancing Medical Cross-Modal Hashing Retrieval using Dropout-Voting Mixture-of-Experts Fusion
Jaewon Ahn, Woosung Jang, Beakcheol Jang
TL;DR
The paper tackles scalable cross-modal medical retrieval by introducing MCMFH, a CLIP-based hashing framework that fuses medical image and text modalities through a frozen Dropout Voting MLP and a Mixture-of-Experts Fusion Transformer guided by a hybrid gating loss. It maps medical data via BiomedCLIP embeddings, produces compact 16-bit hash codes through a hashing MLP, and optimizes a fusion/hash objective with fusion and contrastive losses. Empirical results on Open-i and ROCO non-radiology show that MCMFH surpasses state-of-the-art CLIP-based CMHR models at 16-bit codes, with substantial gains in mAP and improved retrieval efficiency. Ablation studies confirm the contribution of MoE, dropout voting, and hybrid gating loss to robustness and performance, underscoring the method’s practicality for clinical deployment with limited memory and compute.
Abstract
In recent years, cross-modal retrieval using images and text has become an active area of research, especially in the medical domain. The abundance of multimodal data in this field has made cross-modal retrieval increasingly important for efficient image interpretation, data-driven diagnostic support, and medical education. As distributed medical data are increasingly integrated across healthcare facilities to improve interoperability, and as data volumes in contemporary medical practice surge, retrieval systems must be optimized for speed, memory efficiency, and accuracy. In this study, we propose a novel framework that incorporates dropout voting and mixture-of-experts (MoE) based contrastive fusion modules into a CLIP-based cross-modal hashing retrieval structure. We also apply a hybrid loss. We call the resulting model MCMFH, for medical cross-modal fusion hashing retrieval. Our method achieves high accuracy and fast retrieval speed simultaneously in low-memory environments. We evaluate the model through experiments on radiological and non-radiological medical datasets.
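To make the speed and memory argument concrete, the following is a minimal sketch of how retrieval over 16-bit hash codes can work in general. It is not the MCMFH implementation: the function names (`binarize`, `pack_codes`, `hamming_distances`, `retrieve`) are hypothetical, sign thresholding is an assumed binarization rule, and the inputs stand in for the real-valued outputs of a hashing MLP.

```python
import numpy as np


def binarize(features: np.ndarray) -> np.ndarray:
    """Turn real-valued outputs into {0, 1} bits by sign thresholding (assumed rule)."""
    return (features > 0).astype(np.uint8)


def pack_codes(bits: np.ndarray) -> np.ndarray:
    """Pack each row of bits into bytes; a 16-bit code occupies 2 bytes."""
    return np.packbits(np.atleast_2d(bits), axis=1)


def hamming_distances(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Count differing bits between one packed query code and every database code."""
    xor = np.bitwise_xor(db_codes, query_code)   # differing bits, shape (N, n_bytes)
    return np.unpackbits(xor, axis=1).sum(axis=1)


def retrieve(query_features: np.ndarray, db_features: np.ndarray, top_k: int) -> np.ndarray:
    """Return indices of the top_k database items closest to the query in Hamming space."""
    query_code = pack_codes(binarize(query_features))
    db_codes = pack_codes(binarize(db_features))
    distances = hamming_distances(query_code, db_codes)
    # A stable sort keeps the original order among equally distant items.
    return np.argsort(distances, kind="stable")[:top_k]
```

Under these assumptions, each database item costs 2 bytes of storage, and a query reduces to XOR and bit counting, which is why short hash codes suit low-memory settings. In a cross-modal setup, `query_features` could come from the text branch and `db_features` from the image branch, or the reverse.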
