DisEmbed: Transforming Disease Understanding through Embeddings
Salman Faroz
TL;DR
DisEmbed targets disease-specific understanding by addressing the limitations of broad medical embeddings in capturing nuanced disease–symptom relationships. It trains a disease-focused embedding model on a synthetic dataset of disease descriptions, symptoms, and disease-related QA pairs, using anchor–positive pairs and the Multiple Negatives Ranking Loss (MNRL). Evaluations on disease-centric benchmarks with triplet-based metrics show that DisEmbed outperforms larger medical-domain models in identifying disease contexts and distinguishing similar diseases, with strong performance in retrieval-oriented tasks such as retrieval-augmented generation (RAG). The work highlights the practical value of compact, disease-focused embeddings for clinical retrieval and decision support, while acknowledging dataset biases and scope limitations. Future work will expand disease coverage and balance it with general medical knowledge; public resources are available on Hugging Face.
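To make the training objective concrete, here is a minimal sketch of the Multiple Negatives Ranking Loss over a batch of anchor–positive pairs. It is not the training code used for DisEmbed. The function names, the toy 2-dimensional vectors, and the `scale` value of 20 are assumptions chosen for illustration. For each anchor, its own positive is treated as the correct match and every other positive in the batch serves as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchor i should rank positive i first.

    `scale` multiplies the cosine scores before the softmax; 20.0 is an
    assumed value for this sketch, not one reported in the paper.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Score the anchor against every positive in the batch.
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability of the matching positive.
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two anchors and their positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each positive matches its anchor
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # positives swapped

loss_aligned = mnrl_loss(anchors, aligned)
loss_misaligned = mnrl_loss(anchors, misaligned)
```

In this toy batch the aligned pairs give a loss near zero and the swapped pairs give a much larger loss, which is the behavior the objective rewards during training.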
Abstract
The medical domain is vast and diverse, with many existing embedding models focused on general healthcare applications. However, these models often struggle to capture a deep understanding of diseases due to their broad generalization across the entire medical field. To address this gap, I present DisEmbed, a disease-focused embedding model. DisEmbed is trained on a synthetic dataset specifically curated to include disease descriptions, symptoms, and disease-related Q&A pairs, making it uniquely suited for disease-related tasks. For evaluation, I benchmarked DisEmbed against existing medical models using disease-specific datasets and the triplet evaluation method. My results demonstrate that DisEmbed outperforms other models, particularly in identifying disease-related contexts and distinguishing between similar diseases. This makes DisEmbed highly valuable for disease-specific use cases, including retrieval-augmented generation (RAG) tasks, where its performance is particularly robust.
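As an illustration of the triplet evaluation method, the sketch below computes triplet accuracy: the fraction of (anchor, positive, negative) triplets in which the anchor is closer to the positive than to the negative. It assumes cosine similarity as the comparison and uses hypothetical 2-dimensional vectors in place of real model embeddings; it is not the evaluation script behind the reported results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)


# Hypothetical embeddings: the first triplet is ranked correctly,
# the second is not (its "positive" is orthogonal to the anchor).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),
]

accuracy = triplet_accuracy(toy_triplets)
```

In a disease-focused setting, the anchor might be a symptom description, the positive the matching disease, and the negative a similar but different disease, so a higher accuracy indicates better separation of closely related conditions.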
