CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology
Richard J. Young, Alice M. Matthews
TL;DR
CardioEmbed shows that embeddings trained on comprehensive clinical textbooks substantially improve cardiology-focused semantic retrieval. The model fine-tunes a strong foundation model (Qwen3-Embedding-8B) on ~150,000 cardiology textbook sentences using contrastive learning with an InfoNCE loss, EOS pooling, and LoRA for parameter-efficient training. It achieves near-perfect cardiology retrieval (Acc@1 of 99.60%), substantially outperforming PubMed-centric baselines, while remaining competitive on broader biomedical benchmarks (MTEB BIOSSES 0.7748, SciFact 0.609). The results highlight the value of deep domain knowledge for specialized clinical applications and suggest that textbook-based specialization can meaningfully improve clinical information retrieval and decision-support tasks, though real-world deployment requires further validation and integration with clinical reasoning systems.
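As a rough illustration of the EOS pooling mentioned above, the sketch below takes the hidden state of the last non-padding token as the sentence embedding and L2-normalizes it. It is a minimal pure-Python sketch, not the authors' implementation: the function name `eos_pool`, the list-based inputs, and the assumption that the EOS token is the final non-padding position are all illustrative.

```python
import math

def eos_pool(hidden_states, attention_mask):
    """Return the L2-normalized hidden state of the last non-padding token.

    hidden_states: list of per-token vectors (seq_len x dim) for one sequence.
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    Assumption: the EOS token is the last position where the mask is 1.
    """
    # Locate the last real token; this works for left- or right-padded input.
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    vec = hidden_states[last]
    norm = math.sqrt(sum(x * x for x in vec))
    # Guard against a zero vector so we never divide by zero.
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Toy usage: three token positions, the last one is padding.
embedding = eos_pool([[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

Normalizing the pooled vector means that a plain dot product between two embeddings equals their cosine similarity, which is convenient for retrieval.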
Abstract
Biomedical text embeddings have primarily been developed using research literature from PubMed, yet clinical cardiology practice relies heavily on procedural knowledge and specialized terminology found in comprehensive textbooks rather than research abstracts. This research-practice gap limits the effectiveness of existing embedding models for clinical applications in cardiology. This study trained CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B, using contrastive learning on a curated corpus of seven comprehensive cardiology textbooks totaling approximately 150,000 sentences after deduplication. The model is trained with an InfoNCE loss using in-batch negatives and achieves 99.60% Acc@1 on cardiac-specific semantic retrieval tasks, a 15.94 percentage point improvement over MedTE, the current state-of-the-art medical embedding model. On MTEB medical benchmarks, the model obtained a Spearman correlation of 0.77 on BIOSSES and an NDCG@10 of 0.61 on SciFact, indicating competitive performance on related biomedical domains. These results indicate that domain-specialized training on comprehensive clinical textbooks can yield near-perfect cardiology retrieval.
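To make the training objective and the headline metric concrete, the following sketch computes an InfoNCE loss with in-batch negatives and an Acc@1 score over toy vectors. It is a hedged illustration under stated assumptions rather than the paper's code: the helper names (`cosine`, `info_nce`, `acc_at_1`), the use of cosine similarity, and the temperature value of 0.05 are hypothetical choices not given in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss where positives[i] matches anchors[i].

    Every other positive in the batch serves as a negative for anchor i.
    The temperature default is an assumed value, not taken from the paper.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # equals -log softmax at index i
    return total / n

def acc_at_1(queries, candidates):
    """Fraction of queries whose most similar candidate has the same index."""
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, c) for c in candidates]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(queries)

# Toy batch: each anchor is identical to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
loss = info_nce(anchors, positives)
accuracy = acc_at_1(anchors, positives)
```

With in-batch negatives, a batch of N pairs supplies N - 1 negatives per anchor without any extra encoding passes, which is one reason this formulation is common for embedding fine-tuning.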
