A Zero-shot Explainable Doctor Ranking Framework with Large Language Models
Ziyang Zeng, Dongyuan Li, Yuqing Yang
TL;DR
The paper tackles the challenge of ranking doctors for specific medical needs in a data-scarce setting by introducing a zero-shot, explainable doctor ranking framework that uses large language models to generate disease-specific ranking criteria and step-by-step rationales. It introduces DrRank, a professionally annotated benchmark with 38 disease-treatment pairs and 4,325 doctor profiles, and demonstrates significant gains over baselines as well as cross-domain generalization on BEIR datasets. The approach employs fine-grained five-level relevance labels, a forced elicitation mechanism to stabilize outputs, and interpretable explanations, with evidence from medical experts supporting reliability and trustworthiness. The work also analyzes fairness, scalability via low-bit quantization, and robustness, underscoring practical potential for trustworthy, real-world doctor recommendations while outlining avenues for real-system integration and domain knowledge augmentation.
Abstract
Online medical services provide patients with convenient access to doctors, but effectively ranking doctors based on specific medical needs remains challenging. Current ranking approaches typically lack the interpretability crucial for patient trust and informed decision-making. Additionally, the scarcity of standardized benchmarks and labeled data for supervised learning impedes progress in expertise-aware doctor ranking. To address these challenges, we propose an explainable doctor ranking framework powered by large language models in a zero-shot setting. Our framework dynamically generates disease-specific ranking criteria to guide the large language model in assessing doctor relevance with transparency and consistency. It further enhances interpretability by generating step-by-step rationales for its ranking decisions, improving the overall explainability of the information retrieval process. To support rigorous evaluation, we build and release DrRank, a novel expertise-driven dataset comprising 38 disease-treatment pairs and 4,325 doctor profiles. On this benchmark, our framework significantly outperforms the strongest baseline by +6.45 NDCG@10. Comprehensive analyses also show that our framework is fair across disease types, patient gender, and geographic regions. Furthermore, verification by medical experts confirms the reliability and interpretability of our approach, reinforcing its potential for trustworthy, real-world doctor recommendation. To demonstrate its broader applicability, we validate our framework on two datasets from the BEIR benchmark, where it again achieves superior performance. The code and associated data are available at: https://github.com/YangLab-BUPT/DrRank.
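For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG@10 can be computed over graded relevance labels such as the five-level labels mentioned in the TL;DR. It is not taken from the DrRank code base: the function names, the 0 to 4 integer encoding of the five levels, and the exponential gain `2^rel - 1` are assumptions made for illustration, and the paper's evaluation script may use a different gain or tie-handling convention.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k graded labels.

    Assumes exponential gain (2^rel - 1) and a log2 rank discount;
    both are common conventions, not confirmed by the paper.
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so +2
        for rank, rel in enumerate(labels[:k])
    )


def ndcg_at_k(ranked_labels: Sequence[int], k: int = 10) -> float:
    """NDCG@k for one query, given labels in the order the system ranked them.

    Labels are assumed to be integers in 0..4 (hypothetical encoding of
    five relevance levels). Returns a value in [0, 1].
    """
    ideal_order = sorted(ranked_labels, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant doctors for this query
    return dcg_at_k(ranked_labels, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical ranking of five doctor profiles for one disease-treatment pair.
    system_ranking = [3, 4, 0, 2, 1]
    print(f"NDCG@10 = {ndcg_at_k(system_ranking, k=10):.4f}")
```

This sketch returns scores in [0, 1]. A gain such as +6.45 is most naturally read as a difference on a 0 to 100 scale, that is, after multiplying by 100 and averaging over queries, though that reading is an inference from common practice rather than something stated here.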
