Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking
Yihao Ai, Zhiyuan Ning, Weiwei Dai, Pengfei Wang, Yi Du, Wenjuan Cui, Kunpeng Liu, Yuanchun Zhou
TL;DR
This work tackles biomedical entity linking under resource constraints by addressing stability and cost issues of closed-source LLMs. It introduces RPDR, a three-step framework that combines retrieval, prompting-based data generation from closed LLMs, and distillation to fine-tune open-source LLMs for re-ranking, enabling local deployment. The approach demonstrates improved Acc@1 on Chinese and English datasets, with notable gains when training data is scarce, and shows meaningful cost savings through local inference. Key contributions include redesigning the traditional two-step pipeline, applying knowledge distillation to biomedical entity linking, and validating the method across real-world and public multilingual datasets. The findings highlight RPDR’s potential for stable, scalable, and cross-domain biomedical NLP applications without relying on continual API access to closed models.
Abstract
Biomedical entity linking aims to map nonstandard entity mentions to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer to new settings, which limits their use in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address this limitation, but they raise stability and cost concerns: access to these models is controlled by commercial providers, and processing large amounts of data incurs significant expense. To address this, we propose "RPDR", a framework that combines closed-source and open-source LLMs to re-rank candidates retrieved by a retriever fine-tuned on a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and then fine-tuning an open-source LLM for re-ranking, we distill the closed-source model's knowledge into an open-source LLM that can be deployed locally, avoiding both the stability issues and the high costs. We evaluate RPDR on two datasets, one real-world and one publicly available, covering two languages: Chinese and English. When training data is insufficient, RPDR improves Acc@1 by 0.019 on the Aier dataset and by 0.036 on the Ask A Patient dataset. The results demonstrate the superiority and generalizability of the proposed framework.
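To make the data flow concrete, the following is a minimal Python sketch of the retrieve, prompt, and distill steps, plus the Acc@1 metric. It is an illustration under stated assumptions, not the authors' implementation: the token-overlap retriever stands in for the fine-tuned retriever, the prompt wording is invented, `teacher` is a hypothetical callable wrapping whatever closed-source LLM is used, and the fine-tuning of the open-source re-ranker on the resulting records is not shown.

```python
from typing import Callable, Dict, List, Optional


def retrieve(mention: str, kb: List[str], k: int = 3) -> List[str]:
    """Toy retriever: rank KB entities by Jaccard token overlap.

    Assumption: a stand-in for the fine-tuned retriever described above.
    """
    m_tokens = set(mention.lower().split())

    def score(entity: str) -> float:
        e_tokens = set(entity.lower().split())
        union = m_tokens | e_tokens
        return len(m_tokens & e_tokens) / len(union) if union else 0.0

    # sorted() is stable, so tied entities keep their KB order.
    return sorted(kb, key=score, reverse=True)[:k]


def build_prompt(mention: str, candidates: List[str]) -> str:
    """Hypothetical re-ranking prompt; the real wording is not given in the source."""
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f"Mention: {mention}\nCandidates:\n{options}\n"
        "Answer with the number of the best matching standard entity."
    )


def make_distillation_example(
    mention: str,
    kb: List[str],
    teacher: Callable[[str], str],
    k: int = 3,
) -> Optional[Dict[str, str]]:
    """Label one unannotated mention with the teacher LLM.

    Returns a prompt/completion record for fine-tuning a student model,
    or None if the teacher's answer cannot be parsed as a valid choice.
    """
    candidates = retrieve(mention, kb, k)
    prompt = build_prompt(mention, candidates)
    try:
        choice = int(teacher(prompt).strip())
    except ValueError:
        return None
    if not 1 <= choice <= len(candidates):
        return None
    return {"prompt": prompt, "completion": candidates[choice - 1]}


def acc_at_1(predictions: List[str], gold: List[str]) -> float:
    """Fraction of mentions whose top-ranked entity equals the gold entity."""
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

In use, `teacher` would send the prompt to the closed-source model once per unannotated mention; after the student is fine-tuned on the collected records, inference runs entirely on the local model.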
