Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking

Yihao Ai, Zhiyuan Ning, Weiwei Dai, Pengfei Wang, Yi Du, Wenjuan Cui, Kunpeng Liu, Yuanchun Zhou

TL;DR

This work tackles biomedical entity linking under resource constraints by addressing the stability and cost issues of closed-source LLMs. It introduces RPDR, a three-step framework that combines retrieval, prompting-based data generation from closed-source LLMs, and distillation into open-source LLMs fine-tuned for re-ranking, which enables local deployment. The approach improves Acc@1 on Chinese and English datasets, with notable gains when training data is scarce, and reduces cost through local inference. Key contributions include redesigning the traditional two-step pipeline, applying knowledge distillation to biomedical entity linking, and validating the method on a real-world dataset and a public dataset in two languages. The findings highlight RPDR’s potential for stable, scalable, and cross-domain biomedical NLP applications that do not depend on continual API access to closed-source models.

Abstract

Biomedical entity linking aims to map nonstandard entities to standard entities in a knowledge base. Traditional supervised methods perform well but require extensive annotated data to transfer to new settings, which limits their use in low-resource scenarios. Large language models (LLMs), especially closed-source LLMs, can address this limitation, but they bring stability risks and high economic costs: access to these models is controlled by commercial companies, and processing large amounts of data is expensive. To address this, we propose "RPDR", a framework that combines closed-source and open-source LLMs to re-rank candidates retrieved by a retriever fine-tuned with a small amount of data. By prompting a closed-source LLM to generate training data from unannotated data and then fine-tuning an open-source LLM for re-ranking, we distill the closed-source model's knowledge into an open-source LLM that can be deployed locally, avoiding both the stability issues and the high economic costs. We evaluate RPDR on two datasets, one real-world and one publicly available, covering two languages: Chinese and English. When training data is insufficient, RPDR improves Acc@1 by 0.019 on the Aier dataset and by 0.036 on the Ask A Patient dataset. The results demonstrate the superiority and generalizability of the proposed framework.
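
The reported metric, Acc@1, is the fraction of mentions whose top-ranked candidate is the correct knowledge-base entity. A minimal sketch of that computation follows; the function name `acc_at_1` and the toy entity names are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def acc_at_1(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of mentions whose first-ranked candidate equals the gold entity.

    `ranked[i]` is the re-ranked candidate list for mention i (best first);
    `gold[i]` is its correct knowledge-base entity.
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    # An empty candidate list counts as a miss.
    hits = sum(1 for cands, g in zip(ranked, gold) if cands and cands[0] == g)
    return hits / len(gold)


# Toy data (hypothetical): two of the three mentions are linked correctly at rank 1.
toy_ranked = [["headache", "migraine"], ["insomnia"], ["nausea", "vomiting"]]
toy_gold = ["headache", "insomnia", "vomiting"]
toy_score = acc_at_1(toy_ranked, toy_gold)
```

On this scale, an improvement of 0.019 corresponds to 1.9 percentage points of mentions linked correctly at rank 1.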

Paper Structure

This paper contains 18 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overall framework of our method. Candidates are generated by a bi-encoder. During training, the candidates are re-ranked by closed-source LLMs and the results are used to train open-source LLMs. At inference time, the candidates are re-ranked by the fine-tuned, specialized LLMs.
  • Figure 2: Illustration of the designed prompt.
  • Figure 3: Illustration of the designed instruction used for fine-tuning BenTsao.
  • Figure 4: Performance of the fine-tuned LLM with different amounts of training data.
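
The training-time data flow described for Figure 1 (retrieve candidates, let a closed-source LLM pick among them, and keep the result as fine-tuning data for an open-source LLM) can be sketched roughly as below. Everything here is an assumption made for illustration: the character-overlap retriever stands in for the bi-encoder, `toy_teacher` stands in for a closed-source LLM call, the prompt wording is hypothetical and not the prompt of Figure 2, and filtering out answers that fall outside the candidate list is an added safeguard rather than a step confirmed by the source.

```python
from typing import Callable, Dict, List


def retrieve_candidates(mention: str, kb: List[str], k: int = 3) -> List[str]:
    """Stand-in for the bi-encoder: rank KB entries by character-set Jaccard overlap."""
    def score(entity: str) -> float:
        a, b = set(mention.lower()), set(entity.lower())
        return len(a & b) / len(a | b) if (a | b) else 0.0
    return sorted(kb, key=score, reverse=True)[:k]


def build_prompt(mention: str, candidates: List[str]) -> str:
    """Hypothetical re-ranking prompt listing the retrieved candidates."""
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return f"Mention: {mention}\nCandidates:\n{options}\nAnswer with the best candidate."


def build_distillation_set(
    mentions: List[str],
    kb: List[str],
    teacher: Callable[[str, List[str]], str],
    k: int = 3,
) -> List[Dict[str, str]]:
    """Turn unannotated mentions into instruction/output pairs labeled by the teacher."""
    records = []
    for mention in mentions:
        candidates = retrieve_candidates(mention, kb, k)
        prompt = build_prompt(mention, candidates)
        answer = teacher(prompt, candidates)
        # Assumed safeguard: drop teacher answers that are not among the candidates.
        if answer in candidates:
            records.append({"instruction": prompt, "output": answer})
    return records


def toy_teacher(prompt: str, candidates: List[str]) -> str:
    """Placeholder for a closed-source LLM; it simply returns the first candidate."""
    return candidates[0]


toy_kb = ["myocardial infarction", "headache", "insomnia"]
records = build_distillation_set(["head ache"], toy_kb, toy_teacher, k=2)
```

The resulting records would then serve as supervised fine-tuning data for the open-source re-ranker. Passing the teacher in as a callable keeps the sketch independent of any particular vendor API.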