Biomedical Entity Linking as Multiple Choice Question Answering
Zhenxi Lin, Ziheng Zhang, Xian Wu, Yefeng Zheng
TL;DR
BioELQA addresses fine-grained and long-tailed biomedical entity linking (BioEL) by reframing the task as Multiple Choice Question Answering. A bi-encoder retriever proposes the top-$N$ candidate entities, a generator reads a retrieval-enhanced multiple-choice prompt (MCP) and outputs the symbol of its chosen candidate, and a $k$NN module supplies similar labeled training instances as contextual clues. On NCBI, BC5CDR, and COMETA, the model reports state-of-the-art accuracy, and ablations confirm that both data augmentation and the retrieval memory contribute. Because the candidates are presented together, the approach models mention-entity and entity-entity interactions explicitly, and it is more robust on long-tailed and morphologically similar entities, without relying on external synonym corpora. As future work, the authors suggest incorporating contextual disambiguation into the retrieval-augmented framework.
Abstract
Although biomedical entity linking (BioEL) has made significant progress with pre-trained language models, challenges still exist for fine-grained and long-tailed entities. To address these challenges, we present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering. BioELQA first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity. This formulation enables explicit comparison of different candidate entities, thus capturing fine-grained interactions between mentions and entities, as well as among entities themselves. To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and concatenate the input with retrieved instances for the generator. Extensive experimental results show that BioELQA outperforms state-of-the-art baselines on several datasets.
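The pipeline described above (retrieve candidates, present them as lettered options, read back the chosen symbol) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy three-dimensional vectors stand in for bi-encoder embeddings, the prompt wording is not the paper's template, the helper names `cosine`, `top_n`, and `build_prompt` are hypothetical, and the generator is replaced by a hard-coded symbol.

```python
import math
import string


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n(query_vec, items, n):
    """Return the labels of the n items most similar to query_vec.

    items is a list of (label, vector) pairs.
    """
    ranked = sorted(items, key=lambda it: cosine(query_vec, it[1]), reverse=True)
    return [label for label, _ in ranked[:n]]


def build_prompt(mention, candidates, clues=()):
    """Format a multiple-choice prompt (hypothetical wording, not the paper's).

    Returns the prompt text and a symbol -> entity mapping.
    """
    symbols = string.ascii_uppercase[:len(candidates)]
    lines = [f"Clue: '{m}' refers to '{e}'." for m, e in clues]
    lines.append(f"Question: which entity does the mention '{mention}' refer to?")
    lines += [f"({s}) {c}" for s, c in zip(symbols, candidates)]
    lines.append("Answer:")
    return "\n".join(lines), dict(zip(symbols, candidates))


# Toy entity embeddings (assumed values standing in for a bi-encoder).
entities = [
    ("Type 1 diabetes mellitus", [0.9, 0.1, 0.0]),
    ("Type 2 diabetes mellitus", [0.8, 0.3, 0.0]),
    ("Diabetes insipidus", [0.5, 0.1, 0.6]),
    ("Hypertension", [0.0, 0.2, 0.9]),
]
# Toy memory of labeled training instances: ((mention, entity), embedding).
train_memory = [
    (("adult-onset diabetes", "Type 2 diabetes mellitus"), [0.8, 0.35, 0.0]),
    (("high blood pressure", "Hypertension"), [0.0, 0.1, 0.95]),
]

mention = "T2DM"
mention_vec = [0.85, 0.25, 0.0]  # assumed embedding of the mention

candidates = top_n(mention_vec, entities, n=3)   # retriever step
clues = top_n(mention_vec, train_memory, n=1)    # kNN step over training data
prompt, symbol_to_entity = build_prompt(mention, candidates, clues)

# A real system would pass `prompt` to the generator; here the output
# symbol is hard-coded purely to show the symbol-to-entity lookup.
predicted_symbol = "A"
linked_entity = symbol_to_entity[predicted_symbol]
print(prompt)
print("Linked entity:", linked_entity)
```

Listing the candidates side by side is what lets the generator compare near-identical names such as the two diabetes subtypes directly, rather than scoring each mention-entity pair in isolation.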
