How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions
Bojana Bašaragin, Adela Ljajić, Darija Medvecki, Lorenzo Cassano, Miloš Košprdić, Nikola Milošević
TL;DR
The paper addresses the reliability of biomedical question answering (QA) by proposing a retrieval-augmented generation system grounded in PubMed abstracts, in which each claim in the answer carries a reference. It combines a hybrid lexical-semantic information retrieval (IR) component over PubMed with Mistral-7B-Instruct models fine-tuned using QLoRA on the PQAref dataset, and releases both the adapters and the dataset. Results show that the hybrid IR component outperforms the PubMed search engine on standard retrieval metrics, and that the fine-tuned models approach GPT-4 Turbo in recall and referencing performance while producing fewer hallucinated reference IDs than zero-shot baselines. The work highlights practical implications for privacy-preserving, verifiable biomedical QA and outlines directions for further domain-specific embedding tuning and automated citation evaluation.
Abstract
Large language models (LLMs) have recently become the leading source of answers to users' questions online. Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge. This is especially true for sensitive domains such as biomedicine, where the need for factually correct answers is greater. This paper introduces a biomedical retrieval-augmented generation (RAG) system designed to enhance the reliability of generated responses. The system is based on an LLM fine-tuned for referenced question answering, where relevant abstracts retrieved from PubMed are passed into the LLM's context through a prompt. Its output is an answer based on PubMed abstracts, in which each statement is referenced accordingly, allowing users to verify the answer. Our retrieval system achieves an absolute improvement of 23% compared to the PubMed search engine. Based on a manual evaluation of a small sample, our fine-tuned LLM component achieves results comparable to GPT-4 Turbo in referencing relevant abstracts. We make publicly available the dataset used to fine-tune the models, as well as the fine-tuned models based on Mistral-7B-Instruct-v0.1 and v0.2.
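The flow described above (retrieved abstracts placed in the prompt, an answer whose statements cite those abstracts) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the bracketed-ID citation format, and the helper names `build_prompt`, `cited_ids`, and `hallucinated_ids` are all assumptions made for illustration.

```python
import re


def build_prompt(question, abstracts):
    """Assemble a prompt from retrieved abstracts.

    `abstracts` maps an abstract ID (string) to its text. The wording
    and the "[ID] text" layout are assumptions, not the paper's prompt.
    """
    context = "\n\n".join(f"[{pmid}] {text}" for pmid, text in abstracts.items())
    return (
        "Answer the question using only the abstracts below. "
        "After each statement, cite the supporting abstract by its ID "
        "in square brackets.\n\n"
        f"Abstracts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def cited_ids(answer):
    """Return the set of numeric IDs cited in square brackets."""
    return set(re.findall(r"\[(\d+)\]", answer))


def hallucinated_ids(answer, abstracts):
    """Return cited IDs that were never supplied in the context."""
    return cited_ids(answer) - set(abstracts)
```

A check like `hallucinated_ids` only flags references to abstracts that were not in the context; it does not verify that a cited abstract actually supports the statement, which still requires manual or model-based evaluation.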
