Using Pretrained Large Language Model with Prompt Engineering to Answer Biomedical Questions

Wenxin Zhou; Thuy Hang Ngo

TL;DR

The paper tackles scalable biomedical QA by combining retrieval and answer generation with pretrained LLMs, guided by prompt engineering and response post-processing. It proposes a two-level IR pipeline (query construction, reranking, snippet extraction) and a context-rich QA module that uses 1000-word snippet contexts and few-shot prompts, with post-processing to enforce the exact and ideal answer formats. Across multiple LLMs evaluated on BioASQ Task 12b and Synergy, Mixtral 47B emerges as the strongest model, with competitive results that include a MAP of 0.14 for document retrieval and an F1 of 0.96 for yes/no QA. The results also highlight the importance of context and post-processing in reducing hallucinations. The work identifies practical considerations for scaling biomedical QA, such as using vector databases for a larger initial retrieval, comparing on-the-fly and precomputed embeddings, and potentially fine-tuning with LoRA on domain data.

Abstract

Our team participated in the BioASQ 2024 Task 12b and Synergy tasks, building a system that answers biomedical questions by retrieving relevant articles and snippets from the PubMed database and generating exact and ideal answers. We propose a two-level information retrieval and question-answering system based on pre-trained large language models (LLMs), focused on LLM prompt engineering and response post-processing. We construct prompts with in-context few-shot examples and apply post-processing techniques such as resampling and malformed response detection. We compare the performance of several pre-trained LLMs on this challenge, including Mixtral, OpenAI GPT, and Llama2. Our best-performing system achieved a MAP of 0.14 on document retrieval, a MAP of 0.05 on snippet retrieval, an F1 of 0.96 for yes/no questions, an MRR of 0.38 for factoid questions, and an F1 of 0.50 for list questions in Task 12b.
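The post-processing step mentioned in the abstract, resampling combined with malformed response detection, can be sketched for yes/no questions as follows. This is a minimal illustration under stated assumptions: `generate` stands in for any LLM call, and the parsing rule and retry limit are hypothetical choices rather than the authors' implementation.

```python
import re
from typing import Callable, Optional


def parse_yes_no(response: str) -> Optional[str]:
    """Return 'yes' or 'no' if the response starts with one, else None.

    A response that does not begin with a standalone yes/no is treated
    as malformed (an assumed rule for this sketch).
    """
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def answer_with_resampling(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Query the model again until a well-formed answer is produced.

    generate is a hypothetical callable wrapping the LLM; it should
    sample, so repeated calls can return different text.
    """
    for _ in range(max_attempts):
        answer = parse_yes_no(generate(prompt))
        if answer is not None:
            return answer
    # Every attempt was malformed; the caller decides on a fallback.
    return None
```

With a stub generator that first returns "I think maybe" and then "Yes, because ...", the function discards the first response as malformed and returns "yes" on the second attempt.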
