OLAPH: Improving Factuality in Biomedical Long-form Question Answering

Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, Jaewoo Kang

TL;DR

These findings reveal that a 7B LLM trained with the OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality, and highlight that, even on evaluation metrics not used during training, LLMs trained with the OLAPH framework demonstrate significant performance improvement in factuality.

Abstract

In the medical domain, numerous scenarios necessitate the long-form generation ability of large language models (LLMs). Specifically, when addressing patients' questions, it is essential that the model's response conveys factual claims, highlighting the need for an automated method to evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset reconstructed using long-form question-answering datasets related to the biomedical domain. We use MedLFQA to facilitate cost-effective automatic evaluation of factuality. We also propose OLAPH, a simple and novel framework that utilizes cost-effective and multifaceted automatic evaluation to construct a synthetic preference set and answers questions in our preferred manner. Our framework leads us to train LLMs step-by-step to reduce hallucinations and include crucial medical claims. We highlight that, even on evaluation metrics not used during training, LLMs trained with our OLAPH framework demonstrate significant performance improvement in factuality. Our findings reveal that a 7B LLM trained with our OLAPH framework can provide long answers comparable to the medical experts' answers in terms of factuality. We believe that our work could shed light on gauging the long-text generation ability of LLMs in the medical domain. Our code and datasets are available.

Paper Structure (31 sections, 5 equations, 8 figures, 13 tables)

This paper contains 31 sections, 5 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Current LFQA benchmark datasets lack comprehensive evaluation criteria, featuring just a pair of questions and answers (or not even an answer). In MedLFQA, we provide GPT-4 generated answers as well as two crucial statements to address this limitation. For instance, a well-generated GPT-4 response provides information on the definition, advantages, disadvantages, and side effects of Lexapro in response to a patient's inquiry about it. Additionally, the answers and statements are structured to enable assessment of how closely the LLM response aligns with the correct answer in terms of multifaceted automatic evaluation: factuality, semantic similarity, and word composition.
  • Figure 2: Pairwise evaluation from the medical experts. A higher percentage indicates better quality for the top 4 rows and the opposite for the bottom 5 rows. We use ✓ for better quality of GPT-4 generated answers compared to the human annotated answers.
  • Figure 3: Overall OLAPH framework. We iteratively implement the following steps to train LLMs (Steps 1-4). If a patient asks a question about the details of Lexapro, we generate $k$ predictions with temperature sampling (Step 1). These predictions are evaluated based on three main categories of our preferred evaluation metrics. We compute the multifaceted automatic evaluation and sort the predictions by score (Step 2). We split them into two sets (preferred and dispreferred) using a pre-determined threshold to construct the synthetic alignment pair dataset (Step 3). We then train the LLMs through preference optimization such as DPO (Rafailov et al., 2024) (Step 4). Finally, we obtain the preferred answer to the patient's question. Here, we omit the SFT training part.
  • Figure 4: Iterative learning results of the K-QA Golden dataset using BioMistral 7B.
  • Figure 5: We evaluate factuality using FActScore, a metric that is not used during training. We report FActScore without length penalty. We supply domain-specific knowledge due to the potential lack of biomedical knowledge. We also provide the GPT-4 score as an upper bound on FActScore performance. We observe that performance degrades right after SFT but reaches its highest level with iterative alignment tuning.
  • ...and 3 more figures
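
Steps 2 and 3 of the Figure 3 caption (score the sampled predictions, sort them, split them at a threshold, and pair preferred with dispreferred answers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `token_f1` and `build_preference_pairs` are hypothetical, the unigram-overlap F1 scorer is a toy stand-in for the paper's multifaceted automatic evaluation, and the `prompt`/`chosen`/`rejected` record layout is a commonly used convention for preference data that is assumed here rather than taken from the source.

```python
from itertools import product
from typing import Callable, Dict, List


def token_f1(prediction: str, reference: str) -> float:
    """Toy stand-in scorer (assumption): unigram-overlap F1 of two strings."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    overlap = len(pred_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def build_preference_pairs(
    question: str,
    predictions: List[str],
    reference: str,
    threshold: float,
    score_fn: Callable[[str, str], float] = token_f1,
) -> List[Dict[str, str]]:
    """Score, sort, and split predictions, then pair preferred with dispreferred."""
    # Step 2: score every sampled prediction and sort from best to worst.
    scored = sorted(
        ((score_fn(p, reference), p) for p in predictions),
        key=lambda item: item[0],
        reverse=True,
    )
    # Step 3: a pre-determined threshold separates the two sets.
    preferred = [p for score, p in scored if score >= threshold]
    dispreferred = [p for score, p in scored if score < threshold]
    # Each (preferred, dispreferred) combination becomes one training pair.
    return [
        {"prompt": question, "chosen": good, "rejected": bad}
        for good, bad in product(preferred, dispreferred)
    ]


if __name__ == "__main__":
    pairs = build_preference_pairs(
        question="What is Lexapro?",
        predictions=[
            "Lexapro is an antidepressant",
            "Lexapro is a painkiller for headaches",
            "I do not know",
        ],
        reference="Lexapro is an antidepressant used to treat depression and anxiety",
        threshold=0.5,
    )
    for pair in pairs:
        print(pair["chosen"], "|", pair["rejected"])
```

The resulting pairs would then feed a preference-optimization step such as DPO (Step 4). In practice the scorer would be replaced by the combined factuality, semantic-similarity, and word-composition metrics that the caption mentions.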