WisPerMed at BioLaySumm: Adapting Autoregressive Large Language Models for Lay Summarization of Scientific Articles

Tabea M. G. Pakull, Hendrik Damm, Ahmad Idrissi-Yaghir, Henning Schäfer, Peter A. Horn, Christoph M. Friedrich

TL;DR

The paper tackles making biomedical literature accessible to lay audiences by adapting autoregressive LLMs through domain-specific fine-tuning and prompt design. It compares BioMistral-7B-DARE, Llama3-70B-Instruct, and OpenBioLLM-70B under zero-shot, few-shot, and fine-tuning regimes, introducing a Dynamic Expert Selection mechanism to optimize readability and factuality without references. Results indicate that fine-tuning generally yields the strongest performance across metrics, while few-shot prompts and DES further boost readability and factual correctness, with BioMistral often outperforming larger general models. The work demonstrates a practical path to high-quality lay summaries in biomedicine and points to future improvements in prompt engineering, DES optimization, and domain expansion.

Abstract

This paper details the efforts of the WisPerMed team in the BioLaySumm2024 Shared Task on automatic lay summarization in the biomedical domain, aimed at making scientific publications accessible to non-specialists. Large language models (LLMs), specifically the BioMistral and Llama3 models, were fine-tuned and employed to create lay summaries from complex scientific texts. The summarization performance was enhanced through various approaches, including instruction tuning, few-shot learning, and prompt variations tailored to incorporate specific context information. The experiments demonstrated that fine-tuning generally led to the best performance across most evaluated metrics. Few-shot learning notably improved the models' ability to generate relevant and factually accurate texts, particularly when using a well-crafted prompt. Additionally, a Dynamic Expert Selection (DES) mechanism was developed to optimize the selection of text outputs based on readability and factuality metrics. Out of 54 participants, the WisPerMed team placed 4th, as measured by readability, factuality, and relevance. In terms of the overall score, our approach improved upon the baseline by approximately 5.5 percentage points and trailed first place by only approximately 1.5 percentage points.

Paper Structure (16 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Workflow of the Dynamic Expert Selection (DES) mechanism in the few-shot setting using an example from the PLOS dataset. The process involves ranking examples, generating multiple summaries through various prompt variations, applying a large language model (LLM), and then normalizing and weighing the readability (R) and factuality (F) scores to rank and select the best summary based on the selection scores (S).
  • Figure 2: The prompt used for fine-tuning BioM and as the initial prompt in the zero- and few-shot settings. For fine-tuning the prompt also includes the target lay summary.
  • Figure 3: The prompt used for fine-tuning Llama3. For fine-tuning the prompt also includes the target lay summary.
  • Figure 4: The Persona-Prompt used in zero- and few-shot setting with BioM.
  • Figure 5: The Intro-Prompt used in zero- and few-shot setting with BioM.
  • ...and 1 more figure
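
The selection step described in the Figure 1 caption (normalize the readability and factuality scores, weigh them, then rank candidates by a selection score S) can be illustrated with a minimal sketch. This is not the authors' implementation: the min-max normalization, the equal weights, the `Candidate` class, the function names, and the example scores are all assumptions made for illustration, and both scores are assumed to be oriented so that higher is better (a grade-level readability metric, where lower is better, would need to be inverted first).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated summary with its reference-free scores (hypothetical)."""
    text: str
    readability: float  # assumed: higher = easier to read
    factuality: float   # assumed: higher = more faithful to the article


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_summary(candidates, w_read=0.5, w_fact=0.5):
    """Return the best candidate and the selection score S of every candidate.

    S is a weighted sum of the normalized readability (R) and
    factuality (F) scores; the weights here are illustrative only.
    """
    if not candidates:
        raise ValueError("need at least one candidate summary")
    r = min_max([c.readability for c in candidates])
    f = min_max([c.factuality for c in candidates])
    scores = [w_read * ri + w_fact * fi for ri, fi in zip(r, f)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Made-up scores for three prompt variations applied to the same article.
candidates = [
    Candidate("summary from initial prompt", readability=0.2, factuality=0.9),
    Candidate("summary from persona prompt", readability=0.8, factuality=0.7),
    Candidate("summary from intro prompt", readability=0.6, factuality=0.5),
]
best, scores = select_summary(candidates)
print(best.text)
```

Because the scores are normalized across the candidates for a single article, the selection only compares summaries of the same input with each other, which is what allows it to work without a reference summary.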