Retrieval augmentation of large language models for lay language generation

Yue Guo; Wei Qiu; Gondy Leroy; Sheng Wang; Trevor Cohen

TL;DR

Automated lay language generation is hampered by the need to provide background information not present in source documents. The authors introduce CELLS, the largest diverse corpus of scientific abstracts and expert-authored lay summaries, and Retrieval-Augmented Lay Language (RALL) to inject background explanations alongside simplification. Through in-domain pre-training, definition- and embedding-based retrieval, and evaluation with transformer models and LLMs, they demonstrate improvements in content quality, readability, and interpretability, with LLMs showing mixed results. The work establishes a valuable resource and methodology for making biomedical knowledge more accessible, while also outlining challenges in factual alignment and evaluation that future work can address.

Abstract

Recent lay language generation systems have used Transformer models trained on a parallel corpus to increase health information accessibility. However, the applicability of these models is constrained by the limited size and topical breadth of available corpora. We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. The abstract and the corresponding lay language summary are written by domain experts, assuring the quality of our dataset. Furthermore, qualitative evaluation of expert-authored plain language summaries has revealed background explanation as a key strategy to increase accessibility. Such explanation is challenging for neural models to generate because it goes beyond simplification by adding content absent from the source. We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract. We adopt retrieval-augmented models as an intuitive fit for the task of background explanation generation, and show improvements in summary quality and simplicity while maintaining factual correctness. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the path for disseminating scientific knowledge to a broader audience. CELLS is publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.

Paper Structure (36 sections, 1 equation, 9 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Lay language summary generation
Lay language summarization datasets
Lay language summary generation methodologies
Retrieval-augmented text generation
Materials and methods
The CELLS Dataset
Data compilation
Dataset applications
Lay language generation
Simplification
Background explanation
Human Validation of Dataset
Dataset analysis
...and 21 more sections

Figures (9)

Figure 1: An example application of the GPSS algorithm. RL indicates the F1 score from ROUGE-L between the sentences in the abstract and plain language summary. For the background explanation subset, we combined unaligned target sentences (grey blocks) with proximal aligned sentences (green blocks). The example illustrates the generation of three paired examples ("pair") for the background explanation subset. All three pairs include the initial explanatory content that precedes the first matched sentence (RL = 14.63), as well as the sentence in the lay language summary that matches it. The second pair also includes the explanatory content after this matched sentence, and the third pair additionally includes the following matched sentence (i.e., the second sentence in the source abstract and the lay language summary sentence that aligns with it). These combinations allow for the possibility that added content may relate to either the preceding or the subsequent sentence.
Figure 1: Models' performance in text generation. We used the F1 score of BLEU and METEOR to evaluate the generation quality of models on lay language generation, simplification, and background explanation tasks. P-values obtained through the t-test are employed to evaluate the performance of various models compared to the Vanilla model (BART). A p-value less than 0.05 is indicated by (*).
Figure 2: Dataset analysis. a, Source and target Coleman-Liau readability scores for the 12 journals included in CELLS. Each dot represents one journal. A lower score indicates that the text is easier to read. b, c, Average length and Coleman-Liau readability score for source and target text for three tasks (i.e., lay language generation, simplification, and background explanation). On average, target text is shorter and easier to read for all three tasks. "*" indicates that the score of the target is significantly lower than that of the source, with p-value $< 0.05$ (paired t-test).
Figure 2: Models' performance in text generation on the validated dataset. We used the F1 score of ROUGE-L and BERTScore to evaluate the generation quality of models on lay language generation, simplification, and background explanation tasks. P-values obtained through the t-test are employed to evaluate the performance of various models compared to the Vanilla model (BART). A p-value less than 0.05 is indicated by (*).
Figure 3: Models' performance in text generation. We used the F1 score of ROUGE-L and BERTScore to evaluate the generation quality of models on lay language generation, simplification, and background explanation tasks. P-values obtained through the t-test are employed to evaluate the performance of various models compared to the Vanilla model (BART). * indicates statistical significance with Bonferroni-Holm correction for multiple hypothesis testing (Holm, 1979).
...and 4 more figures
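
The captions above rely on two metrics: the ROUGE-L F1 score used to align abstract and summary sentences (Figure 1) and the Coleman-Liau readability score (Figure 2). The following is a minimal sketch of both, not the authors' implementation. It assumes whitespace tokenization, lowercasing, and naive sentence splitting on ".", "!" and "?"; the function names `lcs_length`, `rouge_l_f1`, and `coleman_liau` are illustrative, and the paper's own preprocessing and score scaling (the caption reports RL = 14.63, which suggests a 0-100 scale) may differ.

```python
import re


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(source, target):
    """ROUGE-L F1 between two sentences, on a 0-1 scale.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    src = source.lower().split()
    tgt = target.lower().split()
    lcs = lcs_length(src, tgt)
    if lcs == 0:
        return 0.0
    precision = lcs / len(tgt)  # share of target tokens in the LCS
    recall = lcs / len(src)     # share of source tokens in the LCS
    return 2 * precision * recall / (precision + recall)


def coleman_liau(text):
    """Coleman-Liau index: 0.0588 * L - 0.296 * S - 15.8.

    L is the number of letters per 100 words and S is the number of
    sentences per 100 words. Lower values indicate easier text.
    Assumption: sentences end at '.', '!' or '?'.
    """
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    letters = sum(len(w) for w in words)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


if __name__ == "__main__":
    abstract_sentence = "the cat sat on the mat"
    summary_sentence = "the cat lay on a mat"
    print(rouge_l_f1(abstract_sentence, summary_sentence))
    print(coleman_liau("The cat sat. The dog ran."))
```

In an alignment setting like the one Figure 1 describes, a score such as `rouge_l_f1` would be computed for each abstract sentence against each summary sentence, and summary sentences with no sufficiently similar counterpart would be treated as candidate background explanation. The threshold and matching rules are not stated in the captions, so they are left out here.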
