Retrieval augmentation of large language models for lay language generation
Yue Guo, Wei Qiu, Gondy Leroy, Sheng Wang, Trevor Cohen
TL;DR
Automated lay language generation is hampered by the need to supply background information that is absent from the source document. The authors introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus of scientific abstracts and expert-authored lay summaries, and Retrieval-Augmented Lay Language (RALL) generation, which adds background explanations alongside simplification. Using in-domain pre-training and definition- and embedding-based retrieval, and evaluating both transformer models and LLMs, they report improvements in content quality, readability, and interpretability, with mixed results for LLMs. The work provides a corpus and methodology for making biomedical knowledge more accessible, and outlines open challenges in factual alignment and evaluation for future work.
Abstract
Recent lay language generation systems have used Transformer models trained on a parallel corpus to increase health information accessibility. However, the applicability of these models is constrained by the limited size and topical breadth of available corpora. We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. The abstract and the corresponding lay language summary are written by domain experts, assuring the quality of our dataset. Furthermore, qualitative evaluation of expert-authored plain language summaries has revealed background explanation as a key strategy to increase accessibility. Such explanation is challenging for neural models to generate because it goes beyond simplification by adding content absent from the source. We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract. We adopt retrieval-augmented models as an intuitive fit for the task of background explanation generation, and show improvements in summary quality and simplicity while maintaining factual correctness. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the path for disseminating scientific knowledge to a broader audience. CELLS is publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.
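To make the idea of retrieval augmentation for background explanation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny hypothetical glossary (GLOSSARY), uses bag-of-words cosine similarity as a stand-in for learned embeddings, and joins the retrieved definitions to the source text with an invented "background: ... source: ..." format before the text would be passed to a generation model.

```python
import math
import re
from collections import Counter

# Hypothetical mini glossary standing in for an external definition source.
GLOSSARY = {
    "hypertension": "Hypertension is blood pressure that stays higher than normal.",
    "placebo": "A placebo is an inactive treatment used for comparison in a trial.",
    "insulin": "Insulin is a hormone that helps the body use sugar for energy.",
}

# Small illustrative stopword list so common words do not drive the match.
STOPWORDS = {"a", "an", "the", "is", "or", "with", "that", "for", "in", "of", "to", "and"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(source, k=1):
    """Return up to k glossary definitions with non-zero similarity to the source."""
    query = Counter(tokenize(source))
    scored = [
        (cosine(query, Counter(tokenize(term + " " + definition))), term)
        for term, definition in GLOSSARY.items()
    ]
    scored.sort(reverse=True)
    return [GLOSSARY[term] for score, term in scored[:k] if score > 0]


def build_augmented_input(source, k=1):
    """Prepend retrieved background to the source (the format is an assumption)."""
    background = " ".join(retrieve(source, k))
    return f"background: {background} source: {source}"


if __name__ == "__main__":
    text = "Patients with hypertension received a placebo or the study drug."
    print(build_augmented_input(text, k=2))
```

In this sketch the generator never has to invent the definition of a term such as "placebo"; it only has to rewrite retrieved text, which is one intuition for why retrieval suits content that is absent from the source.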
