MaLei at the PLABA Track of TREC 2024: RoBERTa for Term Replacement -- LLaMA3.1 and GPT-4o for Complete Abstract Adaptation
Zhidong Ling, Zihao Li, Pablo Romero, Lifeng Han, Goran Nenadic
TL;DR
This paper examines Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024, addressing two tasks—term replacement and complete abstract adaptation—through a diverse model suite including RoBERTa-based classifiers, T5, SciFive, BART with control tokens, and large-language-model prompts (LLaMA-3.1 and GPT-4o). It reports that a compact RoBERTa-Base model ranks highly on term replacement, while LLaMA-3.1-70B-Instruct achieves the top Completeness score for abstract adaptation, with GPT-4o showing competitive one-shot performance. Automatic metrics (BLEU, ROUGE, SARI, BERTScore) and human evaluations reveal trade-offs: BART-CTs delivers stronger simplification but can alter meaning, whereas T5 variants tend to preserve meaning better; these findings underscore the limitations of any single metric and the value of combining automatic and human assessments. The authors provide public code and models, contributing practical resources for advancing biomedical plain-language technologies and waiving insight into prompt-based and controllable approaches for PLA tasks.
Abstract
This report is the system description of the MaLei team (Manchester and Leiden) for the shared task Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024 (we had an earlier name BeeManc following last year), affiliated with TREC2024 (33rd Text REtrieval Conference https://ir.nist.gov/evalbase/conf/trec-2024). This report contains two sections corresponding to the two sub-tasks in PLABA-2024. In task one (term replacement), we applied fine-tuned ReBERTa-Base models to identify and classify the difficult terms, jargon, and acronyms in the biomedical abstracts and reported the F1 score (Task 1A and 1B). In task two (complete abstract adaptation), we leveraged Llamma3.1-70B-Instruct and GPT-4o with the one-shot prompts to complete the abstract adaptation and reported the scores in BLEU, SARI, BERTScore, LENS, and SALSA. From the official Evaluation from PLABA-2024 on Task 1A and 1B, our much smaller fine-tuned RoBERTa-Base model ranked 3rd and 2nd respectively on the two sub-tasks, and the 1st on averaged F1 scores across the two tasks from 9 evaluated systems. Our LLaMA-3.1-70B-instructed model achieved the highest Completeness score for Task 2. We share our source codes, fine-tuned models, and related resources at https://github.com/HECTA-UoM/PLABA2024
