An Architectural Advantage of The Instruction-Tuned LLM in Containing The Readability-Accuracy Tension in Text Simplification
P. Bilha Githinji, Aikaterini Meilliou, Zeming Liang, Lian Zhang, Peiwu Qin
TL;DR
This study benchmarks two classes of general-purpose LLMs on biomedical text simplification to understand readability-accuracy trade-offs without domain tuning. Using a zero-shot prompt framework and two temperature settings, it compares the instruction-tuned Mistral-Small 3 24B and the reasoning-augmented Qwen2.5 32B against human references across 21 metrics. Mistral improves readability while keeping discourse fidelity near human level (BERTScore ≈ 0.91) through a conservative lexical strategy, whereas Qwen reaches similar readability with weaker discourse fidelity and more lexical expansion. A comprehensive correlation analysis reveals redundancies among readability metrics and architecture-specific patterns that inform metric selection and domain adaptation for text simplification. Overall, the findings point to an architectural advantage for instruction-tuned LLMs in balancing readability and accuracy in biomedical text, with practical implications for scalable health information accessibility.
Abstract
Growing health-seeking behavior and digital consumption of biomedical information by the general public call for scalable solutions that automatically adapt complex scientific and technical documents into plain language. However, automatic text simplification systems, including advanced large language models (LLMs), still struggle to reliably arbitrate the tension between optimizing readability and preserving discourse fidelity. This report empirically assesses two major classes of general-purpose LLMs and shows how each navigates the readability-accuracy tension relative to a human benchmark. Comparing the instruction-tuned Mistral-Small 3 24B with the reasoning-augmented Qwen2.5 32B, we identify an architectural advantage in the instruction-tuned LLM. Mistral exhibits a tempered lexical simplification strategy that enhances readability across a suite of metrics while preserving human-level discourse fidelity, with a BERTScore of 0.91. Qwen also attains enhanced readability and a reasonable BERTScore of 0.89, but its operational strategy shows a disconnect in how it balances readability against accuracy. Additionally, a comprehensive correlation analysis of a suite of 21 metrics, spanning readability, discourse fidelity, content safety, and underlying distributional measures for mechanistic insight, confirms strong functional redundancies and informs metric selection and domain adaptation for text simplification.
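
To make the idea of functional redundancy among metrics concrete, the following is a minimal sketch of how a pairwise rank-correlation screen might look. It is not the analysis pipeline used in this report: the choice of Spearman correlation, the 0.9 threshold, the metric names, and the toy scores are all assumptions made purely for illustration.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def redundant_pairs(scores, threshold=0.9):
    """Flag metric pairs whose absolute Spearman correlation meets the threshold.

    scores maps a metric name to its per-document scores (equal lengths).
    The threshold is an illustrative assumption, not a value from the report.
    """
    pairs = []
    for a, b in combinations(sorted(scores), 2):
        rho = spearman(scores[a], scores[b])
        if abs(rho) >= threshold:  # a NaN correlation never passes this check
            pairs.append((a, b, round(rho, 3)))
    return pairs


# Hypothetical metric names and made-up scores for five documents.
toy_scores = {
    "readability_a": [12.1, 9.4, 14.0, 8.2, 10.5],
    "readability_b": [13.0, 10.1, 15.2, 9.0, 11.7],
    "fidelity": [0.90, 0.88, 0.91, 0.89, 0.87],
}

print(redundant_pairs(toy_scores))
```

In this toy data the two readability columns rank the documents identically, so they are flagged as a redundant pair, while the fidelity column is not. In a screen of this kind, one metric from each flagged pair could be dropped to shorten an evaluation suite, subject to whatever threshold the analyst considers appropriate.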
