Redefining Simplicity: Benchmarking Large Language Models from Lexical to Document Simplification
Jipeng Qiang, Minjiang Huang, Yi Zhu, Yunhao Yuan, Chaowei Zhang, Kui Yu
TL;DR
The paper benchmarks text simplification across four tasks (lexical, syntactic, sentence, and document simplification), comparing lightweight, open-source, and closed-source LLMs against traditional non-LLM baselines, with human judgments alongside automatic metrics. It finds that large language models, particularly GPT-4o, consistently outperform non-LLM methods across all tasks, with lightweight models excelling in certain sentence-level and syntactic tasks and, in some cases, surpassing human references under automatic evaluation. The study also reveals limitations of existing evaluation metrics when faced with high-quality LLM outputs, motivating a shift toward new metrics and human-in-the-loop assessment. The authors propose four future directions (multi-level simplification, personalized simplification, efficient deployment of lightweight LLMs, and evaluation frameworks that transcend human references) to guide TS research in the LLM era and make high-quality simplified text more widely accessible.
Abstract
Text simplification (TS) refers to the process of reducing the complexity of a text while retaining its original meaning and key information. Existing work shows only that large language models (LLMs) outperform supervised non-LLM-based methods on sentence simplification. This study offers the first comprehensive analysis of LLM performance across four TS tasks: lexical, syntactic, sentence, and document simplification. We compare lightweight, closed-source, and open-source LLMs against traditional non-LLM methods using automatic metrics and human evaluations. Our experiments reveal that LLMs not only outperform non-LLM approaches in all four tasks but also often generate outputs that exceed the quality of existing human-annotated references. Finally, we present some future directions for TS in the era of LLMs.
