Evaluating LLMs and Pre-trained Models for Text Summarization Across Diverse Datasets
Tohida Rehman, Soumabha Ghosh, Kuntal Das, Souvik Bhattacharjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay
TL;DR
The paper systematically evaluates four pre-trained language models (BART, FLAN-T5, LLaMA-3-8B, Gemma-7B) for abstractive text summarization across five datasets (CNN/DM, Gigaword, News Summary, XSum, BBC News). The larger models are fine-tuned with LoRA adapters, and the mix of encoder–decoder and decoder-only architectures is compared using ROUGE, METEOR, and BERTScore, plus qualitative case studies. The findings are dataset-dependent: FLAN-T5 excels on CNN/DM, Gemma-7B dominates Gigaword and XSum, BART leads on News Summary, and LLaMA-3-8B performs strongly on BBC News. Quality judgments obtained via ChatGPT largely align with the automatic metrics. The study highlights practical considerations in model selection and resource usage, along with persistent problems of repetition and hallucination, and points to future work on human-in-the-loop evaluation and mitigation strategies for abstractive summarization.
Abstract
Text summarization plays a crucial role in natural language processing by condensing large volumes of text into concise and coherent summaries. As digital content continues to grow rapidly and the demand for effective information retrieval increases, text summarization has become a focal point of research in recent years. This study offers a thorough evaluation of four leading pre-trained, open-source language models (BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B) across five diverse datasets: CNN/DM, Gigaword, News Summary, XSum, and BBC News. The evaluation employs widely recognized automatic metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to assess the models' capabilities in generating coherent and informative summaries. The results reveal the comparative strengths and limitations of these models in processing various text types.
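To make the overlap-based metrics named above concrete, the following is a minimal sketch of ROUGE-N and ROUGE-L F1 scores in plain Python. It is an illustration only, not the evaluation code used in the study: it assumes lowercased whitespace tokenization with no stemming, and the function names (`rouge_n`, `rouge_l`) are hypothetical. Standard ROUGE toolkits apply additional normalization, so their scores may differ from this sketch.

```python
from collections import Counter


def _tokens(text):
    # Assumption: simple lowercase + whitespace tokenization, no stemming.
    return text.lower().split()


def _f1(overlap, cand_total, ref_total):
    # Harmonic mean of precision (overlap / candidate) and recall (overlap / reference).
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """F1 over clipped n-gram overlap (n=1 for ROUGE-1, n=2 for ROUGE-2)."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand = ngrams(_tokens(candidate))
    ref = ngrams(_tokens(reference))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence (LCS) of tokens."""
    cand, ref = _tokens(candidate), _tokens(reference)
    prev = [0] * (len(ref) + 1)  # LCS lengths for the previous candidate token
    for c_tok in cand:
        curr = [0]
        for j, r_tok in enumerate(ref, start=1):
            if c_tok == r_tok:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(cand), len(ref))


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    gold = "the cat lay on the mat"
    print(f"ROUGE-1: {rouge_n(generated, gold, 1):.4f}")
    print(f"ROUGE-2: {rouge_n(generated, gold, 2):.4f}")
    print(f"ROUGE-L: {rouge_l(generated, gold):.4f}")
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards in-order matches of any length. None of them capture paraphrase, which is the gap that embedding-based metrics such as BERTScore aim to fill.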
