
Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance

Borui Xu, Yao Chen, Zeyi Wen, Weiguo Liu, Bingsheng He

TL;DR

This paper addresses the problem of resource-efficient news summarization by evaluating 19 small language models (SLMs) against large language model (LLM) baselines on 2,000 news articles from four datasets. It employs an LLM-augmented reference-based evaluation framework: two high-quality LLMs generate the reference summaries, and BertScore and factual-consistency metrics are used to compare performance. The findings show that top SLMs such as Phi3-Mini and Llama3.2-3B-Ins can match or approach 70B LLMs in relevance, coherence, and factual consistency while producing shorter summaries, which indicates strong potential for edge deployment. The study also finds that simple prompts suffice for SLMs and that instruction tuning yields inconsistent improvements, offering practical guidance for deployment and pointing to future research on longer texts and model quantization.
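The reference-based scoring described above can be illustrated with a minimal sketch. It is not the paper's implementation: a simple token-overlap F1 stands in for BertScore so that the example runs without model downloads, averaging over the references is an assumed aggregation rule, and all function names and example texts (`overlap_f1`, `score_against_references`, `word_count`) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(candidate, reference):
    """Token-overlap F1 between two texts (a stand-in for BertScore)."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def score_against_references(candidate, references, metric=overlap_f1):
    """Average the metric over all references (assumed aggregation)."""
    if not references:
        raise ValueError("at least one reference summary is required")
    return sum(metric(candidate, ref) for ref in references) / len(references)


def word_count(summary):
    """Summary length in whitespace-separated words."""
    return len(summary.split())


if __name__ == "__main__":
    # Hypothetical references, as if written by two reference LLMs.
    references = [
        "The city council approved a new budget for public transport.",
        "Council members voted to approve the public transport budget.",
    ]
    candidate = "The council approved the public transport budget."
    print("score:", round(score_against_references(candidate, references), 3))
    print("length:", word_count(candidate))
```

Passing a different callable as `metric` would let the same loop be reused with an embedding-based score, provided it takes a candidate and a reference and returns a float.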

Abstract

The increasing demand for efficient summarization tools in resource-constrained environments highlights the need for effective solutions. While large language models (LLMs) deliver superior summarization quality, their high computational requirements limit their practical use. In contrast, small language models (SLMs) present a more accessible alternative, capable of real-time summarization on edge devices. However, their summarization capabilities and comparative performance against LLMs remain underexplored. This paper addresses this gap by presenting a comprehensive evaluation of 19 SLMs for news summarization across 2,000 news samples, focusing on relevance, coherence, factual consistency, and summary length. Our findings reveal significant variations in SLM performance, with top-performing models such as Phi3-Mini and Llama3.2-3B-Ins achieving results comparable to those of 70B LLMs while generating more concise summaries. Notably, SLMs are better suited for simple prompts, as overly complex prompts may lead to a decline in summary quality. Additionally, our analysis indicates that instruction tuning does not consistently enhance the news summarization capabilities of SLMs. This research not only contributes to the understanding of SLMs but also provides practical insights for researchers seeking efficient summarization solutions that balance performance and resource use.

Paper Structure

This paper contains 26 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Variation of SLMs in size and BertScore for news summarization.
  • Figure 2: Comparison of text summarization evaluation.
  • Figure 3: The prompt templates for the language model to generate summaries.
  • Figure 4: Example summaries from SLMs and LLMs. Bold text matches the reference, underlined text marks irrelevant content, and red text marks incorrect content. Summaries with BertScore above 70, such as those from Llama3.2-3B-Ins, demonstrate similar quality to LLMs.
  • Figure 5: Average summary length comparison. SLMs with high BertScore generate 50-70 word summaries.
  • ...and 2 more figures