Unraveling the Capabilities of Language Models in News Summarization

Abdurrahman Odabaşı, Göksel Biricik

TL;DR

The study conducts a comprehensive benchmark of 20 contemporary language models, spanning large private and smaller public LMs, for news summarization across CNN/DM, Newsroom, and XSum in zero-shot and three-shot settings. Using a multi-faceted evaluation framework (automatic metrics, human judgments, and AI-based scoring with Claude 3 Sonnet), it reveals the persistent dominance of large models while identifying several smaller models that perform competitively on specific datasets or in specific settings. It also shows that few-shot demonstrations often fail due to low-quality gold summaries and evaluation biases, and it highlights common failure modes such as early termination and hallucinations. The work underscores the importance of data quality and points to future directions, including higher-quality gold standards, genre-aware analysis, tuned generation settings, and extensions to multi-document and multilingual summarization.

Abstract

Given the recent introduction of multiple language models and the ongoing demand for better performance on Natural Language Processing tasks, particularly summarization, this work provides a comprehensive benchmark of 20 recent language models, focusing on smaller ones, for the news summarization task. We systematically test the capabilities and effectiveness of these models in summarizing news articles written in different styles and drawn from three distinct datasets. Specifically, we focus on zero-shot and few-shot learning settings and apply a robust evaluation methodology that combines automatic metrics, human evaluation, and LLM-as-a-judge. Interestingly, including demonstration examples in the few-shot setting did not enhance the models' performance and, in some cases, even degraded the quality of the generated summaries. This issue arises mainly from the poor quality of the gold summaries used as reference summaries, which negatively impacts the models' performance. Furthermore, our results highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities. However, among the public models evaluated, Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta demonstrated promising results, positioning them as competitive alternatives to large models for news summarization.

Paper Structure

This paper contains 30 sections, 1 figure, and 9 tables.

Figures (1)

  • Figure 1: Overlap ratio distributions for the training sets of the three datasets (CNN/DM, Newsroom, XSum), visualized as normalized histograms with overlaid Kernel Density Estimate (KDE) curves. The x-axis represents the overlap ratio, while the y-axis indicates the percentage density, highlighting differences in overlap characteristics among the datasets.
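The caption above does not say how the overlap ratio is computed. As a rough illustration only, the sketch below assumes it is the fraction of whitespace-separated summary tokens that also appear in the source article, then bins the ratios into a percentage histogram and fits a simple Gaussian kernel density estimate. The function names, the tokenization, the bin count, and the bandwidth are all hypothetical choices made for this example and are not taken from the paper.

```python
import math


def overlap_ratio(article: str, summary: str) -> float:
    """Fraction of summary tokens that also occur in the article.

    ASSUMPTION: lowercase whitespace tokenization; the paper's own
    definition of "overlap ratio" may differ.
    """
    article_tokens = set(article.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(1 for tok in summary_tokens if tok in article_tokens)
    return hits / len(summary_tokens)


def normalized_histogram(values, bins=10):
    """Bin values from [0, 1] and return the percentage of samples per bin."""
    if not values:
        return [0.0] * bins
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)  # a value of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return [100.0 * c / len(values) for c in counts]


def gaussian_kde(values, bandwidth=0.05):
    """Return a function that estimates the density at x with Gaussian kernels.

    ASSUMPTION: a fixed, hand-picked bandwidth; plotting libraries usually
    select one automatically.
    """
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


if __name__ == "__main__":
    # Toy (article, summary) pairs, invented for illustration.
    pairs = [
        ("the cat sat on the mat", "the cat sat"),
        ("stocks fell sharply on monday", "markets dropped"),
        ("the mayor opened a new bridge", "mayor opens bridge"),
    ]
    ratios = [overlap_ratio(a, s) for a, s in pairs]
    print(normalized_histogram(ratios, bins=4))
    print(gaussian_kde(ratios)(0.5))
```

Under this assumed definition, a dataset with highly extractive summaries would concentrate its mass near 1.0, while a more abstractive dataset would shift toward lower ratios, which is the kind of difference the figure is described as highlighting.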