An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models

Anantharaman Janakiraman, Behnaz Ghoraani

TL;DR

This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source) using a novel multi-dimensional framework, identifying a critical tension between factual consistency and perceived quality.

Abstract

Text summarization is crucial for mitigating information overload across domains like journalism, medicine, and business. This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source) using a novel multi-dimensional framework. We assessed models on seven diverse datasets (BigPatent, BillSum, CNN/DailyMail, PubMed, SAMSum, WikiHow, XSum) at three output lengths (50, 100, 150 tokens) using metrics for factual consistency, semantic similarity, lexical overlap, and human-like quality, while also considering efficiency factors. Our findings reveal significant performance differences, with specific models excelling in factual accuracy (deepseek-v3), human-like quality (claude-3-5-sonnet), and processing efficiency/cost-effectiveness (gemini-1.5-flash, gemini-2.0-flash). Performance varies dramatically by dataset, with models struggling on technical domains but performing well on conversational content. We identified a critical tension between factual consistency (best at 50 tokens) and perceived quality (best at 150 tokens). Our analysis provides evidence-based recommendations for different use cases, from high-stakes applications requiring factual accuracy to resource-constrained environments needing efficient processing. This comprehensive approach enhances evaluation methodology by integrating quality metrics with operational considerations, incorporating trade-offs between accuracy, efficiency, and cost-effectiveness to guide model selection for specific applications.

Paper Structure

This paper contains 32 sections, 5 figures, and 12 tables.

Figures (5)

  • Figure 1: Multi-dimensional evaluation framework for assessing large language models on text summarization
  • Figure 2: Ranking Process
  • Figure 3: Comparative analysis of 17 models across all evaluation dimensions, normalized to the 0-1 range for fair comparison.
  • Figure 4: Impact of Summary Length on Performance Metrics: This comprehensive visualization shows how different quality metrics vary with summary length (50, 100, and 150 tokens).
  • Figure 5: Quality-Efficiency Trade-off: Bubble size represents factual consistency score; position shows the balance between quality (y-axis) and efficiency (x-axis) metrics.
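
The caption of Figure 3 says that scores are normalized to the 0-1 range so that metrics on different scales can be compared. This excerpt does not state which normalization the authors used. The sketch below assumes plain min-max scaling applied per metric across models; the function name `min_max_normalize` and the model names and scores are hypothetical and are not results from the paper.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one metric's raw scores to [0, 1] across models.

    Assumption: min-max scaling. The paper excerpt only says scores are
    "normalized to the 0-1 range" and does not name the method.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tie on this metric, so assign a neutral midpoint.
        return {name: 0.5 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


# Hypothetical raw scores for illustration only (not taken from the paper).
raw = {"model_a": 0.20, "model_b": 0.35, "model_c": 0.50}
normalized = min_max_normalize(raw)
```

Under this assumption the lowest-scoring model on a metric maps to 0 and the highest to 1. For metrics where lower is better, such as latency or cost, the result would need to be inverted (`1 - value`) before comparison.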