$\texttt{COSMIC}$: Mutual Information for Task-Agnostic Summarization Evaluation

Maxime Darrin; Philippe Formont; Jackie Chi Kit Cheung; Pablo Piantanida

TL;DR

This work proposes a principled, task-agnostic approach to summarization evaluation: it links the preservation of downstream task outcomes to the mutual information between source texts and generated summaries, which the COSMIC metric quantifies. COSMIC estimates $I(\mathbf{T};\mathbf{S})$ from samples by applying the KNIFE estimator to embeddings, and theoretical upper and lower bounds on downstream error are expressed in terms of $I(\mathbf{T};\mathbf{S})$, which makes it a scalable, reference-free quality measure. Empirically, COSMIC correlates with human-judgment proxies and is competitive with established baselines (e.g., BERTScore, BARTScore) across multiple datasets and downstream tasks, while revealing complementary relationships to learned metrics such as SEAHORSE. The results demonstrate COSMIC’s potential as a robust, information-theoretically grounded tool for evaluating summarization systems in practical, downstream contexts, with acknowledged limitations related to embedding choices and dataset entropy.
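To make the idea of estimating $I(\mathbf{T};\mathbf{S})$ from embedding samples concrete, here is a minimal sketch. It does not implement KNIFE; it swaps in a much simpler plug-in estimator that assumes the source and summary embeddings are jointly Gaussian. The function name `gaussian_mutual_information`, the `ridge` parameter, and the toy "summarizers" are all hypothetical and chosen for illustration only.

```python
import numpy as np


def gaussian_mutual_information(t_emb: np.ndarray, s_emb: np.ndarray,
                                ridge: float = 1e-6) -> float:
    """Plug-in MI estimate (in nats) under a joint-Gaussian assumption.

    NOTE: illustrative stand-in only; this is NOT the KNIFE estimator.
    t_emb has shape (n, d_t) and s_emb has shape (n, d_s), row i of each
    describing the same (source, summary) pair.
    """
    if t_emb.shape[0] != s_emb.shape[0]:
        raise ValueError("need one summary embedding per source embedding")
    d_t = t_emb.shape[1]
    joint = np.hstack([t_emb, s_emb])
    # Sample covariance of the stacked vector; the ridge keeps it well conditioned.
    cov = np.cov(joint, rowvar=False) + ridge * np.eye(joint.shape[1])
    _, logdet_t = np.linalg.slogdet(cov[:d_t, :d_t])
    _, logdet_s = np.linalg.slogdet(cov[d_t:, d_t:])
    _, logdet_joint = np.linalg.slogdet(cov)
    # I(T;S) = h(T) + h(S) - h(T,S); the (2*pi*e) terms cancel out.
    return 0.5 * (logdet_t + logdet_s - logdet_joint)


rng = np.random.default_rng(0)
n, d = 2000, 4
sources = rng.normal(size=(n, d))
# Toy "faithful" summarizer: summary embedding = source embedding + small noise.
faithful = sources + 0.1 * rng.normal(size=(n, d))
# Toy "uninformative" summarizer: summary embedding independent of the source.
unrelated = rng.normal(size=(n, d))

mi_faithful = gaussian_mutual_information(sources, faithful)
mi_unrelated = gaussian_mutual_information(sources, unrelated)
print(f"faithful: {mi_faithful:.3f} nats, unrelated: {mi_unrelated:.3f} nats")
```

In this toy setting the faithful summarizer should score far higher than the unrelated one. Real sentence embeddings are high-dimensional and far from Gaussian, which is one reason a more flexible density estimator is preferable in practice.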

Abstract

Assessing the quality of summarizers poses significant challenges. In response, we propose a novel task-oriented evaluation approach that assesses summarizers based on their capacity to produce summaries that are useful for downstream tasks, while preserving task outcomes. We theoretically establish a direct relationship between the resulting error probability of these tasks and the mutual information between source texts and generated summaries. We introduce $\texttt{COSMIC}$ as a practical implementation of this metric, demonstrating its strong correlation with human judgment-based metrics and its effectiveness in predicting downstream task performance. Comparative analyses against established metrics like $\texttt{BERTScore}$ and $\texttt{ROUGE}$ highlight the competitive performance of $\texttt{COSMIC}$.

Paper Structure (33 sections, 2 theorems, 29 equations, 12 figures, 5 tables, 1 algorithm)

Introduction
Related Work
A Task-Driven Evaluation Framework
Background: Probabilistic models for text summarization
Background: Information Theory
A task-oriented evaluation setting
A Task-Agnostic Quality Measure for Summarization
Theoretical Results
Estimating MI from samples
Experimental Settings
Downstream tasks
Correlation With Learnt Metrics
Experimental Results
Discussion of Quantitative Results
Summary and Discussion
...and 18 more sections

Key Result

Proposition 1

Let $C = c(\mathbf{T})$ denote the underlying concept variable and let $\widehat{c}(\mathbf{S})$ be the Bayes predictor of $C$ observing the output summaries $\mathbf{S}$, based on the underlying summarization model $p_\theta (\mathbf{s}|\mathbf{t})$. The expected error rate satisfies upper and lower bounds that are monotonic functions of $I(\mathbf{T};\mathbf{S})$, in which $\kappa \in (0,1)$ is a constant which does not depend on summaries; and $R_{C,\ell_{01}}^{-1}(\cdot)$ is ...
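
As background for the proposition, the following standard information-theoretic facts illustrate why $I(\mathbf{T};\mathbf{S})$ can control the error of an unknown task. This is a sketch of the textbook argument, not the paper's own bound. It assumes, for illustration, that $C$ takes values in a finite set $\mathcal{C}$ with $|\mathcal{C}| \ge 2$, writes $P_e = \Pr(C \neq \widehat{c}(\mathbf{S}))$, uses $h_b$ for the binary entropy function, and measures entropies in nats.

$$
\begin{aligned}
I(\mathbf{T};\mathbf{S}) &= H(\mathbf{T}) - H(\mathbf{T}\mid\mathbf{S}), \\
I(C;\mathbf{S}) &\le I(\mathbf{T};\mathbf{S}) && \text{(data processing, since } C - \mathbf{T} - \mathbf{S} \text{ is a Markov chain)}, \\
H(C\mid\mathbf{S}) &\le h_b(P_e) + P_e \log\big(|\mathcal{C}| - 1\big) && \text{(Fano's inequality)}, \\
P_e &\ge \frac{H(C) - I(\mathbf{T};\mathbf{S}) - \log 2}{\log |\mathcal{C}|}.
\end{aligned}
$$

The last line follows from the previous two by using $H(C\mid\mathbf{S}) = H(C) - I(C;\mathbf{S}) \ge H(C) - I(\mathbf{T};\mathbf{S})$, together with $h_b(P_e) \le \log 2$ and $\log(|\mathcal{C}|-1) \le \log|\mathcal{C}|$. A summarizer with low mutual information therefore imposes an error floor on any high-entropy task, whatever the task is.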

Figures (12)

Figure 1: The summarizer is expected to generate summaries $\mathbf{S}$ for a given distribution of source texts $\mathbf{T}$, without prior knowledge of the specific application, e.g., predicting the concepts $c_1,c_2,\dots$. The objective is to assess the discrepancy incurred when predicting from the generated summaries instead of the original source texts. We demonstrate that, for any undisclosed task $c$, the probability of error is constrained within bounds determined by monotonic functions of the MI.
Figure 2: Spearman correlation with human judgment as estimated by the SEAHORSE metrics. As one would expect, the MI correlates with Attribution and Main ideas but not with Comprehensible, Grammar, or Repetition.
Figure 3: Comparison of models of different sizes and strengths, fine-tuned on different datasets, in terms of the MI between their summaries and the source texts on different evaluation datasets. The dotted lines mark the highest MI reached by any model on each dataset, and the grey area represents the number of parameters of the model.
Figure 4: Correlation between the main metrics and different performance metrics for the different datasets. The MI (green) closely follows the behavior of the Attribution metric but is a better predictor of the performance of the downstream tasks (Sentiment analysis, GPT Detector, Topic classification, etc.). By contrast, ROUGE scores do not display consistent correlations. For metrics to be considered effective, they should consistently exhibit positive correlations with each other. The aggregated numerical results are reported in \ref{sec:table_radar}.
Figure 5: The correlation matrix illustrates the relationships among various metrics, with clustering based on their correlation similarity. This clustering indicates the degree of similarity between metrics in terms of their correlation with each other. The goal is to have a diverse set of grounded yet independent metrics that assess different aspects of text summarization quality.
...and 7 more figures
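
The captions of Figures 2 and 4 rely on rank correlation between metric scores. As a minimal, dependency-free sketch of how a Spearman correlation between two score lists can be computed, the snippet below ranks each list (averaging ranks for ties) and takes the Pearson correlation of the ranks. The helper names and the score values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every value tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five summarizers (illustrative values only).
mi_scores = [0.2, 1.1, 0.7, 2.4, 1.8]
attribution = [0.10, 0.40, 0.45, 0.90, 0.60]
rho = spearman(mi_scores, attribution)
print(f"Spearman correlation: {rho:.2f}")
```

Because only ranks matter, the result is unchanged by any monotonically increasing rescaling of either metric, which makes it a natural choice for comparing metrics that live on different scales.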

Theorems & Definitions (9)

Proposition 1: Information-theoretic bounds
proof
Remark 1
Theorem 1
Remark 2
Definition 1: Predictive conditional entropies
Definition 2: Predictive $\mathcal{F}$-information
Remark 3
Remark 4
