CASPR: Automated Evaluation Metric for Contrastive Summarization

Nirupan Ananthamurugan; Dat Duong; Philip George; Ankita Gupta; Sandeep Tata; Beliz Gunel

TL;DR

This work tackles the problem of automatically evaluating contrastive summaries that compare two entities by introducing CASPR, an NLI-based metric that operates on decomposed single-claim sentences. CASPR computes directional logical relationships via NLI, aggregates them into a summary-level score, and scales it to [0,100], aiming to capture true contrast beyond lexical similarity. Experiments on the CoCoTrip dataset show CASPR outperforms token-overlap (Distinctiveness Score) and inverted BERTScore in detecting logical contrast, including cases involving negation, while highlighting some limitations and avenues for improvement. The approach is lightweight, relies on off-the-shelf models, and has practical implications for rapid, automated evaluation in contrastive summarization tasks.
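The pipeline summarized above (directional NLI over single-claim sentences, aggregation, scaling to [0,100]) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's definition: `toy_nli` is a keyword heuristic standing in for an off-the-shelf NLI model, the label values in `LABEL_VALUE` are hypothetical, and the "strongest signal per sentence, then average both directions" aggregation is just one plausible choice.

```python
from typing import Callable, List

# Hypothetical label values; the source does not state the exact mapping.
LABEL_VALUE = {"contradiction": 1.0, "neutral": 0.0, "entailment": -1.0}


def toy_nli(premise: str, hypothesis: str) -> str:
    """Toy stand-in for an NLI model (assumption, not a real classifier)."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    if p == h:
        return "entailment"
    # Sentences whose token sets differ only by "not" are treated as contradictory.
    if p ^ h == {"not"}:
        return "contradiction"
    return "neutral"


def directional_score(src: List[str], tgt: List[str],
                      nli: Callable[[str, str], str]) -> float:
    """For each src sentence keep the strongest label value against any
    tgt sentence, then average over src. Result lies in [-1, 1]."""
    if not src or not tgt:
        raise ValueError("both summaries need at least one sentence")
    total = 0.0
    for s in src:
        values = [LABEL_VALUE[nli(s, t)] for t in tgt]
        total += max(values, key=abs)  # largest magnitude wins
    return total / len(src)


def contrast_score(a: List[str], b: List[str],
                   nli: Callable[[str, str], str] = toy_nli) -> float:
    """Average both directions, then map [-1, 1] linearly onto [0, 100]."""
    raw = 0.5 * (directional_score(a, b, nli) + directional_score(b, a, nli))
    return 50.0 * (raw + 1.0)


summary = ["the room is small", "the staff is friendly"]
negated = ["the room is not small", "the staff is not friendly"]
print(contrast_score(summary, summary))  # identical summaries: low end of the scale
print(contrast_score(summary, negated))  # negated summaries: high end of the scale
```

In practice `toy_nli` would be replaced by a call to a real NLI classifier, and the single-claim sentences would come from a decomposition step rather than being written by hand.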

Abstract

Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluations remains an open problem. Prior work has proposed a token-overlap based metric, Distinctiveness Score, to measure contrast, but it does not account for meaning-preserving lexical variations. In this work, we propose an automated evaluation metric, CASPR, to better measure contrast between a pair of summaries. Our metric is based on a simple and lightweight method that leverages the natural language inference (NLI) task to measure contrast by segmenting reviews into single-claim sentences and carefully aggregating NLI scores between them into a summary-level score. We compare CASPR with Distinctiveness Score and a simple yet powerful baseline based on BERTScore. Our results on a prior dataset, CoCoTrip, demonstrate that CASPR captures the contrastiveness of summary pairs more reliably than the baselines.
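The weakness of token overlap noted in the abstract can be seen in a small sketch. The function below is an assumed, simplified stand-in (100 times one minus the Jaccard overlap of the two token sets); it is not claimed to reproduce the Distinctiveness Score exactly, and the names `tokens` and `overlap_contrast` are hypothetical.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def overlap_contrast(summary_a: str, summary_b: str) -> float:
    """Assumed token-overlap contrast: 100 * (1 - |A & B| / |A | B|)."""
    a, b = tokens(summary_a), tokens(summary_b)
    union = a | b
    if not union:
        return 0.0  # two empty summaries: treat as no contrast
    return 100.0 * (1.0 - len(a & b) / len(union))


original = "The rooms are small."
paraphrase = "Guest bedrooms felt tiny."   # same meaning, no shared tokens
negation = "The rooms are not small."      # opposite meaning, one new token

print(overlap_contrast(original, paraphrase))  # maximal contrast despite same meaning
print(overlap_contrast(original, negation))    # roughly 20 despite opposite meaning
```

Under this simplification a paraphrase scores as fully contrastive while a direct negation scores as nearly identical, which is the kind of failure an NLI-based metric is meant to avoid.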

Paper Structure (20 sections, 10 equations, 4 figures, 6 tables)

Introduction
Approach
Baselines
Distinctiveness Score
Inverted BERTScore ($BS^{-1}$)
CASPR
Experiments
Dataset: CoCoTrip
Experimental Setup
Results
Related Work
Conclusion
Limitations
Sentence Decomposition
Experimental Datasets
...and 5 more sections

Figures (4)

Figure 1: Contrastive summaries for Hotel A and Hotel B, denoted as $A \setminus B$ and $B \setminus A$, respectively. The underlined sentences in the summaries assign different values ('small', 'good size') to the same aspect (room size), and are therefore contrastive. The italicized sentences in the summaries describe different aspects (breakfast and staff) -- also highlighting differences to help users decide.
Figure 2: Average contrast scores across Synthetic Low Contrast ($S^{A \setminus B}, P(S^{A \setminus B})$), Reference Similar ($S^{A \setminus B}_1, S^{A \setminus B}_2$), Reference Contrastive ($S^{A \setminus B}, S^{B \setminus A}$), and Synthetic High Contrast ($S^{A \setminus B}, \neg S^{A \setminus B}$) datasets. CASPR has a higher separation in contrast scores for all experiments, scores closest to 0 on Synthetic Low Contrast, and scores closest to 100 on Synthetic High Contrast, as desired.
Figure 3: Our System Prompt to gpt-3.5-turbo for the sentence splitting task. We also show an example from Reference Summary A of entity pair 3.
Figure 4: Paraphrase example with prompt input 'Paraphrase this' using text-davinci-003
