CASPR: Automated Evaluation Metric for Contrastive Summarization
Nirupan Ananthamurugan, Dat Duong, Philip George, Ankita Gupta, Sandeep Tata, Beliz Gunel
TL;DR
This work tackles the problem of automatically evaluating contrastive summaries that compare two entities by introducing CASPR, an NLI-based metric that operates on decomposed single-claim sentences. CASPR computes directional logical relationships via NLI, aggregates them into a summary-level score, and scales it to [0,100], aiming to capture true contrast beyond lexical similarity. Experiments on the CoCoTrip dataset show CASPR outperforms token-overlap (Distinctiveness Score) and inverted BERTScore in detecting logical contrast, including cases involving negation, while highlighting some limitations and avenues for improvement. The approach is lightweight, relies on off-the-shelf models, and has practical implications for rapid, automated evaluation in contrastive summarization tasks.
Abstract
Summarizing comparative opinions about entities (e.g., hotels, phones) from a set of source reviews, often referred to as contrastive summarization, can considerably aid users in decision making. However, reliably measuring the contrastiveness of the output summaries without relying on human evaluation remains an open problem. Prior work has proposed a token-overlap based metric, Distinctiveness Score, to measure contrast, but it does not account for meaning-preserving lexical variations. In this work, we propose an automated evaluation metric, CASPR, to better measure contrast between a pair of summaries. Our metric is based on a simple and lightweight method that leverages the natural language inference (NLI) task to measure contrast by segmenting reviews into single-claim sentences and carefully aggregating NLI scores between them into a summary-level score. We compare CASPR with Distinctiveness Score and a simple yet powerful baseline based on BERTScore. Our results on the existing CoCoTrip dataset demonstrate that CASPR captures the contrastiveness of summary pairs more reliably than the baselines.
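To make the pipeline described above concrete, the following is a minimal sketch of an NLI-based contrast score over two lists of single-claim sentences. It is an illustration under stated assumptions, not the paper's exact formulation: the per-claim rule (+1 for contradiction, -1 for entailment, 0 otherwise), the 0.5 confidence threshold, the averaging of the two directions, and the names `directional_contrast`, `contrast_score`, and `toy_nli` are all hypothetical. The `toy_nli` function is a rule-based stand-in for an off-the-shelf NLI model and only exists so the example runs without downloads.

```python
from typing import Callable, Dict, List

# Hypothetical interface: an NLI scorer maps (premise, hypothesis)
# to probabilities for "entailment", "neutral", and "contradiction".
Probs = Dict[str, float]
NliFn = Callable[[str, str], Probs]


def directional_contrast(src: List[str], tgt: List[str], nli: NliFn) -> float:
    """Average per-claim score of src against tgt, in [-1, 1].

    Assumed rule (not taken from the paper): for each claim in src, find the
    strongest non-neutral relation to any claim in tgt; score +1 for a
    confident contradiction, -1 for a confident entailment, 0 otherwise.
    """
    if not src:
        return 0.0
    total = 0.0
    for premise in src:
        best_label, best_prob = "neutral", 0.0
        for hypothesis in tgt:
            probs = nli(premise, hypothesis)
            for label in ("contradiction", "entailment"):
                if probs[label] > best_prob:
                    best_label, best_prob = label, probs[label]
        if best_prob > 0.5:  # assumed confidence threshold
            total += 1.0 if best_label == "contradiction" else -1.0
    return total / len(src)


def contrast_score(a: List[str], b: List[str], nli: NliFn) -> float:
    """Average both directions, then rescale from [-1, 1] to [0, 100]."""
    raw = 0.5 * (directional_contrast(a, b, nli) + directional_contrast(b, a, nli))
    return 50.0 * (raw + 1.0)


def toy_nli(premise: str, hypothesis: str) -> Probs:
    """Rule-based stand-in for a real NLI model (illustration only)."""
    def normalize(s: str) -> str:
        return s.lower().replace(" not ", " ").strip(". ")

    same_claim = normalize(premise) == normalize(hypothesis)
    negation_differs = (" not " in premise.lower()) != (" not " in hypothesis.lower())
    if same_claim and negation_differs:
        return {"entailment": 0.05, "neutral": 0.05, "contradiction": 0.9}
    if same_claim:
        return {"entailment": 0.9, "neutral": 0.05, "contradiction": 0.05}
    return {"entailment": 0.05, "neutral": 0.9, "contradiction": 0.05}


hotel_a = ["The room was clean.", "The staff was friendly."]
hotel_b = ["The room was not clean.", "The staff was friendly."]
print(contrast_score(hotel_a, hotel_b, toy_nli))
```

In this toy setup, one contradicted claim and one shared claim cancel out, so the pair lands in the middle of the scale. In practice, `toy_nli` would be replaced by a wrapper around a pretrained NLI classifier, and the aggregation rule would follow the definition given in the paper's method section.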
