Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics

Théo Gigant, Camille Guinaudeau, Marc Decombas, Frédéric Dufaux

TL;DR

This paper introduces a reference-free metric that correlates well with human-evaluated relevance while being very cheap to compute, and shows that it can also be used alongside reference-based metrics to improve their robustness in low-quality reference settings.

Abstract

Automatic metrics are used as proxies to evaluate abstractive summarization systems when human annotations are too expensive. To be useful, these metrics should be fine-grained, show a high correlation with human annotations, and ideally be independent of reference quality; however, most standard evaluation metrics for summarization are reference-based, and existing reference-free metrics correlate poorly with relevance, especially on summaries of longer documents. In this paper, we introduce a reference-free metric that correlates well with human-evaluated relevance, while being very cheap to compute. We show that this metric can also be used alongside reference-based metrics to improve their robustness in low-quality reference settings.

Paper Structure

This paper contains 19 sections, 3 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: System-level correlations with human judgement for our metric, depending on the number of summaries used for evaluation
  • Figure 2: System-level correlation with human evaluation of relevance, depending on the number of altered references (RAND-3 alteration).
  • Figure 3: Complementarity between metrics on SummEval
  • Figure 4: Length penalty $\alpha_{\hat{s}, d} = f(|\hat{s}|, |d|)$ with $f: (|\hat{s}|, |d|) \mapsto \frac{1}{1 + \exp\left(20 \cdot \frac{|\hat{s}|}{|d|} - 10\right)}$
  • Figure 5: Distribution of system-level correlations of our metric in different settings
  • ...and 5 more figures
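
The length penalty in the Figure 4 caption is a logistic function of the ratio between summary length and document length. The sketch below evaluates that formula directly to show its shape. The function name `length_penalty`, the overflow guard, and the unit of length (tokens, words, or characters) are assumptions made for illustration; the caption itself only gives the formula.

```python
import math


def length_penalty(summary_len: int, doc_len: int) -> float:
    """Evaluate 1 / (1 + exp(20 * |s| / |d| - 10)) from the Figure 4 caption.

    The unit of length is an assumption here; any consistent unit works,
    since only the ratio summary_len / doc_len enters the formula.
    """
    if doc_len <= 0:
        raise ValueError("doc_len must be positive")
    x = 20.0 * summary_len / doc_len - 10.0
    # Illustrative guard (not from the source): math.exp overflows for very
    # large arguments, where the penalty is numerically zero anyway.
    if x > 700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


if __name__ == "__main__":
    # Print the penalty for a few summary/document length ratios.
    for ratio_percent in (0, 25, 50, 75, 100):
        print(ratio_percent, length_penalty(ratio_percent, 100))
```

Under this formula the penalty stays close to 1 for summaries much shorter than the document, equals exactly 0.5 when the summary is half the document's length, and approaches 0 as the summary nears the full document length.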