SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes

Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, William Yang Wang

TL;DR

SEScore2 is a self-supervised, retrieval-augmented approach to learning a general text-generation evaluation metric. It synthesizes realistic mistakes from retrieved neighbor sentences, producing training data without human ratings. A regression head predicts a severity-weighted quality score for each (reference, hypothesis) pair. The model is trained on large multilingual machine-translation-style data and evaluated on machine translation (MT), speech translation (ST), data-to-text (D2T), and dialogue tasks, where it shows strong Kendall correlations and robustness to domain shifts. The method outperforms unsupervised baselines and rivals or surpasses some supervised metrics, and ablations highlight the value of retrieval-augmented (RA) perturbations and severity measures. The approach scales to multiple languages and domains. The authors acknowledge limitations in evaluating severity labels and in open-ended generation, which they leave to future work.

Abstract

Is it possible to train a general metric for evaluating text generation quality without human-annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on specific tasks. In this paper, we propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. The primary advantage of SESCORE2 is its ease of extension to many other languages while providing reliable severity estimation. We evaluate SESCORE2 and previous methods on four text generation tasks across three languages. SESCORE2 outperforms the unsupervised metric PRISM on four text generation evaluation benchmarks, with an improvement in Kendall correlation of 0.078. Surprisingly, SESCORE2 even outperforms the supervised metrics BLEURT and COMET on multiple text generation tasks. The code and data are available at https://github.com/xu1998hz/SEScore2.

Paper Structure (38 sections, 1 equation, 5 figures, 14 tables)

Figures (5)

  • Figure 1: The 4-point star represents the anchor sentence. Circles and triangles represent sentences with minor and major mistakes, respectively; both are hard negatives. Green stars are easy negatives produced by random token transformations. Inner circles indicate harder negative samples.
  • Figure 2: Retrieval Augmented Synthesis: the anchor text, the selected neighbor, and the synthesized text are shown as a blue star, a circle, and a triangle, respectively. We randomly select a subset of the proposed transformations (ticks) and estimate a severity measure (SE) for each. The final score is the sum of the individual severity measures.
  • Figure 3: The Chinese source text means 'I like dogs'. First, our retrieval augmented synthesis replaces 'dog' with 'cat'. Next, 'cat' is replaced by the special token '</s>', and we estimate the probability of recovering 'cat' from '</s>' given the source and target context. Finally, we apply a threshold to assign major and minor labels.
  • Figure 4: Left: comparison between SEScore2 trained with retrieval augmented synthesis and SEScore2 trained with random token transformations. Middle and right: the contribution of individual operations to the final SEScore2 and the effect of severity measures in the News and TED domains. W.S denotes with severity measures and Wo.S denotes without severity measures.
  • Figure 5: Kendall correlations on the multi-dimensional WebNLG and BAGEL benchmarks. We select the four top-performing metrics.
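
The severity labeling described in Figure 3 and the score summation described in Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Edit` class, the function names, the threshold of 0.1, the penalties of -1 (minor) and -5 (major), and the rule that a low recovery probability means a major mistake are all hypothetical choices made for this example. The recovery probabilities are supplied by hand here; in practice they would come from a pretrained model.

```python
from dataclasses import dataclass

# Assumed penalties for illustration only (not taken from the text above).
MINOR_PENALTY = -1.0
MAJOR_PENALTY = -5.0


@dataclass
class Edit:
    """One synthesized transformation applied to the anchor text (hypothetical type)."""
    op: str              # e.g. "replace", "insert", "delete"
    token: str           # the token introduced or affected by the edit
    recover_prob: float  # probability of recovering the token from the masked position


def severity_label(recover_prob: float, threshold: float = 0.1) -> str:
    """Threshold the recovery probability into a severity label.

    Assumption: a token that is unlikely in context (probability below the
    threshold) is treated as a major mistake; otherwise it is minor.
    """
    return "minor" if recover_prob >= threshold else "major"


def severity_score(edits: list[Edit], threshold: float = 0.1) -> float:
    """Sum the per-edit severity penalties to get the score of a synthesized text."""
    total = 0.0
    for edit in edits:
        label = severity_label(edit.recover_prob, threshold)
        total += MINOR_PENALTY if label == "minor" else MAJOR_PENALTY
    return total


# Two edits: an implausible replacement (major) and a plausible insertion (minor).
edits = [
    Edit(op="replace", token="cat", recover_prob=0.02),
    Edit(op="insert", token="the", recover_prob=0.60),
]
print(severity_score(edits))
```

An unperturbed sentence has no edits and therefore scores 0.0 in this sketch; each additional mistake lowers the score, so lower values mark worse synthesized text.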