X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects

Minqian Liu, Ying Shen, Zhiyang Xu, Yixin Cao, Eunah Cho, Vaibhav Kumar, Reza Ghanadan, Lifu Huang

TL;DR

X-Eval presents a generalizable, two-stage instruction-tuning framework for fine-grained NLG evaluation across seen and unseen aspects. It builds AspectInstruct to train an evaluator on 27 aspects across 65 tasks and augments tasks with auxiliary aspect evidence to exploit inter-aspect connections, using a verbalizer to feed natural-language auxiliary results during training and a top-k similarity-based selection during inference. Empirically, X-Eval with a lightweight 780M model matches or surpasses several lightweight baselines and approaches GPT-4 on multiple tasks (dialogue, summarization, data-to-text) while remaining reference-free and open-source. The work demonstrates strong zero-shot generalization to unseen aspects and tasks, with efficiency gains and flexible customization for practical NLG evaluation scenarios.

Abstract

Natural Language Generation (NLG) typically involves evaluating the generated text in various aspects (e.g., consistency and naturalness) to obtain a comprehensive assessment. However, multi-aspect evaluation remains challenging as it may require the evaluator to generalize to any given evaluation aspect, even one that is absent during training. In this paper, we introduce X-Eval, a two-stage instruction tuning framework to evaluate text in both seen and unseen aspects customized by end users. X-Eval consists of two learning stages: a vanilla instruction tuning stage that improves the model's ability to follow evaluation instructions, and an enhanced instruction tuning stage that exploits the connections between fine-grained evaluation aspects to better assess text quality. To support the training of X-Eval, we collect AspectInstruct, the first instruction tuning dataset tailored for multi-aspect NLG evaluation, spanning 27 diverse evaluation aspects with 65 tasks. To enhance task diversity, we devise an augmentation strategy that converts human rating annotations into diverse forms of NLG evaluation tasks, including scoring, comparison, ranking, and Boolean question answering. Extensive experiments across three essential categories of NLG tasks (dialogue generation, summarization, and data-to-text), coupled with 21 aspects in meta-evaluation, demonstrate that X-Eval enables even a lightweight language model to achieve a comparable, if not higher, correlation with human judgments than state-of-the-art NLG evaluators such as GPT-4.
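
The augmentation strategy mentioned in the abstract (turning human ratings into scoring, comparison, ranking, and Boolean question answering tasks) can be illustrated with a minimal sketch. The prompt wording, the `build_tasks` function, the dictionary layout, and the 0.5 threshold for the Boolean label are all assumptions made for illustration; the abstract does not specify the actual templates used in AspectInstruct.

```python
from itertools import combinations

def build_tasks(samples, aspect, threshold=0.5):
    """Convert (text, human_score) pairs into four task forms.

    samples: list of (text, score) with scores assumed to lie in [0, 1].
    All prompt templates and the threshold are hypothetical.
    """
    tasks = []

    for text, score in samples:
        # Scoring: predict the human rating directly.
        tasks.append({
            "type": "scoring",
            "input": f"Rate the {aspect} of the text from 0 to 1.\nText: {text}",
            "target": f"{score:.2f}",
        })
        # Boolean QA: binarize the rating with an assumed threshold.
        tasks.append({
            "type": "boolean_qa",
            "input": f"Is the text good in terms of {aspect}?\nText: {text}",
            "target": "Yes" if score >= threshold else "No",
        })

    # Comparison: every pair of texts whose ratings differ.
    for (text_a, score_a), (text_b, score_b) in combinations(samples, 2):
        if score_a == score_b:
            continue  # ties carry no preference signal
        tasks.append({
            "type": "comparison",
            "input": f"Which text has better {aspect}?\nA: {text_a}\nB: {text_b}",
            "target": "A" if score_a > score_b else "B",
        })

    # Ranking: order all candidates from best to worst (1-based indices).
    order = sorted(range(len(samples)), key=lambda i: samples[i][1], reverse=True)
    listing = "\n".join(f"{i + 1}: {text}" for i, (text, _) in enumerate(samples))
    tasks.append({
        "type": "ranking",
        "input": f"Rank the texts by {aspect}, best first.\n{listing}",
        "target": " > ".join(str(i + 1) for i in order),
    })
    return tasks

# Toy ratings invented for this example, not taken from any dataset.
toy_samples = [("response one", 0.2), ("response two", 0.9), ("response three", 0.5)]
toy_tasks = build_tasks(toy_samples, "coherence")
```

A single set of ratings for one aspect therefore yields several instruction-style training examples, which is the sense in which the augmentation increases task diversity.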


Paper Structure

This paper contains 55 sections, 2 equations, 6 figures, 17 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Illustration of X-Eval for multiple seen and unseen fine-grained evaluation aspects across various NLG tasks. The unseen aspect (i.e., Interestingness) is shown in italics. The text to be evaluated is underlined. In this example, each evaluation score ranges from 0 to 1, and a higher score indicates better quality.
  • Figure 2: Illustration of our X-Eval framework. The left section depicts our two-stage training approach: vanilla instruction tuning on diverse tasks and subsequent training on instruction tasks enriched with auxiliary aspects. The right section illustrates the inference pipeline with auxiliary aspects.
  • Figure 3: Effect of the scale of language model backbones. For each meta-evaluation benchmark, we report the average Spearman correlation over all aspects. X-Eval-large (780M) is the default backbone language model in all experiments unless otherwise specified.
  • Figure 4: The scatter plots of correlation between human scores and predicted scores of X-Eval and Flan-T5, respectively.
  • Figure 5: Cosine similarity scores of the sentence embeddings of aspect definition in turn-level dialogue evaluation. Naturalness (NAT), coherence (COH), engagingness (ENG), and groundedness (GRO) are seen aspects, while the rest are unseen aspects.
  • ...and 1 more figure
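
The top-k similarity-based selection of auxiliary aspects at inference, together with the cosine similarity between aspect-definition embeddings shown in Figure 5, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the three-dimensional vectors are made-up stand-ins for real sentence embeddings of aspect definitions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def select_auxiliary_aspects(target_vec, seen_vecs, k=2):
    """Return the k seen aspects most similar to the target aspect.

    target_vec: embedding of the target aspect's definition.
    seen_vecs:  dict mapping seen-aspect name -> definition embedding.
    """
    scored = sorted(
        ((cosine(target_vec, vec), name) for name, vec in seen_vecs.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

# Toy vectors invented for this example; a real system would obtain them
# from a sentence encoder applied to each aspect's textual definition.
seen_aspects = {
    "naturalness": [1.0, 0.0, 0.0],
    "coherence": [0.0, 1.0, 0.0],
    "engagingness": [0.6, 0.0, 0.8],
    "groundedness": [0.0, 0.6, 0.8],
}
interestingness = [0.5, 0.1, 0.85]  # stands in for an unseen aspect

selected = select_auxiliary_aspects(interestingness, seen_aspects, k=2)
```

With these toy vectors, the unseen aspect is closest to engagingness and then groundedness, so those two would be evaluated first and their results supplied as auxiliary evidence when scoring the target aspect.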