RankME: Reliable Human Ratings for Natural Language Generation

Jekaterina Novikova, Ondřej Dušek, Verena Rieser

TL;DR

The paper addresses unreliable human judgements in NLG evaluation, which it attributes partly to experimental design rather than only to individual rater preferences. It introduces RankME, a rank-based magnitude estimation method that combines continuous scales with relative assessments to improve the reliability and consistency of ratings and to enable multi-criteria evaluation. The authors show that separating evaluation criteria reduces cross-criterion correlations, that continuous scales boost inter-annotator agreement, and that RankME can be paired with TrueSkill for scalable system ranking. The approach yields more discriminative evaluation data and offers a cost-effective alternative for ranking multiple NLG systems.

Abstract

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.
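
To make the idea of combining continuous scales with relative assessments concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single annotator has given continuous magnitude scores to the outputs of several systems for the same input, and it converts those scores into pairwise win/loss/tie outcomes of the kind a Bayesian ranker could consume. The function name `pairwise_outcomes`, the system labels, and the score values are all hypothetical.

```python
from itertools import combinations


def pairwise_outcomes(scores):
    """Convert one annotator's magnitude scores into pairwise outcomes.

    `scores` maps a (hypothetical) system name to the continuous score the
    annotator gave that system's output for a single input. Only the relative
    order of the scores is used, so annotators may use different scales.
    """
    outcomes = []
    # Iterate over system pairs in a fixed (alphabetical) order.
    for a, b in combinations(sorted(scores), 2):
        if scores[a] > scores[b]:
            outcomes.append((a, b, "win"))
        elif scores[a] < scores[b]:
            outcomes.append((a, b, "loss"))
        else:
            outcomes.append((a, b, "tie"))
    return outcomes


# Illustrative scores only; these are not data from the paper.
ratings = {"sys_A": 150.0, "sys_B": 100.0, "sys_C": 100.0}
for first, second, result in pairwise_outcomes(ratings):
    print(first, second, result)
```

Each outcome is reported from the point of view of the first system in the pair. Feeding such outcomes to a ranking model such as TrueSkill is left out here, since the exact setup used in the paper is not described in this summary.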

Paper Structure

This paper contains 7 sections and 4 tables.