WRDScore: New Metric for Evaluation of Natural Language Generation Models

Ravil Mussabayev

TL;DR

WRDScore introduces a normalized, precision-recall-based evaluation metric for natural language generation that leverages Kantorovich optimal transport to softly align token distributions. By defining precision as weighted token inclusion and recall via the optimal transport cost, the metric balances semantic and syntactic variation and remains interpretable on a 0–1 scale. The authors demonstrate, through human evaluations on Java method-name data, that WRDScore correlates more closely with human judgments and outperforms ROUGE and BERTScore in key settings. The work includes a reproducibility package and emphasizes practical applicability to method-name prediction and code-related NLG tasks.

Abstract

Evaluating natural language generation models, particularly for method name prediction, poses significant challenges. A robust metric must account for the versatility of method naming, considering both semantic and syntactic variations. Traditional overlap-based metrics, such as ROUGE, fail to capture these nuances. Existing embedding-based metrics often suffer from imbalanced precision and recall, lack normalized scores, or make unrealistic assumptions about sequences. To address these limitations, we leverage the theory of optimal transport and construct WRDScore, a novel metric that strikes a balance between simplicity and effectiveness. In the WRDScore framework, we define precision as the maximum degree to which the predicted sequence's tokens are included in the reference sequence, token by token. Recall is calculated as the total cost of the optimal transport plan that maps the reference sequence to the predicted one. Finally, WRDScore is computed as the harmonic mean of precision and recall, balancing these two complementary metrics. Our metric is lightweight, normalized, and precision-recall-oriented, avoiding unrealistic assumptions while aligning well with human judgments. Experiments on a human-curated dataset confirm the superiority of WRDScore over other available text metrics.
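
The three quantities described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: tokens are represented by embedding vectors supplied by the caller, similarity is cosine similarity clipped to [0, 1], every token carries uniform weight, and the "total cost" of the transport plan is read as the total similarity carried by the plan so that a higher recall is better. The function names (`similarity_matrix`, `transport_recall`, `wrdscore_sketch`) are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def similarity_matrix(ref_emb, pred_emb):
    """Cosine similarity between reference rows and predicted rows.

    Assumption: negative similarities are clipped to 0 so that every
    entry, and therefore every score below, stays in [0, 1].
    Embeddings are assumed to be non-zero vectors.
    """
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    pred = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    return np.clip(ref @ pred.T, 0.0, 1.0)


def transport_recall(sim):
    """Solve a Kantorovich transport problem from reference to prediction.

    Assumption: uniform mass 1/n on reference tokens and 1/m on
    predicted tokens, with cost = 1 - similarity. Because the plan has
    total mass 1, the similarity it carries equals 1 - (minimal cost).
    """
    n, m = sim.shape
    cost = (1.0 - sim).ravel()  # plan T is flattened row-major: T[i, j] -> i * m + j

    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0  # row i of the plan sums to 1/n
    for j in range(m):
        a_eq[n + j, j::m] = 1.0           # column j of the plan sums to 1/m
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    result = linprog(cost, A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return float(np.clip(1.0 - result.fun, 0.0, 1.0))


def wrdscore_sketch(ref_emb, pred_emb):
    """Return (precision, recall, harmonic mean) under the assumptions above."""
    sim = similarity_matrix(np.asarray(ref_emb, float), np.asarray(pred_emb, float))
    # Precision: each predicted token takes its best match in the reference,
    # averaged with uniform weights (an assumption of this sketch).
    precision = float(sim.max(axis=0).mean())
    recall = transport_recall(sim)
    total = precision + recall
    score = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, score


# Toy, made-up 2-D "embeddings" for a reference and a predicted method name.
reference = [[1.0, 0.0], [0.0, 1.0]]
predicted = [[1.0, 0.1], [0.2, 1.0], [1.0, 1.0]]
print(wrdscore_sketch(reference, predicted))
```

The asymmetry is the point of the design as the abstract describes it: precision lets each predicted token pick its best reference match independently, while the transport plan forces every reference token's mass to be accounted for, so a prediction cannot score well on recall by matching only part of the reference.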
