compare-mt: A Tool for Holistic Comparison of Language Generation Systems

Graham Neubig, Zi-Yi Dou, Junjie Hu, Paul Michel, Danish Pruthi, Xinyi Wang, John Wieting

TL;DR

This paper introduces compare-mt, an open-source toolkit for holistic evaluation of language-generation systems that addresses the opacity of standard metrics. It provides aggregate-score, bucketed, n-gram difference, and sentence-example analyses, plus advanced features like label-wise abstraction, source-side analysis, and log-likelihood inspection. By enabling rapid identification of salient system differences, compare-mt guides targeted, fine-grained analyses and improvements. The authors demonstrate practical utility through case studies and emphasize the tool's extensibility and public availability.

Abstract

In this paper, we describe compare-mt, a tool for holistic analysis and comparison of the results of systems for language generation tasks such as machine translation. The main goal of the tool is to give the user a high-level and coherent view of the salient differences between systems that can then be used to guide further analysis or system improvement. It implements a number of tools to do so, such as analysis of accuracy of generation of particular types of words, bucketed histograms of sentence accuracies or counts based on salient characteristics, and extraction of characteristic $n$-grams for each system. It also has a number of advanced features such as use of linguistic labels, source side data, or comparison of log likelihoods for probabilistic models, and also aims to be easily extensible by users to new types of analysis. The code is available at https://github.com/neulab/compare-mt
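As a rough illustration of the "accuracy of generation of particular types of words" analysis mentioned above, the following minimal sketch computes word-level F-measure bucketed by training-set frequency. It is not taken from compare-mt itself: the function names `bucket_of` and `bucketed_word_fmeasure`, the whitespace tokenization, and the bucket boundaries are all assumptions made for this example.

```python
from collections import Counter


def bucket_of(freq, bounds):
    """Return a label for the half-open frequency bucket containing `freq`."""
    for lo, hi in zip(bounds, bounds[1:]):
        if lo <= freq < hi:
            return f"[{lo},{hi})"
    return f"[{bounds[-1]},inf)"


def bucketed_word_fmeasure(refs, hyps, train_freq, bounds=(0, 1, 5, 100)):
    """Word F-measure per training-frequency bucket (hypothetical helper).

    refs, hyps: parallel lists of whitespace-tokenized sentences.
    train_freq: mapping from word to its count in the training data.
    """
    matched, ref_total, hyp_total = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        ref_counts = Counter(ref.split())
        hyp_counts = Counter(hyp.split())
        for word, n in ref_counts.items():
            ref_total[bucket_of(train_freq.get(word, 0), bounds)] += n
        for word, n in hyp_counts.items():
            bucket = bucket_of(train_freq.get(word, 0), bounds)
            hyp_total[bucket] += n
            # Clipped match: a word is credited at most as often as it
            # appears in the reference sentence.
            matched[bucket] += min(n, ref_counts[word])

    scores = {}
    for bucket in set(ref_total) | set(hyp_total):
        precision = matched[bucket] / hyp_total[bucket] if hyp_total[bucket] else 0.0
        recall = matched[bucket] / ref_total[bucket] if ref_total[bucket] else 0.0
        denom = precision + recall
        scores[bucket] = 2 * precision * recall / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Toy data, invented for this example only.
    train_freq = {"the": 50, "cat": 3, "sat": 1}
    scores = bucketed_word_fmeasure(["the cat sat"], ["the dog sat"], train_freq)
    for bucket, f1 in sorted(scores.items()):
        print(bucket, round(f1, 3))
```

Running the same function on the outputs of two systems and comparing the per-bucket scores gives the kind of view the abstract describes, for instance whether one system is better at producing rare words.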

Paper Structure

This paper contains 15 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Workflow of using compare-mt for analysis of two systems
  • Figure 2: Analysis of word F-measure bucketed by frequency in the training set.
  • Figure 3: BLEU scores bucketed by sentence length.
  • Figure 4: Counts of sentences by length difference between the reference and the output.
  • Figure 5: Counts of sentences by sentence-level BLEU bucket.
  • ...and 2 more figures