MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation

Javier García Gilabert, Carlos Escolano, Audrey Mash, Xixian Liao, Maite Melero

TL;DR

MT-LENS addresses the need for MT evaluation that goes beyond traditional quality metrics by integrating bias, toxicity, and robustness assessments into a single framework. It builds on LM-eval-harness, supports diverse datasets and a broad set of metrics, and provides an interactive UI for per-segment and system-level analysis. Its key contributions are a unified pipeline for running MT tasks, bootstrapped significance testing, and coverage of evaluation axes such as gender bias and added toxicity. With it, researchers and engineers can diagnose specific error types, compare systems with statistical significance tests, and study MT behavior under perturbations such as misspellings, which supports more responsible and credible translation systems.
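To make the idea of bootstrapped significance testing concrete, here is a minimal sketch of paired bootstrap resampling over segment-level scores. It is not taken from MT-LENS: the function name `paired_bootstrap`, its parameters, and the choice to average per-segment scores are all assumptions made for illustration. Corpus-level metrics such as BLEU would instead need to be recomputed on each resample.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A's mean score beats system B's.

    scores_a and scores_b are hypothetical per-segment metric scores for the
    same test segments, in the same order (higher is better).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_resamples):
        # Draw segment indices with replacement; reuse them for both systems
        # so that each resample compares the systems on the same segments.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_resamples
```

A value close to 1.0 suggests that system A outscores system B consistently across resampled test sets, while a value near 0.5 suggests the observed difference could be due to the particular choice of test segments.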

Abstract

We introduce MT-LENS, a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks, including translation quality, gender bias detection, added toxicity, and robustness to misspellings. While several toolkits have become very popular for benchmarking the capabilities of Large Language Models (LLMs), existing evaluation tools often lack the ability to thoroughly assess the diverse aspects of MT performance. MT-LENS addresses these limitations by extending the capabilities of LM-eval-harness for MT, supporting state-of-the-art datasets and a wide range of evaluation metrics. It also offers a user-friendly platform to compare systems and analyze translations with interactive visualizations. MT-LENS aims to broaden access to evaluation strategies that go beyond traditional translation quality evaluation, enabling researchers and engineers to better understand the performance of an NMT model and to easily measure a system's biases.

Paper Structure

This paper contains 17 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Segment comparison with error spans produced by madlad-400-3B and NLLB-3.3B systems.
  • Figure 2: An image from the Perturbations page in the MT-Lens UI. Users can navigate between the following options: (1) Overview, (2) Translation, (3) Added Toxicity, (4) Gender Bias, and (5) Perturbations.