MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation
Javier García Gilabert, Carlos Escolano, Audrey Mash, Xixian Liao, Maite Melero
TL;DR
MT-LENS addresses the need for MT evaluation that goes beyond traditional quality metrics by integrating bias, toxicity, and robustness assessments into a single framework. It extends LM-eval-harness for MT, supporting diverse datasets and a broad set of metrics, and adds an interactive UI for per-segment and system-level analysis. Key contributions include a unified pipeline for running MT tasks, bootstrapped significance testing, and coverage of evaluation axes such as gender bias and added toxicity. The framework lets researchers and engineers diagnose specific error types, compare systems with statistical significance testing, and study MT behavior under perturbations such as misspellings, in support of more responsible and reliable translation systems.
Abstract
We introduce MT-LENS, a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks, including translation quality, gender bias detection, added toxicity, and robustness to misspellings. While several toolkits have become very popular for benchmarking the capabilities of Large Language Models (LLMs), existing evaluation tools often lack the ability to thoroughly assess the diverse aspects of MT performance. MT-LENS addresses these limitations by extending the capabilities of LM-eval-harness for MT, supporting state-of-the-art datasets and a wide range of evaluation metrics. It also offers a user-friendly platform to compare systems and analyze translations with interactive visualizations. MT-LENS aims to broaden access to evaluation strategies that go beyond traditional translation quality evaluation, enabling researchers and engineers to better understand the performance of an NMT model and to easily measure a system's biases.
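The bootstrapped significance testing mentioned in the TL;DR can be illustrated with a minimal sketch of paired bootstrap resampling. This is not MT-LENS's own implementation or API: the function name `paired_bootstrap`, its parameters, and the choice to average per-segment scores are all assumptions made for illustration. Corpus-level metrics such as BLEU would instead need to be recomputed on each resampled set.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Hypothetical helper: fraction of resamples in which system A beats B.

    scores_a, scores_b: per-segment metric scores (higher is better),
    aligned so that index i refers to the same source segment.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be aligned and equally long")
    if not scores_a:
        raise ValueError("score lists must be non-empty")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_resamples):
        # Draw segment indices with replacement; reuse the same indices
        # for both systems so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_resamples


if __name__ == "__main__":
    # Made-up scores, purely for demonstration.
    system_a = [0.82, 0.75, 0.91, 0.68, 0.88]
    system_b = [0.80, 0.77, 0.85, 0.70, 0.83]
    print(paired_bootstrap(system_a, system_b))
```

A win rate close to 1.0 suggests that A's advantage is stable across resampled test sets, while a value near 0.5 suggests the observed difference could be due to the particular choice of segments.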
