MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task

Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, Markus Freitag

TL;DR

This work addresses scalable automatic MT evaluation with two systems built on Gemma 3: MetricX-25, an encoder-only regression metric that predicts MQM and ESA scores from a unified input with score-type indicators, and GemSpanEval, a decoder-only generative model that outputs error spans, together with their context, as JSON. MetricX-25 achieves higher segment-level correlations than its predecessor, MetricX-24, and benefits from a two-stage training regime that blends DA and MQM data with synthetic examples. GemSpanEval is competitive with xCOMET and shows that adding context to non-unique spans helps disambiguate errors. Both systems are trained exclusively on public WMT data (2015–2024) and validated on WMT24 data, which supports the viability of a single multilingual foundation model for both scoring and error-span generation in MT evaluation. For real-world MT evaluation pipelines, the two systems together offer hybrid-input quality scoring and structured error feedback, making evaluation results more precise and actionable.

Abstract

In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, which adapts Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and that it significantly outperforms its predecessor. We also show that our decoder-only GemSpanEval model is competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.

Paper Structure

This paper contains 25 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Example MetricX-25 model input.
  • Figure 2: Example translation with non-unique error spans, where span context text is included.
  • Figure 3: Example prompt and response for AutoMQM error span identification. We omit the error span attribute is_source_error for brevity. Each span that is not unique receives an additional attribute span_with_context.
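
The role of span_with_context described in the Figure 3 caption can be illustrated with a minimal sketch of how a consumer of the model's JSON output might map a predicted span back to character offsets. This is not the authors' implementation. Only the attribute name span_with_context comes from the caption; the field names "span", "severity" and "category", the JSON layout, and the helper locate_span are hypothetical and chosen for illustration.

```python
import json
from typing import Tuple


def locate_span(translation: str, error: dict) -> Tuple[int, int]:
    """Return (start, end) character offsets of an error span in `translation`.

    Assumed (hypothetical) error format: {"span": ..., "span_with_context": ...},
    where "span_with_context" is present only when the span text is not unique.
    """
    span = error["span"]
    context = error.get("span_with_context")
    if context is None:
        # No context given: the span text itself must occur exactly once.
        if translation.count(span) != 1:
            raise ValueError(f"span {span!r} is missing or ambiguous without context")
        start = translation.index(span)
    else:
        # Context given: locate the (unique) context first, then the span inside it.
        if translation.count(context) != 1:
            raise ValueError(f"context {context!r} is missing or ambiguous")
        # Limitation of this sketch: takes the first occurrence of the span
        # inside the context, so the context should contain the span only once.
        start = translation.index(context) + context.index(span)
    return start, start + len(span)


# Hypothetical model response; "the" occurs twice in the translation,
# so the second error carries a disambiguating context.
translation = "The cat sat on the mat near the door."
response = json.loads("""
[
  {"span": "mat", "severity": "minor", "category": "accuracy"},
  {"span": "the", "span_with_context": "near the door",
   "severity": "major", "category": "fluency"}
]
""")
offsets = [locate_span(translation, err) for err in response]
```

Without the context string, the second span could refer to either lowercase "the" in the sentence. With it, the lookup resolves to the occurrence before "door".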