Textual Similarity as a Key Metric in Machine Translation Quality Estimation

Kun Sun, Rong Wang

TL;DR

This work introduces textual similarity, computed as the cosine similarity between multilingual sentence-transformer embeddings, as a key metric for MT quality estimation (QE). Using generalized additive mixed models (GAMMs) on the MLQE-PE and PreQuEL datasets, it shows that textual similarity correlates more strongly with human judgments than traditional metrics such as ML_eval and hter, and that its predictive value holds across language pairs. The findings argue for incorporating textual similarity into QE pipelines and MT system training to improve translation reliability and end-user usability, while also noting the advantages and limitations of existing metrics. Overall, textual similarity emerges as a practical, high-signal metric that strengthens QE when combined with other features in real-world MT systems.

Abstract

Machine Translation (MT) Quality Estimation (QE) assesses translation reliability without reference texts. This study introduces "textual similarity" as a new metric for QE, using sentence transformers and cosine similarity to measure semantic closeness. Analyzing data from the MLQE-PE dataset, we found that textual similarity correlates more strongly with human scores than traditional metrics (hter, model evaluation, sentence probability, etc.). Employing generalized additive mixed models (GAMMs) as a statistical tool, we demonstrated that textual similarity consistently outperforms other metrics across multiple language pairs in predicting human scores. We also found that "hter" failed to predict human scores in QE. Our findings highlight the effectiveness of textual similarity as a robust QE metric, and we recommend integrating it with other metrics into QE frameworks and MT system training for improved accuracy and usability.
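
To make the metric concrete, the sketch below computes cosine similarity between two embedding vectors. It is a minimal illustration, not the authors' implementation: the three-dimensional vectors and the names source_embedding, good_translation and poor_translation are hypothetical stand-ins for the embeddings that a multilingual sentence-transformer model would produce for a source sentence and its candidate translations.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption): real sentence embeddings have
# hundreds of dimensions and come from an encoder model, not hand-typed lists.
source_embedding = [0.2, 0.7, 0.1]
good_translation = [0.25, 0.65, 0.05]   # points in nearly the same direction
poor_translation = [0.9, 0.05, 0.4]     # points in a different direction

sim_good = cosine_similarity(source_embedding, good_translation)
sim_poor = cosine_similarity(source_embedding, poor_translation)

print(f"similarity to good translation: {sim_good:.3f}")
print(f"similarity to poor translation: {sim_poor:.3f}")
```

Under this reading, a translation whose embedding points in nearly the same direction as the source embedding scores close to 1, while a semantically distant one scores lower. Because the score needs only the source and the MT output, it fits the reference-free QE setting.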

Paper Structure (10 sections, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Correlations among various factors in MLQE-PE.
  • Figure 2: The partial effects of different factors on human score for MLQE-PE. The x-axis represents the metric being analyzed, while the y-axis indicates the human score. Each curve illustrates the relationship between a predictor variable (x-axis) and the response variable (human score). Steeper slopes indicate a stronger influence of the predictor on human score; gentler slopes indicate a weaker influence, meaning that changes in the predictor have a less pronounced effect. These plots show how each metric relates to human score.
  • Figure 3: The partial effects of different factors on human score for PreQuEL. The x-axis represents the metric being analyzed, while the y-axis indicates the human score. The curves are interpreted as in Figure 2, although the layout differs slightly from that figure. $\Delta$AIC values are compared among the three plots for the same language pairs. A lower $\Delta$AIC value indicates better performance.
  • Figure 4: Correlations among various factors in PreQuEL.