MTQE.en-he: Machine Translation Quality Estimation for English-Hebrew

Andy Rosenbaum, Assaf Siani, Ilan Kernerman

TL;DR

This work introduces MTQE.en-he, to the authors' knowledge the first publicly released MTQE benchmark for English–Hebrew. It comprises 959 segments, each with Direct Assessment scores from three expert annotators. The paper benchmarks ChatGPT prompting, TransQuest, and CometKiwi, and finds that an ensemble of the three outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. It also investigates fine-tuning strategies: full-model updates are prone to overfitting and distribution collapse, while parameter-efficient approaches (LoRA, BitFit, FTHead) train stably and improve results by 2–3 percentage points. Together, the dataset and results offer a resource for MTQE in a low-resource language pair and point to future directions such as synthetic data augmentation and calibration.

Abstract

We release MTQE.en-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew, and Direct Assessment scores of the translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.
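As a rough illustration of the evaluation described above, the following sketch averages per-annotator scores into a gold score, combines several models' predictions into an ensemble, and reports Pearson and Spearman correlations. The abstract does not state how the ensemble is formed; averaging z-normalized scores is an assumption made here for illustration, and all function names and toy numbers are hypothetical rather than taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def zscore(x):
    """Standardize scores so models on different scales are comparable."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / s for a in x]


def ensemble(*model_scores):
    """ASSUMPTION: ensemble = mean of z-normalized per-model scores."""
    normed = [zscore(s) for s in model_scores]
    return [sum(col) / len(col) for col in zip(*normed)]


# Toy data (made up): three annotators' DA scores for five segments.
annotators = [
    [90, 40, 75, 20, 60],
    [85, 50, 70, 30, 65],
    [95, 45, 80, 25, 55],
]
gold = [sum(col) / len(col) for col in zip(*annotators)]

# Toy predictions standing in for three QE systems on different scales.
model_a = [0.80, 0.55, 0.60, 0.30, 0.50]
model_b = [0.70, 0.20, 0.65, 0.25, 0.60]
model_c = [88.0, 52.0, 60.0, 35.0, 70.0]

combined = ensemble(model_a, model_b, model_c)
print("Pearson :", round(pearson(combined, gold), 3))
print("Spearman:", round(spearman(combined, gold), 3))
```

Normalizing before averaging keeps a model with a wide score range (here `model_c`) from dominating the combination; other schemes, such as a learned weighted sum, would fit the same interface.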

Paper Structure (17 sections, 8 figures, 6 tables)

This paper contains 17 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: MTQE.en-he dataset statistics (n=959). Left: Mean Score. Right: Number of words in English source.
  • Figure 2: Scatter plots of baseline model hypotheses on the full dataset (n=959).
  • Figure 3: Visualization of effects of fine-tuning methods (seed 0). Each row represents the specified fine-tuning method. Left: baseline TransQuest scatter plot on test set (n=559), the same for each row. Center: best checkpoint scatter plot on the same test set. Right: learning curve on train and validation during fine-tuning.
  • Figure 4: Baseline TransQuest Scores on validation set (seed 0). Starting point for fine-tuning plots in Figure 5.
  • Figure 5: Scatter plots of predicted vs. truth score of fine-tuning methods across epochs. The plots are on the validation set for seed 0 runs. Rows correspond to epochs; columns correspond to fine-tuning strategies.
  • ...and 3 more figures