Human Evaluation of English–Irish Transformer-Based NMT

Séamus Lankford, Haithem Afli, Andy Way

TL;DR

This work tackles English–Irish neural machine translation in a low-resource setting, comparing RNN baselines with Transformer variants while exploring subword tokenization via SentencePiece (BPE and unigram). By combining random hyperparameter search, subword-size experiments, and a dual human evaluation (SQM and MQM) with automatic metrics, the authors show that a Transformer model using a 16k BPE subword model achieves the strongest performance, significantly surpassing the RNN baseline and Google Translate on the DGT EN–GA task. The study demonstrates a strong correlation between automatic metrics and fine-grained human judgments, and highlights the impact of hyperparameters and subword choices on translation quality for morphologically rich, low-resource languages. Practically, these findings inform best practices for low-resource MT development, including careful subword selection, moderate regularization, and the value of explicit linguistic error analysis via MQM/SQM.

Abstract

In this study, a human evaluation is carried out on how hyperparameter settings impact the quality of Transformer-based Neural Machine Translation (NMT) for the low-resourced English–Irish pair. SentencePiece models using both Byte Pair Encoding (BPE) and unigram approaches were appraised. Variations in model architectures included modifying the number of layers, evaluating the optimal number of heads for attention and testing various regularisation techniques. The greatest performance improvement was recorded for a Transformer-optimized model with a 16k BPE subword model. Compared with a baseline Recurrent Neural Network (RNN) model, a Transformer-optimized model demonstrated a BLEU score improvement of 7.8 points. When benchmarked against Google Translate, our translation engines demonstrated significant improvements. Furthermore, a quantitative fine-grained manual evaluation was conducted which compared the performance of machine translation systems. Using the Multidimensional Quality Metrics (MQM) error taxonomy, a human evaluation of the error types generated by an RNN-based system and a Transformer-based system was explored. Our findings show the best-performing Transformer system significantly reduces both accuracy and fluency errors when compared with an RNN-based model.

Paper Structure (35 sections, 6 figures, 12 tables)

Figures (6)

  • Figure S1: The proposed approach to evaluating the baseline RNN and Transformer architectures. Using a random search approach, the values outlined in Table \ref{tab:hpo-table} were tested to determine the optimal hyperparameters. Short cycles of 5k training steps were applied to test a range of values for each parameter. Once an optimal value was identified within the sampled range, it was locked in for tests on subsequent parameters. A fine-grained human evaluation (HE) was conducted on the output from the DGT dataset and its results were compared with an automatic evaluation.
  • Figure S2: The core set of error categories proposed by the MQM guidelines.
  • Figure S3: BLEU performance for all model architectures is compared. The use of a BPE subword model improved translation performance in all cases. The best-performing model was built using a 16k BPE subword model on a Transformer architecture.
  • Figure S4: TER performance for all model architectures. The highest-performing model uses a 16k BPE subword model on a Transformer architecture. In all instances, incorporating a subword model improves TER.
  • Figure S5: Transformer baseline.
  • ...and 1 more figures
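
The search procedure described in the Figure S1 caption (sample values for one parameter, run a short training cycle for each, lock in the best value, then move to the next parameter) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `sequential_random_search`, the `SEARCH_SPACE` values, and the `toy_score` objective are all hypothetical stand-ins. In a real setup `evaluate` would train a model for a short cycle (5k steps in the caption) and return a validation metric such as BLEU.

```python
import random


def sequential_random_search(search_space, evaluate, samples_per_param=3, seed=0):
    """Tune one hyperparameter at a time, locking in the best value found
    before moving on to the next parameter."""
    rng = random.Random(seed)
    # Start from the first candidate of every parameter (an arbitrary default).
    locked = {name: values[0] for name, values in search_space.items()}
    for name, values in search_space.items():
        # Randomly sample a subset of this parameter's candidate values.
        candidates = rng.sample(values, k=min(samples_per_param, len(values)))
        best_value, best_score = None, float("-inf")
        for value in candidates:
            trial = {**locked, name: value}
            # Placeholder for a short training cycle plus evaluation.
            score = evaluate(trial)
            if score > best_score:
                best_value, best_score = value, score
        # Lock in the best sampled value for all subsequent trials.
        locked[name] = best_value
    return locked


# Hypothetical search space; these values are illustrative only.
SEARCH_SPACE = {
    "layers": [2, 4, 6, 8],
    "heads": [2, 4, 8],
    "dropout": [0.1, 0.2, 0.3, 0.4],
}


def toy_score(config):
    """Made-up objective with an arbitrary peak, standing in for a
    validation metric. It says nothing about real optimal settings."""
    return (
        -abs(config["layers"] - 6)
        - abs(config["heads"] - 8)
        - abs(config["dropout"] - 0.3)
    )


if __name__ == "__main__":
    print(sequential_random_search(SEARCH_SPACE, toy_score))
```

Because each parameter is fixed before the next one is explored, the number of training runs grows with the sum of the sampled values per parameter rather than their product. The trade-off is that interactions between parameters are not explored.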