Human Evaluation of English–Irish Transformer-Based NMT
Séamus Lankford, Haithem Afli, Andy Way
TL;DR
This work addresses English–Irish neural machine translation in a low-resource setting, comparing an RNN baseline with Transformer variants and exploring subword tokenization with SentencePiece (BPE and unigram). Combining random hyperparameter search, subword-size experiments, and two human evaluation methods (SQM and MQM) alongside automatic metrics, the authors show that a Transformer model using a 16k BPE subword model performs best, improving on the RNN baseline by 7.8 BLEU points and also outperforming Google Translate on the DGT EN–GA task. The study reports a strong correlation between automatic metrics and fine-grained human judgments, and highlights the effect of hyperparameter and subword choices on translation quality for a morphologically rich, low-resource language. In practice, these findings inform low-resource MT development: choose the subword model carefully, apply moderate regularization, and use explicit linguistic error analysis via MQM and SQM.
Abstract
This study presents a human evaluation of how hyperparameter settings affect the quality of Transformer-based Neural Machine Translation (NMT) for the low-resourced English–Irish pair. SentencePiece models using both Byte Pair Encoding (BPE) and unigram approaches were appraised. Variations in model architecture included modifying the number of layers, evaluating the optimal number of attention heads, and testing various regularisation techniques. The greatest performance improvement was recorded for an optimized Transformer model with a 16k BPE subword model. Compared with a baseline Recurrent Neural Network (RNN) model, the optimized Transformer model improved the BLEU score by 7.8 points. When benchmarked against Google Translate, our translation engines demonstrated significant improvements. Furthermore, a quantitative fine-grained manual evaluation was conducted to compare the performance of the machine translation systems. Using the Multidimensional Quality Metrics (MQM) error taxonomy, the error types generated by an RNN-based system and a Transformer-based system were examined. Our findings show that the best-performing Transformer system significantly reduces both accuracy and fluency errors when compared with an RNN-based model.
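As a rough illustration of the kind of search described here (number of layers, attention heads, regularisation, and subword model), the following minimal Python sketch draws random configurations and keeps the best-scoring one. It is a sketch under stated assumptions, not the procedure used in the study: the search-space values, the function names (`sample_config`, `random_search`, `toy_score`), and the placeholder objective are all hypothetical. A real run would train an NMT model for each configuration and return a validation metric such as BLEU.

```python
import random

# Hypothetical search space: these values are illustrative assumptions,
# not the settings explored in the study.
SEARCH_SPACE = {
    "layers": [2, 4, 6],
    "attention_heads": [2, 4, 8],
    "dropout": [0.1, 0.2, 0.3],
    "subword_model": ["bpe", "unigram"],
    "vocab_size": [4000, 8000, 16000, 32000],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from SEARCH_SPACE."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(score_fn, n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one.

    score_fn is a caller-supplied function that maps a configuration to a
    score where higher is better (for example, validation BLEU).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_score(config):
    """Placeholder objective so the sketch runs end to end.

    It does not reflect any measured result; replace it with a function
    that trains a model and returns its validation score.
    """
    return 0.1 * config["layers"] - config["dropout"]


if __name__ == "__main__":
    best, score = random_search(toy_score, n_trials=20)
    print(best, score)
```

Random search is used in this sketch because it needs no assumptions about how the hyperparameters interact; fixing the seed keeps a given run reproducible.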
