Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation

Víctor M. Sánchez-Cartagena; Juan Antonio Pérez-Ortiz; Felipe Sánchez-Martínez

TL;DR

The paper examines how word-level linguistic annotations affect under-resourced neural machine translation by systematically interleaving part-of-speech (POS) tags and morpho-syntactic description (MSD) tags into source-language (SL) inputs or target-language (TL) outputs across eight language pairs and two architectures. It shows that SL annotations improve sentence representations and that TL POS tags typically outperform TL MSD tags on automatic metrics, although MSD tags can improve grammaticality; combining SL MSD tags with TL POS tags often yields the best results. The study also reports that gains from interleaving scale with recurrent architectures but are less evident for Transformers in large-data settings, and it highlights a trade-off between tag prediction accuracy and surface-form generation. These findings offer practical guidance for using linguistic annotations in low-resource NMT: prefer SL MSD tags and TL POS tags, and handle TL morphology in an architecture-aware, decomposed way so that lexical accuracy does not degrade.

Abstract

This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation, for which there is incomplete evidence in the literature. The study covers eight language pairs, different training corpus sizes, two architectures, and three types of annotation: dummy tags (with no linguistic information at all), part-of-speech tags, and morpho-syntactic description tags, which consist of part of speech and morphological features. These linguistic annotations are interleaved in the input or output streams as a single tag placed before each word. To measure the performance under each scenario, we use automatic evaluation metrics and perform automatic error classification. Our experiments show that, in general, source-language annotations are helpful and morpho-syntactic descriptions outperform part of speech for some language pairs. In contrast, when words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics, even though the use of morpho-syntactic description tags improves the grammaticality of the output. We provide a detailed analysis of the reasons behind this result.
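
To make the interleaving scheme in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It places a single tag token before each word and shows the three annotation types side by side. The function names, the `<TAG>` dummy symbol, and the tag strings (for example `NOUN|Number=Plur`) are hypothetical and chosen only for illustration; the paper's actual tag inventory and tokenization are not given in this summary.

```python
from typing import List, Sequence


def interleave(words: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Return [tag_1, word_1, tag_2, word_2, ...]: one tag before each word."""
    if len(words) != len(tags):
        raise ValueError("exactly one tag is required per word")
    stream: List[str] = []
    for tag, word in zip(tags, words):
        stream.append(tag)
        stream.append(word)
    return stream


def dummy_tags(words: Sequence[str], symbol: str = "<TAG>") -> List[str]:
    """Dummy annotation: the same uninformative symbol for every word."""
    return [symbol] * len(words)


def strip_tags(stream: Sequence[str]) -> List[str]:
    """Recover the words from an interleaved stream (assumes strict tag/word alternation)."""
    return list(stream[1::2])


# Hypothetical example sentence and tag strings (illustrative only).
words = ["the", "cats", "sleep"]
pos = ["DET", "NOUN", "VERB"]
msd = ["DET", "NOUN|Number=Plur", "VERB|Number=Plur|Tense=Pres"]

pos_stream = interleave(words, pos)
msd_stream = interleave(words, msd)
dummy_stream = interleave(words, dummy_tags(words))
```

When tags are interleaved on the target side, the system output contains tags as well as words, so something like `strip_tags` would be needed before scoring the translation; the strict alternation it relies on is an assumption of this sketch.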

Paper Structure (18 sections, 4 figures, 5 tables)

Introduction
Interleaving in neural machine translation
Experimental settings
Corpora.
Translation models.
Error classification.
Results and discussion
Translation into a morphologically rich language.
Translation from a morphologically rich language.
Large-scale training data.
Main findings.
Error analysis
SL tags.
TL tags.
Differences between architectures.
...and 3 more sections

Figures (4)

Figure 1: For language pairs with English as SL, relative changes in the number of errors for each error category, training corpus size and type of interleaved tag.
Figure 2: For language pairs with English as TL, relative changes in the number of errors for each error category, training corpus size and type of interleaved tag.
Figure 3: For language pairs with English as SL, tag prediction accuracy (labelled as Tags) and surface form prediction accuracy (labelled as S.F.) forcing, respectively, surface forms and tags from the reference.
Figure 4: POS prediction accuracy (labelled as POS) and surface form prediction accuracy (labelled as S.F.) forcing, respectively, surface forms and POS tags from the reference.
