Predicting Anchored Text from Translation Memories for Machine Translation Using Deep Learning Methods

Richard Yue, John E. Ortega

TL;DR

The paper tackles the problem of predicting anchored words during fuzzy-match repair in computer-aided translation, focusing on single-word gaps between known contextual tokens in translation memories. It evaluates four approaches on a large European parliamentary corpus of 393,371 source-language/target-language (SL-TL) pairs: neural MT via OpenNMT, DistilBERT masked language modeling (MLM), Word2Vec CBOW, and GPT-4 prompting. Results indicate that BERT-based predictions yield the highest character-level accuracy and anchored-word coverage across fuzzy-match thresholds, that GPT-4 remains competitive, and that MT on anchored trigrams often outperforms MT on full segments for anchoring tasks. The work demonstrates practical potential to augment CAT workflows with LM-based anchored-word predictions and offers guidance on integrating them into translation-memory-driven pipelines.

Abstract

Translation memories (TMs) are the backbone of professional translation tools called computer-aided translation (CAT) tools. To perform a translation using a CAT tool, a translator uses the TM to gather translations similar to the segment to be translated (s'). Many CAT tools offer a fuzzy-match algorithm to locate segments (s) in the TM that are close in distance to s'. After locating two similar segments, the CAT tool presents parallel segments (s, t) that contain one segment in the source language along with its translation in the target language. Additionally, CAT tools contain fuzzy-match repair (FMR) techniques that automatically use the parallel segments from the TM to create new TM entries containing a modified version of the original, intended to be the translation of s'. Most FMR techniques use machine translation as a way of "repairing" those words that have to be modified. In this article, we show that for a large part of those words which are anchored, we can use other techniques based on machine learning approaches such as Word2Vec, BERT, and even ChatGPT. Specifically, we show that for anchored words that follow the continuous bag-of-words (CBOW) paradigm, Word2Vec, BERT, and GPT-4 can be used to achieve similar and, in some cases, better results than neural machine translation for translating anchored words from French to English.
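The two ideas in the abstract, fuzzy matching and anchored words, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuzzy_match_score` and `anchored_gaps` are hypothetical, and the word-level similarity ratio from Python's standard `difflib` module is only an assumed stand-in for whatever fuzzy-match score a given CAT tool computes. The sketch treats a word as anchored when it is the only differing word between two matching context words.

```python
from difflib import SequenceMatcher


def fuzzy_match_score(s_prime: str, s: str) -> float:
    """Word-level similarity in [0, 1].

    Assumption: difflib's ratio stands in for a CAT tool's fuzzy-match score.
    """
    return SequenceMatcher(None, s_prime.split(), s.split(), autojunk=False).ratio()


def anchored_gaps(s_prime: str, s: str) -> list[tuple[str, str, str]]:
    """Return (left, word, right) trigrams from s_prime in which exactly one
    word differs from s and both neighbouring words match s."""
    old, new = s.split(), s_prime.split()
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    gaps = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        single_word_swap = tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1
        has_both_neighbours = j1 > 0 and j2 < len(new)
        if single_word_swap and has_both_neighbours:
            # Opcodes alternate with "equal" blocks, so both neighbours are matched words.
            gaps.append((new[j1 - 1], new[j1], new[j2]))
    return gaps


if __name__ == "__main__":
    s = "the cat sat on the mat"          # segment found in the TM
    s_prime = "the cat slept on the mat"  # new segment to translate
    print(fuzzy_match_score(s_prime, s))
    print(anchored_gaps(s_prime, s))
```

Each returned trigram gives the context pair (left, right) that a CBOW-style model, a masked language model, or a prompt could use to predict the middle word, in place of a full machine translation of the segment.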

Paper Structure (15 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: An illustration of predicting a word given the context around it (denoted as anchored words in this article), called Continuous Bag of Words (CBOW) by Mikolov et al. (2013).
  • Figure 2: Average character match (y-axis) by segment-level fuzzy-match threshold in percent (x-axis) for the four experimental approaches: BERT, GPT, Word2Vec, and neural machine translation (two systems, Neural Machine Translation 1 and Neural Machine Translation 2).