Prediction of Translation Techniques for the Translation Process

Fan Zhou, Vincent Vandeghinste

TL;DR

This work investigates predicting human translation techniques to guide machine translation under two workflows: from-scratch translation and post-editing. It builds an English-Chinese dataset of over 100,000 aligned units labeled with translation techniques and tests four encoder-based architectures using cross-lingual models (mBERT, mBART, mT5) to forecast the most suitable techniques. The results show 82% predictive accuracy for from-scratch translation and 93% for post-editing, indicating strong potential for pre-translation guidance and for prompting large language models. Limitations include the focus on pre-translation phases without directly integrating techniques into decoders, and challenges in data availability and in automating sub-sentence alignment. Future work aims to incorporate techniques into the decoder and automate alignment to scale the approach.

Abstract

Machine translation (MT) encompasses a variety of methodologies aimed at enhancing the accuracy of translations. In contrast, the process of human-generated translation relies on a wide range of translation techniques, which are crucial for ensuring linguistic adequacy and fluency. This study suggests that these translation techniques could further optimize machine translation if they are automatically identified before being applied to guide the translation process effectively. The study differentiates between two scenarios of the translation process: from-scratch translation and post-editing. For each scenario, a specific set of experiments has been designed to forecast the most appropriate translation techniques. The findings indicate that the predictive accuracy for from-scratch translation reaches 82%, while the post-editing process exhibits even greater potential, achieving an accuracy rate of 93%.

Paper Structure

This paper contains 16 sections, 1 equation, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Comparison of English-Chinese Translation Examples: Machine Translation vs. Human Translation
  • Figure 2: The experiments use four distinct architectures: (a) and (b) are designed for from-scratch translation, while (c) and (d) are tailored to post-editing. The input data is structured in two formats, Input1 and Input2, detailed in Figure 3. The 'n' in the Softmax-n layer is the number of categories. 'TT' denotes a specific translation technique.
  • Figure 3: Data Input Formats
  • Figure 4: Heatmap of translation technique predictions. The count for each translation technique is averaged over the 3 models in each architecture. Off-diagonal cells along the X-axis represent false positives, and off-diagonal cells along the Y-axis represent false negatives.
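
The false-positive and false-negative reading of the Figure 4 heatmap can be illustrated with a minimal sketch. It assumes that rows index the true technique and columns the predicted one (the caption does not spell out the orientation). The class indices, toy predictions, and function names below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def confusion_matrix(true_labels: Sequence[int],
                     predicted_labels: Sequence[int],
                     n_classes: int) -> Matrix:
    """Count (true, predicted) pairs; rows = true class, columns = predicted class."""
    matrix = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def average_matrices(matrices: Sequence[Matrix]) -> Matrix:
    """Average several same-sized confusion matrices cell by cell."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]


def false_positives(matrix: Matrix, k: int) -> float:
    """Off-diagonal sum of column k: items predicted as k whose true class differs."""
    return sum(matrix[i][k] for i in range(len(matrix)) if i != k)


def false_negatives(matrix: Matrix, k: int) -> float:
    """Off-diagonal sum of row k: items of true class k predicted as something else."""
    return sum(matrix[k][j] for j in range(len(matrix)) if j != k)


# Toy data (assumption): 3 placeholder technique classes TT0..TT2 and
# predictions from 3 hypothetical models on the same 6 aligned units.
TRUE = [0, 0, 1, 1, 2, 2]
MODEL_PREDICTIONS = [
    [0, 1, 1, 1, 2, 0],
    [0, 0, 1, 2, 2, 2],
    [0, 1, 1, 1, 2, 2],
]

averaged = average_matrices(
    [confusion_matrix(TRUE, preds, n_classes=3) for preds in MODEL_PREDICTIONS]
)

for k in range(3):
    print(f"TT{k}: FP={false_positives(averaged, k):.2f} "
          f"FN={false_negatives(averaged, k):.2f}")
```

Averaging the matrices before reading off the off-diagonal cells mirrors the caption's "averaged over the 3 models" step. If the figure uses the opposite orientation, the two helper functions simply swap roles.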