Translation of Multifaceted Data without Re-Training of Machine Translation Systems

Hyeonseok Moon, Seungyoon Lee, Seongtae Hong, Seungjun Lee, Chanjun Park, Heuiseok Lim

TL;DR

This paper tackles the problem of translating multifaceted data points while preserving intra-data relations, which are often lost when components are translated separately. It introduces a relation-aware translation pipeline that concatenates data components into a single sequence augmented with Indicator Tokens (IT) and Catalyst Statements (CS), enabling reversible translation and explicit modeling of inter-component relations without re-training MT systems. Empirical results on XNLI, Web Page Ranking, and Question Generation in multiple languages show that IT+CS improves both the quality of the translated data and its effectiveness as training data, surpassing conventional component-wise translation in downstream tasks. The findings highlight the importance of intra-data relations for multilingual data translation and offer a practical, model-agnostic approach to generating high-quality multilingual training resources.

Abstract

Translating major-language resources to build minor-language resources has become a widely used approach. Particularly when translating complex data points composed of multiple components, it is common to translate each component separately. However, we argue that this practice often overlooks the interrelation between components within the same data point. To address this limitation, we propose a novel MT pipeline that considers the intra-data relation when applying MT to training data. In our MT pipeline, all the components in a data point are concatenated to form a single translation sequence and, after translation, decomposed back into the individual data components. We introduce a Catalyst Statement (CS) to enhance the intra-data relation, and an Indicator Token (IT) to assist the decomposition of a translated sequence into its respective data components. Through our approach, we achieve a considerable improvement in translation quality itself, along with its effectiveness as training data. Compared with the conventional approach that translates each data component separately, our method yields better training data that improves the performance of the trained model by 2.690 points on the web page ranking (WPR) task and by 0.845 points on the question generation (QG) task in the XGLUE benchmark.
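
The concatenate, translate, and decompose steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the indicator token `#`, the catalyst statement wording, and the function names `compose`, `decompose`, and `translate_data_point` are all assumptions chosen for illustration, and `translate` stands in for any off-the-shelf MT system that maps a string to a string.

```python
from typing import Callable, List, Optional

# Hypothetical choices for illustration only; the paper's actual
# indicator token and catalyst statement wording may differ.
INDICATOR_TOKEN = "#"
CATALYST_STATEMENT = "The following sentences are related."


def compose(components: List[str],
            it: str = INDICATOR_TOKEN,
            cs: str = CATALYST_STATEMENT) -> str:
    """Build one translation sequence: optional CS, then each component prefixed by the IT."""
    body = " ".join(f"{it} {c}" for c in components)
    return f"{cs} {body}" if cs else body


def decompose(translated: str,
              n_components: int,
              it: str = INDICATOR_TOKEN) -> Optional[List[str]]:
    """Split a translated sequence at the IT.

    Returns None when the number of ITs does not match the number of
    components, i.e. the data point could not be reversed.
    """
    parts = translated.split(it)
    # parts[0] is the translated catalyst statement (or an empty string).
    if len(parts) != n_components + 1:
        return None
    return [p.strip() for p in parts[1:]]


def translate_data_point(components: List[str],
                         translate: Callable[[str], str],
                         it: str = INDICATOR_TOKEN,
                         cs: str = CATALYST_STATEMENT) -> Optional[List[str]]:
    """Translate all components of one data point as a single sequence."""
    sequence = compose(components, it=it, cs=cs)
    return decompose(translate(sequence), len(components), it=it)


if __name__ == "__main__":
    # Stand-in "translator" that upper-cases text and happens to keep the IT.
    fake_mt = lambda s: s.upper()
    print(translate_data_point(["a premise", "a hypothesis"], fake_mt))
```

In this sketch a data point whose translation drops or duplicates the indicator token is returned as `None`, so the caller can discard it or fall back to component-wise translation. A component that already contains the indicator token would also be rejected, which is one reason the choice of token matters.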

Paper Structure

This paper contains 27 sections, 3 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Example of challenges in data translation
  • Figure 2: Relation-aware translation pipeline. To explain the overall process, we assume that each data point comprises two components: an input sentence and a label sentence. In this figure, (tr) denotes the corresponding translated unit.
  • Figure 3: Data reversibility per NMT model and target dataset. For each data point, we create a single sequence by concatenating data components with a '#' symbol and examine the preservation rate of '#' in the translated sequence.
  • Figure 4: Reversibility after translation for IT and CS variants.
  • Figure 5: LLM evaluation results. We prompt ChatGPT to provide a quality score on a 0-5 scale for each data point. The y-axis shows the number of instances whose score falls within the range given on the x-axis.
  • ...and 7 more figures