PAD: Towards Efficient Data Generation for Transfer Learning Using Phrase Alignment

Jong Myoung Kim, Young-Jun Lee, Ho-Jin Choi, Sangkeun Jung

TL;DR

The paper tackles data scarcity for Korean NLP by leveraging abundant English resources. It introduces PAD, a phrase-aligned data generation method using SMT phrase alignment to create Korean-expressive training instances from English. Experiments show PAD improves transfer-learning efficiency and often matches or approaches native Korean or high-quality translated data while reducing cost. PAD complements existing data-construction methods and can leverage English abundance to augment resource-scarce languages, offering a practical, scalable baseline for industry use.

Abstract

Transfer learning leverages the abundance of English data to address the scarcity of resources in modeling non-English languages, such as Korean. In this study, we explore the potential of Phrase Aligned Data (PAD) from standardized Statistical Machine Translation (SMT) to enhance the efficiency of transfer learning. Through extensive experiments, we demonstrate that PAD synergizes effectively with the syntactic characteristics of the Korean language, mitigating the weaknesses of SMT and significantly improving model performance. Moreover, we reveal that PAD complements traditional data construction methods and enhances their effectiveness when combined. This innovative approach not only boosts model performance but also suggests a cost-efficient solution for resource-scarce languages.

Paper Structure

This paper contains 53 sections, 3 figures, 18 tables.

Figures (3)

  • Figure 1: Benchmark performance comparison by Korean-English mixed sentence ratios.
  • Figure 2: Example of looping in translations using mT5. The word "완전히" ("totally" or "completely") was repeated countless times.
  • Figure 3: Example of a mixed Korean-English sentence. When the SMT system encounters an input that does not exist in its learned probability table, it outputs the expression unchanged in the source language. The illustration shows the expression "as soon as possible" left in English because the table holds no information about this phrase.
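
The pass-through behaviour described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the paper's SMT pipeline: it assumes a toy phrase table with hypothetical entries, a greedy longest-match lookup, and no reordering or probability scoring. The names `PHRASE_TABLE` and `translate_with_passthrough` are invented for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical toy phrase table (English phrase -> Korean phrase).
# The entries are illustrative assumptions, not taken from the paper.
PHRASE_TABLE: Dict[Tuple[str, ...], str] = {
    ("please",): "부디",
    ("reply",): "답장해 주세요",
    ("thank", "you"): "감사합니다",
}


def translate_with_passthrough(
    tokens: List[str],
    table: Dict[Tuple[str, ...], str],
    max_len: int = 4,
) -> Tuple[List[str], List[str]]:
    """Greedy longest-match lookup; unknown tokens are copied unchanged.

    Returns the output segments and the list of tokens that were
    passed through in the source language.
    """
    output: List[str] = []
    passed_through: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                output.append(table[key])
                i += n
                break
        else:
            # No table entry covers this token: emit it as-is.
            output.append(tokens[i])
            passed_through.append(tokens[i])
            i += 1
    return output, passed_through


if __name__ == "__main__":
    source = "please reply as soon as possible".split()
    segments, untranslated = translate_with_passthrough(source, PHRASE_TABLE)
    print(" ".join(segments))
    print("passed through:", untranslated)
```

Because the toy table has no entry covering "as soon as possible", those four tokens are emitted in English, giving a mixed Korean-English output of the kind the caption describes.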