Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation

Seunghyun Ji, Hagai Raja Sinulingga, Darongsae Kwon

TL;DR

Low-resource MT faces domain mismatch and data scarcity, which motivates unsupervised data selection to identify training-efficient data. The paper proposes Capturing Perplexing Named Entities (PerEnts), an unsupervised data-selection method that scores each sentence by the maximum inference entropy among the named-entity tokens of its machine translation, using a pre-trained MT model and a target-language NER model. Across four Korean–English domain datasets (Medical, Travel, Law, Sports), with IA3 fine-tuning of NLLB-1.3B, PerEnts achieves the strongest BLEU among unsupervised MDSs (≈34.09) along with competitive ChrF++ and COMET scores, indicating robust guidance for data selection across domains. The findings suggest that prioritizing perplexing named entities during domain adaptation reduces labeling costs while improving translation quality in specialized domains, and they motivate further theoretical analysis of memorizable patterns and generalization.

Abstract

Low-resourced data presents a significant challenge for neural machine translation. In most cases, the low-resourced environment is caused by high costs due to the need for domain experts or the lack of language experts. Therefore, identifying the most training-efficient data within an unsupervised setting emerges as a practical strategy. Recent research suggests that such effective data can be identified by selecting 'appropriately complex data' based on its volume, providing strong intuition for unsupervised data selection. However, we have discovered that establishing criteria for unsupervised data selection remains a challenge, as the 'appropriate level of difficulty' may vary depending on the data domain. We introduce a novel unsupervised data selection method named 'Capturing Perplexing Named Entities,' which leverages the maximum inference entropy in translated named entities as a metric for selection. When tested with the 'Korean-English Parallel Corpus of Specialized Domains,' our method served as robust guidance for identifying training-efficient data across different domains, in contrast to existing methods.
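
As a rough illustration of the selection metric described above, the following sketch computes the maximum entropy over the named-entity tokens of one translated sentence. It is not the authors' implementation: the function names, the input layout (one probability distribution per generated token plus a boolean entity mask), the use of natural-log Shannon entropy, and the fallback score of 0.0 for sentences without named entities are all assumptions made for illustration. In practice the distributions would come from the MT model's decoder and the mask from the NER model.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one output-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entity_entropy(
    token_dists: Sequence[Sequence[float]],
    is_entity: Sequence[bool],
    default: float = 0.0,
) -> float:
    """Maximum entropy among tokens flagged as named-entity tokens.

    token_dists: one probability distribution per translated token
                 (hypothetical layout; assumed to come from the MT model).
    is_entity:   True where the NER model tagged the token as an entity.
    default:     assumed fallback when the sentence has no entity tokens.
    """
    entropies = [
        token_entropy(dist)
        for dist, flag in zip(token_dists, is_entity)
        if flag
    ]
    return max(entropies) if entropies else default


# Toy example: three tokens, only the second is a named-entity token.
dists = [[0.9, 0.1], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
mask = [False, True, False]
score = max_entity_entropy(dists, mask)
```

Only the entity token contributes here, so the third token's higher entropy is ignored. Taking the maximum rather than the mean means a single uncertain entity token is enough to give a sentence a high score.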

Paper Structure (19 sections, 3 figures, 5 tables)

This paper contains 19 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: A diagram of our method, which uses a pre-trained multilingual model for machine translation and a named entity recognition model fine-tuned on the target language. Our method comprises three steps: 1) capturing named entity tokens in the machine-translated sentences, 2) calculating the inference entropy of those tokens, and 3) using the maximum entropy value as a measure for selection.
  • Figure 2: Pseudocode for the experiment data preparation. We sorted the data by each MDS value and split it into 4 segments. Then, we sampled 2,000 sentences from each segment for fine-tuning.
  • Figure 3: The scores for each segment index across the four domains. The best BLEU score among the segment indices is marked with a black star. Experimental results demonstrate that our method consistently identified the most training-efficient data by selecting the highest segment (3), whereas other methods varied by data domain.
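
The data preparation summarized in the Figure 2 caption (sort by score, split into 4 segments, sample 2,000 sentences per segment) can be sketched as follows. This is a minimal sketch, not the paper's pseudocode: the function name `split_and_sample`, the ascending sort (so that segment 3 holds the highest scores, in line with the Figure 3 caption), the equal-size split, and uniform sampling without replacement under a fixed seed are all assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_and_sample(
    scored: Sequence[Tuple[str, float]],
    n_segments: int = 4,
    sample_size: int = 2000,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Sort sentences by score, cut into equal segments, sample from each.

    scored: (sentence, selection score) pairs; the score is whatever
            selection metric is being evaluated.
    Returns a mapping from segment index (0 = lowest scores) to the
    sampled sentences of that segment.
    """
    ranked = sorted(scored, key=lambda pair: pair[1])  # ascending (assumed)
    rng = random.Random(seed)
    segments: Dict[int, List[str]] = {}
    for idx in range(n_segments):
        start = idx * len(ranked) // n_segments
        end = (idx + 1) * len(ranked) // n_segments
        sentences = [sentence for sentence, _ in ranked[start:end]]
        # Guard for segments smaller than the requested sample size.
        k = min(sample_size, len(sentences))
        segments[idx] = rng.sample(sentences, k)
    return segments


# Toy example: 40 scored sentences, 5 samples per segment.
toy = [(f"s{i}", float(i)) for i in range(40)]
segs = split_and_sample(toy, sample_size=5)
```

Each sampled segment would then be used for a separate fine-tuning run, so that scores can be compared across segment indices as in Figure 3.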