LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation

Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, Yue Zhang

TL;DR

This paper presents LexMatcher, a simple yet effective data curation method whose design is driven by the coverage of senses found in bilingual dictionaries. With LLaMA2 as the base model, it outperforms established baselines on the WMT2022 test sets and performs strongly on word sense disambiguation and specialized terminology translation.

Abstract

The fine-tuning of open-source large language models (LLMs) for machine translation has recently received considerable attention, marking a shift towards data-centric research from traditional neural machine translation. However, the area of data collection for instruction fine-tuning in machine translation remains relatively underexplored. In this paper, we present LexMatcher, a simple yet effective method for data curation, the design of which is driven by the coverage of senses found in bilingual dictionaries. The construction process comprises data retrieval from an existing corpus and data augmentation that supplements the infrequent senses of polysemous words. Utilizing LLaMA2 as our base model, our approach outperforms the established baselines on the WMT2022 test sets and also exhibits remarkable performance in tasks related to word sense disambiguation and specialized terminology translation. These results underscore the effectiveness of LexMatcher in enhancing LLM-based machine translation. The code, data, and models are available at https://github.com/ARIES-LM/Lexmatcher-MT.git.
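
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_by_sense_coverage`, the whitespace tokenization, the representation of a dictionary sense as a (source word, target word) pair, and the cap `k` on examples per sense are all assumptions made for illustration. The idea shown is only that a sentence pair is kept when it covers a dictionary sense that still has fewer than `k` examples, and that senses left uncovered are candidates for the augmentation step.

```python
from collections import Counter


def select_by_sense_coverage(corpus, dictionary, k=1):
    """Keep sentence pairs that cover under-represented dictionary senses.

    corpus:     iterable of (source_sentence, target_sentence) strings
    dictionary: set of (source_word, target_word) pairs, one pair per sense
                (hypothetical format, assumed for this sketch)
    k:          maximum number of examples wanted per sense (assumed cap)
    """
    counts = Counter()
    selected = []
    for src, tgt in corpus:
        # Naive lowercase whitespace tokenization; a real system would
        # likely use proper tokenization and lemmatization.
        src_tokens = set(src.lower().split())
        tgt_tokens = set(tgt.lower().split())
        # A sense matches when its source word occurs in the source
        # sentence and its target word occurs in the target sentence.
        matched = [
            (s, t) for (s, t) in dictionary
            if s in src_tokens and t in tgt_tokens
        ]
        # Keep the pair only if it adds coverage for some sense
        # that has been seen fewer than k times so far.
        if any(counts[sense] < k for sense in matched):
            selected.append((src, tgt))
            for sense in matched:
                counts[sense] += 1
    # Senses with no matching pair would need augmented data.
    uncovered = sorted(sense for sense in dictionary if counts[sense] == 0)
    return selected, uncovered
```

With a small `k`, frequent senses stop contributing new sentence pairs quickly, while a pair containing a rare sense is still retained.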

Paper Structure (28 sections, 3 equations, 6 figures, 8 tables, 1 algorithm)

Figures (6)

  • Figure 1: Illustration of our LexMatcher for instruction fine-tuning smaller LLMs (e.g., LLaMA).
  • Figure 2: Zero-shot translation.
  • Figure 3: BLEU and COMET on the WMT22 test sets with varying $K$ and model sizes.
  • Figure 4: Performance of different data selection strategies.
  • Figure 5: Word frequency distributions. The blue and gray curves denote the distributions calculated on the data selected by LexMatcher ($K=1$) and randomly selected data, respectively.
  • ...and 1 more figure