Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Mohammad Amin Ghanizadeh, Mohammad Javad Dousti

TL;DR

The paper tackles data efficiency in MT fine-tuning by introducing an online data selection method based on a learnability score $s^{learn}(B|\theta,\theta^*) = s^{hard}(B,\theta) + s^{easy}(B,\theta^*)$, where $s^{hard}(B,\theta) = - H_{\theta}(B_{src}) H_{\theta}(B_{trg})$ and $s^{easy}(B,\theta^*) = H_{\theta^*}(B_{src}) H_{\theta^*}(B_{trg})$. Embeddings from the learner and a fixed reference model form a learnability matrix over super-batches (e.g., $2048\times1024$), which guides an iterative sub-batch selection used to update the MT model. The approach yields up to fivefold data efficiency versus iid training, smoother loss trajectories, and improved generalization across 12 translation directions when fine-tuning an mBART model on CCMatrix, with embedding caching reducing relative FLOPs. This data-driven batching strategy is particularly advantageous in low-resource or noisy data regimes and demonstrates robust gains across multilingual MT tasks.
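
The scoring and selection steps summarized above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes that $H_{\theta}(B_{src})$ and $H_{\theta}(B_{trg})$ are per-sentence embedding matrices and that their product is taken as source-by-target dot products, and it replaces the paper's selection procedure with a simple greedy, chunked top-k rule. The function names `learnability_matrix` and `select_sub_batch` and the parameter `n_chunks` are hypothetical.

```python
import numpy as np


def learnability_matrix(src_learner, trg_learner, src_ref, trg_ref):
    """Combine learner and reference embeddings into one score matrix.

    Each argument is an (n, d) array of sentence embeddings (an assumption).
    Entry (i, j) pairs source sentence i with target sentence j.
    """
    s_hard = -(src_learner @ trg_learner.T)  # learner term, negative sign as in the TL;DR
    s_easy = src_ref @ trg_ref.T             # reference term
    return s_hard + s_easy


def select_sub_batch(scores, batch_size, n_chunks=4):
    """Greedily pick `batch_size` indices from an (n, n) score matrix.

    Assumed simplification: the sub-batch is built in `n_chunks` rounds, and
    each round takes the top-scoring candidates conditioned on the examples
    already chosen (own diagonal score plus cross terms with the chosen set).
    """
    n = scores.shape[0]
    if not 0 < batch_size <= n:
        raise ValueError("batch_size must be in 1..n")
    chosen = np.zeros(n, dtype=bool)
    # Split batch_size into n_chunks round sizes that sum to batch_size.
    sizes = [len(c) for c in np.array_split(np.arange(batch_size), n_chunks)]
    for size in sizes:
        if size == 0:
            continue
        cond = np.diag(scores).copy()
        if chosen.any():
            # Interaction of every candidate with the already selected examples.
            cond += scores[:, chosen].sum(axis=1) + scores[chosen, :].sum(axis=0)
        cond[chosen] = -np.inf  # never pick the same example twice
        top = np.argpartition(-cond, size - 1)[:size]
        chosen[top] = True
    return np.flatnonzero(chosen)


# Toy usage with random embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
n, d = 16, 8
src_l, trg_l, src_r, trg_r = (rng.normal(size=(n, d)) for _ in range(4))
matrix = learnability_matrix(src_l, trg_l, src_r, trg_r)
sub_batch = select_sub_batch(matrix, batch_size=8)
```

In this toy setup the selected indices would then be used to draw the sub-batch for one gradient step. Because the reference model is fixed, its embeddings can be computed once and reused, which is the idea behind the embedding caching mentioned above.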

Abstract

Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by 24% when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to random selection.

Paper Structure

This paper contains 12 sections, 3 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Our proposed method diagram for data selection in machine translation
  • Figure 2: Comparison between our approach and independent and identically distributed (iid) training using BLEU and COMET-22 metrics on the filtered dataset.
  • Figure 3: (a, b) Comparison of our approach with iid training and individual sample training methods using BLEU and COMET-22 metrics on the unfiltered dataset. (c) Batch selection is robust to overfitting on noisy data, especially in the early stages of training. (d) Comparison of batch selection and iid on Arabic $\leftrightarrow$ English and Hindi $\leftrightarrow$ English. Each line represents the average of both to and from English directions for each language.
  • Figure 4: Comparison of our approach against iid training on German $\leftrightarrow$ English, French $\leftrightarrow$ English and Finnish $\leftrightarrow$ English. Each line represents the average of both to and from English directions for each language.
  • Figure 5: We utilize a smaller model as a reference model, apply quantization to it, and demonstrate superior performance compared to iid.
  • ...and 4 more figures