Optimizing Low-Resource Language Model Training: Comprehensive Analysis of Multi-Epoch, Multi-Lingual, and Two-Stage Approaches

Kosuke Akimoto, Masafumi Oyamada

TL;DR

This paper exhaustively explores training setups for low-resource language LLMs that combine multi-epoch, multi-lingual, and two-stage training, and reports insights that reduce the cost of hyperparameter search.

Abstract

In this paper, we address the challenge of optimizing training setups for Large Language Models (LLMs) in low-resource languages, where only a limited corpus is available. Existing works adopt multi-epoch, multi-lingual, and two-stage training to use the limited target language corpus efficiently. However, the optimal hyperparameter setups for combining these three approaches to train LLMs are still poorly understood. We exhaustively explore training setups for low-resource language LLMs that combine these three approaches, and find the following insights for efficiently reducing the cost of hyperparameter search: (1) As the amount of target language corpus decreases, the optimal training approach shifts from monolingual single-stage training to multi-lingual two-stage training at a compute-budget-dependent threshold. (2) The optimal model scale remains stable regardless of the amount of target language corpus, allowing the use of the compute-optimal scale of monolingual training. (3) The optimal number of epochs can be extrapolated from smaller-scale experiments to larger scales using our proposed model. We also provide evidence that, in single-stage training, the target language validation loss follows a power law with respect to the target language ratio, with an exponent independent of the amount of data, model scale, and language pair.
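The power-law claim at the end of the abstract can be illustrated with a small fitting sketch. The functional form used here, $L(r) = A\,r^{-\beta}$, is an assumption made for illustration and may differ from the paper's actual model; the function name `fit_power_law` and the data points are hypothetical and synthetic, not results from the paper.

```python
import numpy as np


def fit_power_law(ratios, losses):
    """Fit loss ~= A * r**(-beta) by least squares in log-log space.

    Assumed form for illustration only; the paper's own equation for the
    target language ratio r may include additional terms.
    """
    log_r = np.log(np.asarray(ratios, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # A straight line in log-log space: log L = log A - beta * log r
    slope, intercept = np.polyfit(log_r, log_l, deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic example: losses generated from A = 3.0, beta = 0.1 (made-up values)
ratios = np.array([0.125, 0.25, 0.5, 1.0])
losses = 3.0 * ratios ** (-0.1)
A, beta = fit_power_law(ratios, losses)
```

If the exponent really is independent of data amount, model scale, and language pair, fits like this on different (model scale, token count) groups should return similar values of `beta`, which would show up as parallel lines on a log-log plot.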

Paper Structure

This paper contains 19 sections, 6 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Changes in minimum target language validation loss with respect to target language corpus size $D_T$ (x-axis) for each training approach. The red dashed vertical line represents the estimated compute-optimal amount of corpus $D^*(C)$ for monolingual target language training (see §\ref{subsec:optimal_training_approach}). For small $D_T$ (low-resource setting), the optimal training approach switches from monolingual single-stage training (mono 1-stage) to multi-lingual two-stage training (multi 2-stage) at the threshold $\frac{D^*(C)}{8}$.
  • Figure 2: Changes in minimum target language validation loss with respect to target language corpus size $D_T$ for each model scale $M$ (Multi 2-stage, (Japanese, English) pair). Each line represents a different model scale, and each star represents the estimated compute-optimal model scale $M^*(C)$ of monolingual training in the target language. Across a wide range of $D_T$, the model scale achieving the minimum validation loss remains relatively unchanged from $M^*(C)$.
  • Figure 3: (Left) Changes in minimum target language validation loss $L^*(C, D_T,k)$ with respect to number of epochs $k$ (x-axis) for each target language corpus size $D_T$ (color). Each dashed line represents a quadratic function of $f_k=\log_2 k$ fitted to estimate $L^*(C,D_T,k)$. The red solid line connects the minimum points of the fitted curves. (Middle, Right) Changes in the estimated optimal number of epochs $k^*$ with respect to target language corpus size $D_T$ (x-axis) for each computational budget $C$ (color). Solid lines represent the estimates obtained through quadratic function fitting. Each point represents the optimal value $\arg\min_{k} L^*(C,f_D,k)$ obtained without fitting. The dashed lines show the predicted optimal values $k^*$ for $C=C_0$, extrapolated using the proposed model of Equation (\ref{eq:fk_star_model}).
  • Figure 4: Changes in target language validation loss with respect to the target language ratio $r$ (x-axis). Each solid line connects results corresponding to the same pair of model scale $M$ (color) and total amount of training tokens $D$ (marker size). The red dashed lines illustrate the slope $\beta$ obtained by fitting the validation loss with Equation (\ref{eq:r_model}). The figure shows that changes in validation loss due to changes in $r$ follow similar slopes $\beta$, regardless of $M$, $D$, and language pair.
  • Figure 5: Minimum validation loss on the target language achieved by different LLM training approaches across various amounts of the available target language corpus $D_T$. Each panel represents a different pair of computational budget $C$ and language pair. The red dashed vertical lines represent the estimated compute-optimal amount of target language corpus $D^*(C)$ for $C$.
  • ...and 5 more figures
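
The quadratic fit described in the Figure 3 caption can be sketched as follows: fit the validation loss as a quadratic in $f_k=\log_2 k$ and take the vertex of the parabola as the estimated optimal number of epochs $k^*$. This is a minimal sketch, not the authors' implementation; the function name `estimate_optimal_epochs` and the loss values are hypothetical and synthetic.

```python
import numpy as np


def estimate_optimal_epochs(epochs, losses):
    """Estimate k* by fitting loss as a quadratic in f_k = log2(k).

    Illustrative sketch only: the minimum of the fitted parabola is
    mapped back from log2 space to an epoch count.
    """
    f_k = np.log2(np.asarray(epochs, dtype=float))
    # np.polyfit returns coefficients from highest degree to lowest
    a, b, _c = np.polyfit(f_k, np.asarray(losses, dtype=float), deg=2)
    if a <= 0:
        # A parabola opening downward (or a line) has no interior minimum
        raise ValueError("fitted curve has no minimum")
    f_star = -b / (2.0 * a)  # vertex of the parabola in log2(k) space
    return float(2.0 ** f_star)


# Synthetic example: loss is minimized at k = 4, i.e. log2(k) = 2 (made-up values)
epochs = np.array([1, 2, 4, 8, 16])
losses = 0.05 * (np.log2(epochs) - 2.0) ** 2 + 2.5
k_star = estimate_optimal_epochs(epochs, losses)
```

The returned value is generally not an integer, so in practice it would be rounded or compared against the grid of epoch counts actually tried, as with the unfitted points shown in the middle and right panels.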