Dynamic data sampler for cross-language transfer learning in large language models

Yudong Li, Yuhao Feng, Wen Zhou, Zhe Zhao, Linlin Shen, Cheng Hou, Xianxu Hou

TL;DR

ChatFlow tackles the challenge of training Chinese LLMs by transferring knowledge from an English foundation model through a cross-language, mixed-data curriculum. A curriculum-inspired dynamic data sampler smoothly shifts from English pre-training to bilingual pre-training and instruction-tuning, with the transition completed over $T_{grow}=5M$ samples. Starting from LLaMA-2-7B with an expanded Chinese vocabulary and trained on ~50GB of data, ChatFlow outperforms other Chinese models post-trained on the same base model on Chinese benchmarks (C-Eval, CMMLU, GAOKAO) while preserving English ability as measured by MMLU, though it still trails top commercial models on some human-evaluation metrics. The work provides a reproducible baseline for cross-language transfer using public data and released code/weights, highlighting a cost-effective path for languages with limited corpora.

Abstract

Large Language Models (LLMs) have gained significant attention in the field of natural language processing (NLP) due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, because large-scale corpora and the requisite computing resources are difficult to acquire. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese language models in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continue training the LLaMA2 model, aiming to align cross-language representations and facilitate knowledge transfer to the Chinese language model. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
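To make the idea of a progressive transition concrete, here is a minimal sketch of what a dynamic data sampler of this kind might look like. It is not the authors' implementation: the linear ramp, the function names `sft_probability` and `sample_source`, and the simplification to two data pools ("pretrain" and "instruction") are all assumptions made for illustration. Only the transition length of 5M samples comes from the summary above.

```python
import random

# Transition length in samples, taken from the summary (T_grow = 5M).
T_GROW = 5_000_000


def sft_probability(step: int, t_grow: int = T_GROW) -> float:
    """Probability of drawing a supervised (instruction) sample at `step`.

    ASSUMPTION: a linear ramp from 0 to 1 over `t_grow` samples. The
    schedule actually used in the paper may have a different shape.
    """
    return min(max(step / t_grow, 0.0), 1.0)


def sample_source(step: int, rng: random.Random, t_grow: int = T_GROW) -> str:
    """Pick which (hypothetical) data pool the next sample is drawn from."""
    if rng.random() < sft_probability(step, t_grow):
        return "instruction"
    return "pretrain"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 1_250_000, 2_500_000, 5_000_000):
        draws = [sample_source(step, rng) for _ in range(10_000)]
        share = draws.count("instruction") / len(draws)
        print(f"step={step:>9,}  instruction share ~ {share:.2f}")
```

Under this assumed schedule, early batches are almost entirely unsupervised pre-training data, and the instruction share grows steadily until every sample is supervised once `T_GROW` samples have been seen.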

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Training loss over the number of trained tokens in the ablation study.
  • Figure 2: Evaluation metrics over trained tokens in the ablation study. The blue line shows the model's performance on the English benchmark MMLU, and the red line shows its performance on the Chinese benchmark C-Eval.
  • Figure 3: Win rate for all models in non-tie matches. ChatFlow ranks 5th among the 7B models.