Dynamics of Instruction Fine-Tuning for Chinese Large Language Models

Chiyu Song, Zhanchao Zhou, Jianhao Yan, Yuejiao Fei, Zhenzhong Lan, Yue Zhang

TL;DR

This work investigates instruction-tuning scaling for Chinese LLMs using a new DoIT dataset of over 40,000 high-quality instruction instances across ten abilities. By systematically varying data quantity, model size (7B–33B), and data construction methods, the authors reveal that different abilities scale differently and identify two interpretable predictors, Complexity and Transference, that forecast scaling sensitivity even in low-resource settings. The findings enable targeted training strategies to efficiently learn specific abilities and improve performance on CMMLU and AGIEval, with extensive experiments across multiple backbones (e.g., Chinese-LLaMA, Qwen1.5, Baichuan2) and a large set of model checkpoints. The work also demonstrates that synthetic data offers limited benefits and provides open-source resources to support reproducibility and further research in instruction tuning for non-English languages.

Abstract

Instruction tuning is a burgeoning method to elicit the general intelligence of Large Language Models (LLMs). While numerous studies have examined the impact of factors such as data volume and model size on English models, the scaling properties of instruction tuning in other languages remain largely unexplored. In this work, we systematically investigate the effects of data quantity, model size, and data construction methods on instruction tuning for Chinese LLMs. We utilize a newly curated dataset, DoIT, which includes over 40,000 high-quality instruction instances covering ten underlying abilities, such as creative writing, code generation, and logical reasoning. Our experiments, conducted on models ranging from 7B to 33B parameters, yield three key findings: (i) while these factors directly affect overall model performance, some abilities are more responsive to scaling, whereas others demonstrate significant resistance; (ii) the scaling sensitivity of different abilities to these factors can be explained by two features, Complexity and Transference; (iii) by tailoring training strategies to these varying sensitivities, specific abilities can be learned efficiently, enhancing performance on two public benchmarks.

Paper Structure (17 sections, 4 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: The impact of data volume, parameter scale, and construction method on the overall performance.
  • Figure 2: Abilities that are responsive to the data quantity and parameter scale in the human-curated dataset, also comparing the data efficiency of different construction methods with synthetic data.
  • Figure 3: Comparison of abilities with varying sensitivities to data scaling in the human-curated dataset, also comparing the data efficiency of different construction methods with synthetic data.
  • Figure 4: Comparison of abilities with varying sensitivities to parameter size scaling in the human-curated dataset, also comparing the data efficiency of different construction methods with synthetic data.
  • Figure 5: Abilities that are resistant to the data quantity and parameter scale in the human-curated dataset, also comparing the data efficiency of different construction methods with synthetic data.
  • ...and 6 more figures