YuLan-Mini: An Open Data-efficient Language Model

Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen

TL;DR

YuLan-Mini targets data-efficient pre-training for a $2.42\mathrm{B}$-parameter decoder-only transformer trained on $1.08\mathrm{T}$ tokens, extending context to $28{,}672$ tokens via RoPE with ABF. It introduces a threefold strategy: a robust data pipeline with cleaning and curriculum (data schedule), a stability-focused optimization method, and an annealing regime that combines targeted data selection with long-context training. On math, coding, and general benchmarks, YuLan-Mini achieves competitive results with a fraction of the data used by larger models and is accompanied by full reproducibility resources, including data compositions and training details. The work demonstrates that carefully designed data, stabilization techniques, and staged training can yield strong base models suitable for university labs, while enabling deeper research into how model capabilities develop during pre-training.

Abstract

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data schedule strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.

Paper Structure

This paper contains 99 sections, 14 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Performance comparison of YuLan-Mini against other base models, based on the average scores across eight benchmarks: GSM8K, MATH-500, HumanEval, MBPP, MMLU, ARC-Challenge, HellaSwag, and CEval. Floating Point Operations (FLOPs) are estimated using the scaling law formula $C=6ND$ proposed by Kaplan et al. (2020), where $N$ is the model size (number of parameters) and $D$ is the dataset size (number of training tokens). Models larger than 3B are plotted in gray.
  • Figure 2: Training loss and gradients during the pre-training process.
  • Figure 3: Comparison of training dynamics between divergent and convergent trials. The $y$-axis denotes the hidden-state variance and gradient norm on a log scale. Both trials have consistent loss but different trends in hidden-state variance and gradient norm.
  • Figure 4: Variance of the LN output of each layer.
  • Figure 5: Attention scores explode before LN.
  • ...and 5 more figures
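
As a worked illustration of the compute estimate used in Figure 1, the following minimal sketch applies $C=6ND$ to the parameter and token counts stated in the abstract (2.42B parameters, 1.08T tokens). The function name `estimate_training_flops` is hypothetical and chosen only for this example; the result is a rough approximation under the formula, not a measured figure from the paper.

```python
def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with C = 6 * N * D.

    n_params: model size N (number of parameters)
    n_tokens: dataset size D (number of training tokens)
    """
    return 6.0 * n_params * n_tokens


# Values taken from the abstract: N = 2.42B parameters, D = 1.08T tokens.
yulan_mini_flops = estimate_training_flops(2.42e9, 1.08e12)
print(f"{yulan_mini_flops:.3e}")  # prints 1.568e+22
```

Under these assumptions, the estimate is about $1.57 \times 10^{22}$ FLOPs. The same function can be applied to any other model for which $N$ and $D$ are known.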