PCMind-2.1-Kaiyuan-2B Technical Report

Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu, Yunyi Chen, Chenyi Dang, Hengtao Tao, Hui Wang, Fangming Liu, Kaifeng Lyu, Wenguang Chen

TL;DR

This technical report presents Kaiyuan-2B, a fully open-source 2B-parameter LLM designed for resource-constrained pretraining. It tackles data heterogeneity and limited compute via Quantile Data Benchmarking, Strategic Selective Repetition, and a Multi-Domain Curriculum Training pipeline, supported by a Spark-based preprocessing stack and FP16-stability techniques. The work demonstrates competitive performance in Chinese, math, and code while narrowing gaps to open-weight models, validating a practical recipe for open-source LLM pretraining. By releasing weights, data, and code under Apache 2.0, the paper provides a transparent, reproducible path for academia to advance open-source LLM capabilities under restricted resources.

Abstract

The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.

Paper Structure

This paper contains 52 sections, 2 equations, 15 figures, 16 tables, 1 algorithm.

Figures (15)

  • Figure 1: Model Performance Comparison. Kaiyuan-2B surpasses the frontier of fully open-source models at a similar scale, and approaches open-weight models such as Qwen2-1.5B [qwen2] and Llama3.2-3B [llama3.2]. The full set of corresponding benchmark scores is detailed in Table \ref{tab:model_comparison_full}.
  • Figure 2: Comparison of internal activation magnitudes before and after architectural optimization. The experiment is conducted with a 3B model.
  • Figure 3: Illustration of the Quantile Benchmarking Process. (1) Given a series of target quantiles (e.g., 0%, 20%, 40%, 60%, 80%), we select a data chunk around each target quantile as a probing dataset. (2) A small-scale reference model is then evaluated on each probing dataset under two settings: training from scratch, or resuming from a checkpoint for continual training.
  • Figure 4: Representative results showing task-dependent dataset characteristics: FineWeb-Edu excels on knowledge-intensive tasks (MMLU) while DCLM-Baseline performs better on commonsense reasoning (WinoGrande).
  • Figure 5: Five training phases of Kaiyuan-2B. Later phases retain more refined data samples.
  • ...and 10 more figures
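
Step (1) of the quantile benchmarking process in the Figure 3 caption, selecting a data chunk around each target quantile, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name select_probing_chunks, the per-sample quality scores, the descending ranking (quantile 0% taken as the highest-scoring end), and the fixed chunk_size are all assumptions made for illustration. Step (2), training or continually training a small reference model on each probing dataset, is not shown.

```python
from typing import Dict, List, Sequence


def select_probing_chunks(
    scores: Sequence[float],
    target_quantiles: Sequence[float],
    chunk_size: int,
) -> Dict[float, List[int]]:
    """Return, for each target quantile, the sample indices in a chunk around it.

    Assumptions (hypothetical, not taken from the report):
    - samples are ranked by descending quality score, so quantile 0.0 is the
      top of the ranking;
    - each chunk has a fixed size, is centered on the quantile position, and
      is clamped so that it stays inside the dataset.
    """
    n = len(scores)
    if not 0 < chunk_size <= n:
        raise ValueError("chunk_size must be in (0, len(scores)]")

    # Indices of all samples, best-scoring first.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)

    chunks: Dict[float, List[int]] = {}
    for q in target_quantiles:
        if not 0.0 <= q <= 1.0:
            raise ValueError("quantiles must lie in [0, 1]")
        center = round(q * n)
        # Center the window on the quantile, then clamp to valid bounds.
        start = min(max(center - chunk_size // 2, 0), n - chunk_size)
        chunks[q] = ranked[start:start + chunk_size]
    return chunks


# Toy usage: 100 samples with synthetic scores, the quantiles from Figure 3.
toy_scores = [i / 100 for i in range(100)]
probes = select_probing_chunks(toy_scores, [0.0, 0.2, 0.4, 0.6, 0.8], chunk_size=10)
```

Each entry of probes would then serve as one probing dataset, so that results across quantiles indicate how downstream performance varies with the quality score within a single dataset.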