Steel-LLM: From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM

Qingshui Gu; Shu Li; Tianyu Zheng; Zhaoxiang Zhang

TL;DR

Steel-LLM demonstrates that a compact, open-source, Chinese-centric LLM can achieve competitive benchmark results under restricted compute by combining a Soft MoE feed-forward network (FFN) design with an enhanced FFN, a resource-efficient training framework, and a careful data strategy. The work is fully transparent: it releases the training pipeline, datasets, intermediate checkpoints, and ablation results to facilitate reproducibility. Key contributions include a detailed description of a resource-efficient architecture, an efficient pretraining and post-training workflow, and evidence that data distribution choices and exam-style tasks improve performance on Chinese and multilingual benchmarks. The open-source release and practical guidance make LLM development more accessible to smaller research teams and non-English language communities.

Abstract

Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.
