ChuXin: 1.6B Technical Report

Xiaomin Zhuang; Yufan Jiang; Qiaozhi He; Zhihua Wu


TL;DR

ChuXin addresses the need for fully open, transparent large language models by presenting a $1.6\text{B}$-parameter, open-source LLM trained on a multilingual corpus totaling $2.3\text{T}$ tokens, with training data, processes, and evaluation code publicly released. Building on the LLaMA2 backbone, ChuXin integrates architectural choices such as Rotary Positional Embeddings (RoPE), RMSNorm, a block-diagonal attention mask, and the DeepSeek BBPE tokenizer, while employing a SwiGLU activation and omitting biases and weight tying. A key contribution is extending the context length to $10^6$ tokens through light continual pretraining on length-upsampled data, enabling strong long-context retrieval performance. Empirically, ChuXin achieves competitive results with other open-source $1.6\text{B}$-scale models on English and Chinese benchmarks and demonstrates the value of full open-sourcing for reproducibility, bias analysis, and risk assessment. The work lays a foundation for future open-source growth, including instruction tuning and multimodal extensions, and promises ongoing documentation of training challenges to guide the community.
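The block-diagonal attention mask mentioned above can be illustrated with a small sketch. It assumes that several documents are packed back to back into one training sequence, that the mask is combined with ordinary causal masking, and that document boundaries are given as a list of lengths. The function name `block_diagonal_causal_mask` and the example lengths are hypothetical and are not taken from the ChuXin code base.

```python
from typing import List


def block_diagonal_causal_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Return mask[i][j] = True when query position i may attend to key position j.

    Documents are assumed to be packed back to back; a position may attend
    only to itself and to earlier positions inside its own document.
    """
    # Map every packed position to the index of the document it belongs to.
    doc_ids = [doc for doc, length in enumerate(doc_lengths) for _ in range(length)]
    total = len(doc_ids)
    return [
        [doc_ids[i] == doc_ids[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


if __name__ == "__main__":
    # Two packed documents of lengths 2 and 3 (hypothetical sizes).
    for row in block_diagonal_causal_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query position; an `x` marks a key position it may attend to. Tokens of the second document never see tokens of the first, which is the point of resetting the mask at document boundaries.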

Abstract

In this report, we present ChuXin, an entirely open-source language model with 1.6 billion parameters. Unlike most works, which open-source only the model weights and architecture, we have made everything needed to train a model available, including the training data, the training process, and the evaluation code. Our goal is to empower and strengthen the open research community, fostering transparency and enabling a new wave of innovation in the field of language modeling. Furthermore, we extend the context length to 1M tokens through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. The weights for both models, the base model and the 1M-context variant, are available on Hugging Face for download and use.


Paper Structure (18 sections, 2 figures, 8 tables)


Introduction
Pretraining
Model and Architecture
Rotary positional embeddings (RoPE).
RMSNorm.
Attention Mask.
Tokenizer.
Further details.
Pretraining Data
Training
Results
Common Sense Reasoning and Reading Comprehension.
Open LLM Leaderboard.
Chinese Evaluation.
Performance during Training
...and 3 more sections

Figures (2)

Figure 1: Performance on commonsense reasoning benchmarks during pre-training
Figure 2: Retrieval test on ChuXin-1M across context lengths via "Needle In a Haystack"
