ChuXin: 1.6B Technical Report

Xiaomin Zhuang, Yufan Jiang, Qiaozhi He, Zhihua Wu

TL;DR

ChuXin addresses the need for fully open, transparent large language models by presenting a $1.6\text{B}$-parameter, open-source LLM trained on a multilingual corpus totaling $2.3\text{T}$ tokens, with the training data, training process, and evaluation code publicly released. Building on the LLaMA2 backbone, ChuXin uses Rotary Positional Embeddings (RoPE), RMSNorm, a block-diagonal attention mask, the DeepSeek BBPE tokenizer, and a SwiGLU activation, and it omits bias terms and weight tying. A key contribution is extending the context length to $10^6$ tokens through lightweight continual pretraining on length-upsampled data, which enables strong long-context retrieval performance. Empirically, ChuXin achieves results competitive with other open-source $1.6\text{B}$-scale models on English and Chinese benchmarks and demonstrates the value of full open-sourcing for reproducibility, bias analysis, and risk assessment. The work lays a foundation for future open-source development, including instruction tuning and multimodal extensions, and promises ongoing documentation of training challenges to guide the community.
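
To make the block-diagonal attention mask mentioned above concrete, here is a minimal sketch in plain Python. It assumes the common setup in which several documents are packed into one training sequence and each token may attend only to earlier tokens of its own document. The function name `block_diagonal_causal_mask` and the `doc_lengths` argument are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
def block_diagonal_causal_mask(doc_lengths):
    """Build a boolean mask for a packed sequence (illustrative sketch).

    doc_lengths: token count of each document packed into the sequence.
    Returns a list of rows; mask[q][k] is True when query position q
    may attend to key position k.
    """
    # Assign every token position the index of the document it belongs to.
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    total = len(doc_ids)
    # Allow attention only within the same document (block-diagonal)
    # and only to the current or earlier positions (causal).
    return [
        [doc_ids[q] == doc_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    # Two packed documents of lengths 2 and 3.
    for row in block_diagonal_causal_mask([2, 3]):
        print("".join("1" if allowed else "." for allowed in row))
```

With this mask, the first token of the second document cannot see any token of the first document, so packed documents do not leak context into one another.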

Abstract

In this report, we present ChuXin, an entirely open-source language model with 1.6 billion parameters. Unlike most works, which open-source only the model weights and architecture, we have made available everything needed to train the model, including the training data, the training process, and the evaluation code. Our goal is to empower and strengthen the open research community, fostering transparency and enabling a new wave of innovation in language modeling. Furthermore, we extend the context length to 1M tokens through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. The weights for both models (the base model and the 1M-context variant) are available on Hugging Face to download and use.

Paper Structure (18 sections, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Performance on commonsense reasoning benchmarks during pre-training
  • Figure 2: Retrieval test on ChuXin-1M across context lengths via "Needle In a Haystack"