GEB-1.3B: Open Lightweight Large Language Model

Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu

TL;DR

The paper addresses the need for CPU-efficient large language models by introducing GEB-1.3B, a 1.3B-parameter bilingual (Chinese and English) LLM trained on 550B tokens. It combines efficiency-focused design choices (RoPE, Group-Query-Attention, FlashAttention-2) with alignment through Supervised Fine-Tuning and Direct Preference Optimization, aiming for strong performance on general benchmarks while keeping CPU inference practical. The authors report competitive scores on MMLU, C-Eval, and CMMLU, low toxicity, and a CPU runtime of about 12 tokens/s in FP32; quantization is planned for further acceleration. The model is released as open source to foster research and enable edge deployment, illustrating that smaller models trained on a large bilingual corpus can approach or exceed larger counterparts on many tasks while remaining accessible and efficient.

Abstract

Recently developed large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated impressive abilities, even surpassing human-level performance on several tasks. Despite their success, the resource-intensive demands of these models, which require significant computational power for both training and inference, limit their deployment to high-performance servers. The heavy computation also tends to increase response latency. With the growing need for LLMs to operate efficiently on CPUs, research on lightweight models optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a lightweight LLM trained on 550 billion tokens in both Chinese and English. We employ novel training techniques, including RoPE, Group-Query-Attention, and FlashAttention-2, to accelerate training while maintaining model performance. Additionally, we fine-tune the model using 10 million samples of instruction data to enhance alignment. GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparable models such as MindLLM-1.3B and TinyLLaMA-1.1B. Notably, the FP32 version of GEB-1.3B achieves commendable inference times on CPUs, with ongoing efforts to further improve speed through advanced quantization techniques. The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs, promising to foster further research and innovation in the field.
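
As a rough sanity check on the figures quoted above, the following Python sketch works out the weight-only memory footprint at a few storage widths, the ratio of training tokens to parameters, and the time to generate a reply at the quoted CPU throughput. It assumes a nominal 1.3e9 parameters (the exact count is not given here), takes the roughly 12 tokens/s FP32 figure from the TL;DR, and uses FP16 and INT8 only as illustrative precisions, since the source does not say which quantization scheme is planned. The names `weight_gb` and `generation_seconds` are hypothetical helpers, not part of any GEB-1.3B release.

```python
# Back-of-envelope figures derived from numbers quoted in the summary.
PARAMS = 1.3e9          # nominal parameter count ("1.3B"); exact count is an assumption
TRAIN_TOKENS = 550e9    # training tokens reported in the abstract
CPU_TOKENS_PER_S = 12   # approximate FP32 CPU throughput quoted in the TL;DR

# Standard storage widths in bytes; FP16 and INT8 are illustrative only.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gb(params: float, dtype: str) -> float:
    """Weight-only memory in decimal gigabytes (ignores activations and KV cache)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def generation_seconds(num_tokens: int, tokens_per_s: float) -> float:
    """Time to emit num_tokens at a constant decoding rate."""
    return num_tokens / tokens_per_s


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gb(PARAMS, dtype):.1f} GB of weights")
    print(f"training tokens per parameter: {TRAIN_TOKENS / PARAMS:.0f}")
    print(f"256-token reply: {generation_seconds(256, CPU_TOKENS_PER_S):.1f} s")
```

Under these assumptions the FP32 weights alone come to about 5.2 GB, and the corpus amounts to roughly 420 training tokens per parameter. The estimate covers weights only, so actual memory use during inference would be higher.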

Paper Structure

This paper contains 16 sections, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: Loss curves before and after adopting four measures.
  • Figure 2: Results on the C-Eval benchmark.