BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

Nolan Dey; Daria Soboleva; Faisal Al-Khateeb; Bowen Yang; Ribhu Pathria; Hemant Khachane; Shaheer Muhammad; Zhiming Chen; Robert Myers; Jacob Robert Steeves; Natalia Vassilieva; Marvin Tom; Joel Hestness

TL;DR

BTLM-3B-8K targets high-quality language understanding and generation in a compact 3B parameter model that supports long contexts. The authors combine SwiGLU, ALiBi, and μP with a two-phase training regime on the deduplicated SlimPajama dataset (627B tokens) to achieve state-of-the-art 3B performance and results competitive with some 7B models, while fitting inference in 3GB of memory with 4-bit quantization. They report strong results across common sense reasoning, reading comprehension, world knowledge, MMLU, code, and long-context benchmarks, with notable gains in long-context interpolation and extrapolation, aided by variable context-length training. The work emphasizes data quality, training stability, and hardware scalability (Cerebras CS-2), and releases both the model weights and the SlimPajama data under Apache 2.0, potentially broadening access to capable LLMs on edge devices and for long-document tasks.
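As a rough sanity check on the 3GB figure, the weight-only footprint can be estimated from the parameter count and the bits stored per weight. The sketch below assumes a nominal 3e9 parameters (the exact count is not given here) and ignores activations, the attention cache, and quantization metadata, so it is a lower bound and not a measured footprint. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); excludes activations and caches."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a nominal "3B" parameter count, used only for illustration.
NOMINAL_PARAMS = 3e9

for bits in (16, 8, 4):
    gb = weight_memory_gb(NOMINAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about 1.5 GB, which leaves headroom within a 3GB budget for activations and runtime overhead.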

Abstract

We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter models. Additionally, BTLM-3B-8K provides excellent long context performance, outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192 context length. We trained the model on a cleaned and deduplicated SlimPajama dataset; aggressively tuned the μP hyperparameters and schedule; used ALiBi position embeddings; and adopted the SwiGLU nonlinearity. On Hugging Face, the most popular models have 7B parameters, indicating that users prefer the quality-size ratio of 7B models. Compacting the 7B parameter model to one with 3B parameters, with little performance impact, is an important milestone. BTLM-3B-8K needs only 3GB of memory with 4-bit precision and takes 2.5x less inference compute than 7B models, helping to open up access to a powerful language model on mobile and edge devices. BTLM-3B-8K is available under an Apache 2.0 license on Hugging Face: https://huggingface.co/cerebras/btlm-3b-8k-base.
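ALiBi replaces learned position embeddings with a fixed, head-specific linear penalty on attention logits that grows with query-key distance. The following is a minimal pure-Python sketch of that bias, assuming the geometric slope schedule from the ALiBi paper for a power-of-two head count and folding the causal mask into the same matrix. The function names are hypothetical and this is not the BTLM implementation.

```python
def alibi_slopes(num_heads: int) -> list:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads (n a power of two)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles a power-of-two number of heads")
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads: int, seq_len: int) -> list:
    """bias[h][i][j] is added to the attention logit of query i and key j.

    Past keys (j <= i) get slope * (j - i), which is <= 0.
    Future keys (j > i) get -inf, i.e. the causal mask (an addition in this sketch).
    """
    return [
        [
            [m * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]


# Example: the bias for the first of 8 heads on a 4-token sequence.
for row in alibi_bias(8, 4)[0]:
    print(row)
```

Because the penalty depends only on the relative distance between query and key, the same rule is defined for any sequence length, including lengths longer than those seen in training.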

Paper Structure (34 sections, 1 equation, 6 figures, 13 tables)

This paper contains 34 sections, 1 equation, 6 figures, 13 tables.

Introduction
BTLM Architecture and Training
Model Architecture
Pretraining Data
Training Procedure
Training Loss Stability
Hardware
Model Evaluation
Common Sense Reasoning
Reading Comprehension
World Knowledge
Massive Multitask Language Understanding
Code
Long Context Evaluation
Long Context Interpolation
...and 19 more sections

Figures (6)

Figure 1: SlimPajama train cross-entropy loss versus training tokens. Training was scaled between different numbers of CS-2 systems depending on cluster availability.
Figure 2: Accuracy on the LongEval-Lines and LongEval-Topics long-range retrieval tasks.
Figure 3: SlimPajama test set cross-entropy loss for various BTLM checkpoints at each token position. Inference is performed on examples packed to 32768 tokens in length.
Figure 4: Overview of each architecture and training hyperparameter improvement ablated starting from a Cerebras-GPT μP baseline (Dey et al., 2023). Power law fits are included for 20 TPP and 236.4 TPP baselines. Relative to these power laws we illustrate the FLOP and parameter differences at the same loss.
Figure 5: Loss versus token position for various sequence length schedules. Loss is plotted with a 100-value moving average to improve plot readability.
...and 1 more figure
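The smoothing mentioned in the Figure 5 caption can be sketched as a simple moving average over per-position losses. The caption does not say whether the window is trailing or centered, so the trailing window below is an assumption, and the function name is hypothetical.

```python
def moving_average(values, window=100):
    """Trailing simple moving average.

    Returns len(values) - window + 1 points, or an empty list if the input
    is shorter than the window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    total = sum(values[:window])
    smoothed = [total / window]
    for k in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        total += values[k] - values[k - window]
        smoothed.append(total / window)
    return smoothed


# Small illustration with a window of 2 instead of 100.
print(moving_average([1, 2, 3, 4, 5], window=2))
```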
