BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

Nolan Dey, Daria Soboleva, Faisal Al-Khateeb, Bowen Yang, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming Chen, Robert Myers, Jacob Robert Steeves, Natalia Vassilieva, Marvin Tom, Joel Hestness

TL;DR

BTLM-3B-8K tackles the challenge of delivering high-quality language understanding and generation with a compact 3B parameter model that supports long contexts. The authors combine SwiGLU, ALiBi, and μP with a two-phase training regime on the deduplicated SlimPajama dataset (627B tokens) to achieve state-of-the-art 3B performance and results competitive with some 7B models, while enabling inference in 3GB of RAM with 4-bit quantization. They report strong results across common sense reasoning (CSR), reading comprehension (RC), world knowledge (WK), MMLU, code, and long-context benchmarks, and show notable gains in long-context interpolation and extrapolation, aided by variable context-length training. The work emphasizes data quality, training stability, and hardware scalability (Cerebras CS-2), and releases both the model weights and the SlimPajama data under Apache 2.0, potentially broadening access to powerful LLM capabilities on edge devices and for long-document tasks.
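To make the SwiGLU component mentioned above more concrete, here is a minimal pure-Python sketch of a SwiGLU feed-forward layer in its commonly published form (a SiLU-gated branch multiplied elementwise by a linear branch, then projected back down). The function names, weight names, and toy shapes are assumptions chosen for illustration; this is not BTLM's actual implementation.

```python
import math


def silu(x):
    """SiLU (Swish with beta = 1): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Generic SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x)).

    All weight names are hypothetical; biases are omitted for brevity.
    """
    gate = [silu(g) for g in matvec(w_gate, x)]   # gated branch
    up = matvec(w_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise product
    return matvec(w_down, hidden)                 # project back down


# Toy usage with a 2-dimensional input and a 2-dimensional hidden layer.
x = [1.0, 2.0]
w_gate = [[0.0, 0.0], [1.0, 0.0]]
w_up = [[1.0, 1.0], [0.0, 1.0]]
w_down = [[1.0, 1.0]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

Compared with a plain two-matrix feed-forward layer, this form uses three weight matrices, so implementations typically shrink the hidden width to keep the parameter count comparable.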

Abstract

We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter models. Additionally, BTLM-3B-8K provides excellent long context performance, outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192 context length. We trained the model on a cleaned and deduplicated SlimPajama dataset; aggressively tuned the μP hyperparameters and schedule; used ALiBi position embeddings; and adopted the SwiGLU nonlinearity. On Hugging Face, the most popular models have 7B parameters, indicating that users prefer the quality-size ratio of 7B models. Compacting the 7B parameter model to one with 3B parameters, with little performance impact, is an important milestone. BTLM-3B-8K needs only 3GB of memory with 4-bit precision and takes 2.5x less inference compute than 7B models, helping to open up access to a powerful language model on mobile and edge devices. BTLM-3B-8K is available under an Apache 2.0 license on Hugging Face: https://huggingface.co/cerebras/btlm-3b-8k-base.
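As a rough illustration of the ALiBi position scheme named in the abstract, the sketch below builds the per-head linear attention biases in the standard published form: each head gets a fixed slope, and the bias added to an attention score grows linearly with the distance between query and key. The function names are hypothetical, the slope formula shown only covers power-of-two head counts, and none of this is taken from the BTLM code itself.

```python
import math


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8 / num_heads).

    Assumption: num_heads is a power of two (the simple case of the scheme).
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j] to add to causal attention scores.

    For key position j <= query position i the bias is -slope_h * (i - j);
    future positions (j > i) are masked with -inf.
    """
    bias = []
    for slope in alibi_slopes(num_heads):
        head = [
            [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        bias.append(head)
    return bias


# Toy usage: 8 heads over a 4-token sequence.
slopes = alibi_slopes(8)
bias = alibi_bias(8, 4)
```

Because the bias depends only on relative distance and not on learned position embeddings, the same function can be evaluated for sequence lengths longer than those seen in training, which is the property that makes this kind of scheme relevant to long-context extrapolation.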

Paper Structure

This paper contains 34 sections, 1 equation, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: SlimPajama train cross-entropy loss versus training tokens. Training was scaled between different numbers of CS-2 systems depending on cluster availability.
  • Figure 2: Accuracy on the LongEval-Lines and LongEval-Topics long-range retrieval tasks.
  • Figure 3: SlimPajama test set cross-entropy loss for various BTLM checkpoints at each token position. Inference is performed on examples packed to 32768 tokens in length.
  • Figure 4: Overview of each architecture and training hyperparameter improvement, ablated starting from a CerebrasGPT-μP baseline (Dey et al., 2023). Power-law fits are included for the 20 TPP and 236.4 TPP (tokens per parameter) baselines. Relative to these power laws, we illustrate the FLOP and parameter differences at the same loss.
  • Figure 5: Loss versus token position for various sequence length schedules. Loss is plotted with a 100-value moving average to improve plot readability.
  • ...and 1 more figure