Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale

Ayush Kaushal, Tejas Vaidhya, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish

TL;DR

This work investigates memory bottlenecks in large language model inference and proposes TriLMs, ternary-weight transformers, as a memory-efficient alternative to floating-point and post-training-quantized models. The authors introduce Spectra, an open suite spanning FloatLMs, QuantLMs (3–8 bits), and TriLMs across 99M–3.9B parameters trained on 300B tokens, and show that TriLMs exhibit superior scaling in model size (bits) and competitive or superior downstream performance at scale, notably matching FloatLMs at 3.9B parameters despite using far fewer bits. They analyze memory implications, information-theoretic weight variance, and optimization dynamics, deriving scaling laws with exponent $\alpha=0.26$ and discussing practical deployment benefits including edge devices and reduced training costs. The release of 500+ intermediate checkpoints and the broader discussion of interpretability and hardware impact position Spectra as a valuable resource for researchers and practitioners aiming to build efficient, scalable LLMs with real-world latency and memory constraints.

Abstract

Rapid advancements in GPU computational power have outpaced memory capacity and bandwidth growth, creating bottlenecks in Large Language Model (LLM) inference. Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but it suffers from significant performance degradation below 4-bit precision. This paper addresses these challenges by investigating the pretraining of low-bitwidth models, specifically Ternary Language Models (TriLMs), as an alternative to traditional floating-point models (FloatLMs) and their post-training quantized versions (QuantLMs). We present the Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens. Our comprehensive evaluation demonstrates that TriLMs offer superior scaling behavior in terms of model size (in bits). Surprisingly, at scales exceeding one billion parameters, TriLMs consistently outperform their QuantLM and FloatLM counterparts for a given bit size across various benchmarks. Notably, the 3.9B parameter TriLM matches the performance of the FloatLM 3.9B across all benchmarks, despite having fewer bits than FloatLM 830M. Overall, this research provides valuable insights into the feasibility and scalability of low-bitwidth language models, paving the way for the development of more efficient LLMs. To enhance understanding of low-bitwidth models, we are releasing 500+ intermediate checkpoints of the Spectra suite at https://github.com/NolanoOrg/SpectraSuite.
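
The size comparison in the abstract (a 3.9B TriLM having fewer bits than an 830M FloatLM) can be checked with back-of-the-envelope arithmetic. The sketch below is only an idealization, not the paper's accounting: it assumes every weight is ternary and costs log2(3) ≈ 1.58 bits, that FloatLM weights cost 16 bits each, and it ignores embeddings, scale values, and any other tensors kept in higher precision. The function name `model_size_gbits` is hypothetical.

```python
import math

# Assumption: an ideal encoding of a ternary weight in {-1, 0, 1}
# costs log2(3) bits (about 1.58). Real storage formats may use more.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

# Assumption: FloatLM weights are stored in FP16 (16 bits each).
BITS_PER_FP16_WEIGHT = 16


def model_size_gbits(num_params: float, bits_per_weight: float) -> float:
    """Idealized model size in gigabits (1e9 bits), weights only."""
    return num_params * bits_per_weight / 1e9


# Parameter counts taken from the abstract.
trilm_3_9b = model_size_gbits(3.9e9, BITS_PER_TERNARY_WEIGHT)
floatlm_830m = model_size_gbits(830e6, BITS_PER_FP16_WEIGHT)
floatlm_3_9b = model_size_gbits(3.9e9, BITS_PER_FP16_WEIGHT)

print(f"TriLM 3.9B (ternary, idealized): {trilm_3_9b:.2f} Gbit")
print(f"FloatLM 830M (FP16):             {floatlm_830m:.2f} Gbit")
print(f"FloatLM 3.9B (FP16):             {floatlm_3_9b:.2f} Gbit")
```

Under these assumptions the 3.9B TriLM comes to roughly 6.2 Gbit against about 13.3 Gbit for the 830M FloatLM. Any layers kept in higher precision would raise the TriLM figure, so this is best read as a lower bound on its size.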

Paper Structure (65 sections, 1 equation, 22 figures, 13 tables)

Figures (22)

  • Figure 1: Common Sense and Reasoning (C&R) and LAMBADA accuracy for ternary TriLM, FP16 FloatLM, and quantized QuantLM models across model sizes, measured in bits and in number of parameters. C&R scores are averaged across 6 benchmarks. At 3B+ scales, TriLMs perform better for their size than QuantLMs and are competitive with FloatLMs of the same parameter count. See Tables \ref{tab:evaluation_spectra_suite_part1.a}, \ref{tab:evaluation_spectra_suite_part1.b}, and \ref{tab:evaluation_spectra_suite_part2} for details.
  • Figure 2: Expected gains from low bitwidth modeling. TriLMs can fit over 300B parameters on a single H100 and achieve up to a theoretical maximum of 10x faster autoregressive decoding compared to FloatLM.
  • Figure 3: Shannon entropy (in bits) of discretized weight distribution with increasing number of bins.
  • Figure 4: Differential entropy of Gaussian fits on weight distributions across different scales.
  • Figure 5: The computational flow of forward, backward, and inference processes in TriLM's linear layer with N-Way model parallelism.
  • ...and 17 more figures