Spectra: Surprising Effectiveness of Pretraining Ternary Language Models at Scale
Ayush Kaushal, Tejas Vaidhya, Arnab Kumar Mondal, Tejas Pandey, Aaryan Bhagat, Irina Rish
TL;DR
This work investigates memory bottlenecks in large language model inference and proposes TriLMs, ternary-weight transformers, as a memory-efficient alternative to floating-point and post-training-quantized models. The authors introduce Spectra, an open suite spanning FloatLMs, QuantLMs (3–8 bits), and TriLMs across 99M–3.9B parameters trained on 300B tokens. They show that TriLMs scale better with model size measured in bits and reach competitive or superior downstream performance at scale, notably matching FloatLMs at 3.9B parameters despite using far fewer bits. They analyze memory implications, information-theoretic weight variance, and optimization dynamics, derive scaling laws with exponent $\alpha=0.26$, and discuss practical deployment benefits, including use on edge devices and reduced training costs. The release of 500+ intermediate checkpoints, together with the broader discussion of interpretability and hardware impact, positions Spectra as a resource for researchers and practitioners building efficient, scalable LLMs under real-world latency and memory constraints.
Abstract
Rapid advancements in GPU computational power have outpaced memory capacity and bandwidth growth, creating bottlenecks in Large Language Model (LLM) inference. Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but it suffers from significant performance degradation below 4-bit precision. This paper addresses these challenges by investigating the pretraining of low-bitwidth models, specifically Ternary Language Models (TriLMs), as an alternative to traditional floating-point models (FloatLMs) and their post-training quantized versions (QuantLMs). We present the Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens. Our comprehensive evaluation demonstrates that TriLMs offer superior scaling behavior in terms of model size (in bits). Surprisingly, at scales exceeding one billion parameters, TriLMs consistently outperform their QuantLM and FloatLM counterparts for a given bit size across various benchmarks. Notably, the 3.9B parameter TriLM matches the performance of the FloatLM 3.9B across all benchmarks, despite having fewer bits than FloatLM 830M. Overall, this research provides valuable insights into the feasibility and scalability of low-bitwidth language models, paving the way for the development of more efficient LLMs. To enhance understanding of low-bitwidth models, we are releasing 500+ intermediate checkpoints of the Spectra suite at https://github.com/NolanoOrg/SpectraSuite.
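The claim that a 3.9B TriLM has fewer bits than an 830M FloatLM can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only and rests on two simplifying assumptions that are not taken from the abstract: every TriLM parameter costs the information-theoretic $\log_2 3 \approx 1.58$ bits, and every FloatLM parameter is stored in 16 bits. It ignores any components a real model might keep at higher precision (for example embeddings or scale factors), so the actual sizes will differ. The helper name `model_size_bits` is hypothetical.

```python
import math

# Assumption: a ternary weight takes one of three states, so its
# information-theoretic cost is log2(3) bits (about 1.58).
BITS_PER_TERNARY_WEIGHT = math.log2(3)

# Assumption: FloatLM weights are stored as 16-bit floating point.
BITS_PER_FLOAT_WEIGHT = 16


def model_size_bits(num_params: float, bits_per_weight: float) -> float:
    """Rough model size in bits: parameter count times bits per weight."""
    return num_params * bits_per_weight


trilm_3_9b = model_size_bits(3.9e9, BITS_PER_TERNARY_WEIGHT)
floatlm_830m = model_size_bits(830e6, BITS_PER_FLOAT_WEIGHT)

print(f"TriLM 3.9B   : {trilm_3_9b / 1e9:.2f} Gbit")
print(f"FloatLM 830M : {floatlm_830m / 1e9:.2f} Gbit")
print(f"TriLM smaller: {trilm_3_9b < floatlm_830m}")
```

Under these assumptions the TriLM comes to roughly 6.2 Gbit against about 13.3 Gbit for the FloatLM, which is consistent with the comparison stated in the abstract.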
