Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?

Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke

TL;DR

This paper investigates whether starting from 16-bit pre-training and transitioning to 1.58-bit quantization-aware training yields better performance and efficiency than training entirely at 1.58-bit. It introduces a continual 16-to-1.58-bit pre-training framework, analyzes the optimal transition point $t^\star$, and examines the roles of optimizer-state retention and gradual quantization phasing. Across 11 downstream tasks, continual pre-training consistently outperforms full 1.58-bit training and is competitive with 16-bit training, highlighting practical pathways to inference-efficient LLMs. The findings imply that converting existing 16-bit models to 1.58-bit models via continued training is viable and data-efficient, with modest hyperparameter implications.

Abstract

Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training, where the models are first trained with 16-bit precision and then transition into 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit training and leaves models closer to those which have undergone 16-bit training. We further investigate the effects of retaining the optimizer state at the transition point and gradually phasing in quantization strength -- finding that both techniques alleviate the magnitude of loss spikes, but also that these effects can be compensated through further training.
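
The 16-to-1.58-bit schedule described above can be illustrated with a minimal sketch. It assumes the absmean ternary quantizer commonly associated with BitNet b1.58 and a linear ramp for phasing in quantization strength; neither detail is stated in this summary. The function names and the `t_star` and `ramp_steps` parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def absmean_ternary(w, eps=1e-5):
    """Map weights to {-1, 0, +1} times a per-tensor scale (absmean scheme, assumed)."""
    scale = np.mean(np.abs(w)) + eps
    return np.clip(np.round(w / scale), -1, 1) * scale

def quant_strength(step, t_star, ramp_steps=0):
    """Return lambda in [0, 1]: 0 means a plain 16-bit forward pass, 1 means fully quantized.

    t_star is the transition step; ramp_steps > 0 phases quantization in
    linearly (an assumed schedule), while ramp_steps == 0 switches abruptly.
    """
    if step < t_star:
        return 0.0
    if ramp_steps <= 0:
        return 1.0
    return min(1.0, (step - t_star) / ramp_steps)

def effective_weights(w, step, t_star, ramp_steps=0):
    """Blend full-precision and ternary weights for the forward pass.

    In actual training the quantization residual would be wrapped in a
    straight-through estimator so that gradients reach the latent weights w.
    """
    lam = quant_strength(step, t_star, ramp_steps)
    return w + lam * (absmean_ternary(w) - w)

# Hypothetical example: transition at step 2000, ramp over 1000 steps.
w = np.array([[0.9, -0.05], [-1.2, 0.4]])
for step in (0, 2000, 2500, 3000):
    print(step, quant_strength(step, t_star=2000, ramp_steps=1000))
```

Before `t_star` the forward pass uses the unmodified weights, which corresponds to ordinary 16-bit training. Afterwards the weights move toward their ternary values, either at once or over `ramp_steps` steps.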

Paper Structure

This paper contains 26 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: 16-to-1.58-bit continual pre-training. Blue and yellow denote batches processed under 16-bit and 1.58-bit training, respectively.
  • Figure 2: Training loss curves comparing the effect of different variants of 1.58-bit continual pre-training from 16-bit models into 1.58-bit models. Our baseline is full 1.58-bit quantization-aware pre-training (cpt-1.58-10K; blue). Standard 16-bit training without quantization (cpt-16-10K) is displayed in red. Continual pre-trainings, i.e., transitioning from 16-bit into 1.58-bit training, are marked according to the transition points at 2K, 4K, and 6K optimizer steps, respectively. All models have been trained for 10K steps in total. All training loss curves have been smoothed with an exponential filter with window size $64$.
  • Figure 3: Downstream evaluation of full 16-bit, continually pre-trained 1.58-bit, and full 1.58-bit training runs. The y-axes are deliberately left unscaled so that relative differences are more visible.