BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks

Jacob Nielsen; Peter Schneider-Kamp

TL;DR

The paper tackles efficient deployment of neural networks by extending 1.58-bit quantization to small language and vision models. It introduces a median-based variant of BitNet b1.58 implemented via the BitLinear layer, quantizing weights to $\{-1,0,1\}$ and activations to 8 bits, with a straight-through estimator for the gradients. In experiments on small language models (SLMs) and small vision datasets, the approach delivers near state-of-the-art results for SLMs when hidden sizes are increased and achieves competitive performance on small vision tasks. The paper also analyzes robustness to changes in learning rate and weight decay. These findings support the viability of 1.58-bit quantization-aware training (QAT) for low-resource deployment and motivate future work on broader architectures and hardware-accelerated implementations.

Abstract

Recently proposed methods for 1-bit and 1.58-bit quantization-aware training investigate the performance and behavior of these methods in the context of large language models, finding state-of-the-art performance for models with more than 3B parameters. In this work, we investigate 1.58-bit quantization for small language and vision models ranging from 100K to 48M parameters. We introduce a variant of BitNet b1.58 that allows relying on the median rather than the mean in the quantization process. Through extensive experiments we investigate the performance of 1.58-bit models obtained through quantization-aware training. We further investigate the robustness of 1.58-bit quantization-aware training to changes in the learning rate and regularization through weight decay, finding different patterns for small language and vision models than previously reported for large language models. Our results show that 1.58-bit quantization-aware training provides state-of-the-art performance for small language models when doubling hidden layer sizes and reaches or even surpasses state-of-the-art performance for small vision models of identical size. Ultimately, we demonstrate that 1.58-bit quantization-aware training is a viable and promising approach also for training smaller deep learning networks, facilitating deployment of such models in low-resource use cases and encouraging future research.
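
To make the mean-versus-median choice concrete, the following is a minimal sketch of ternary weight quantization. It assumes the usual absmean recipe (divide by a scale, round, clip to $[-1, 1]$) and simply swaps the mean of the absolute weights for their median. The function name `ternarize`, the `eps` constant, and the example weights are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, median

def ternarize(weights, measure="mean", eps=1e-8):
    """Quantize a flat list of weights to {-1, 0, 1}.

    Hypothetical helper: scales by the mean or median of |w|,
    then rounds and clips. Returns (ternary_weights, scale).
    """
    abs_vals = [abs(w) for w in weights]
    scale = mean(abs_vals) if measure == "mean" else median(abs_vals)
    # Python's round() rounds halves to the nearest even integer.
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale

# Made-up weights with two large-magnitude entries.
weights = [0.9, -0.05, 0.4, -1.2, 0.15, 0.02]
q_mean, s_mean = ternarize(weights, "mean")
q_median, s_median = ternarize(weights, "median")
print("mean  :", s_mean, q_mean)
print("median:", s_median, q_median)
```

In this toy example the median of the absolute values is smaller than the mean, because the two large weights pull the mean upward, so one more weight is mapped to a nonzero value under the median variant. This illustrates the mechanism only and says nothing about which variant performs better in training.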

Paper Structure (9 sections, 6 equations, 5 figures, 3 tables)

Introduction
Method
b1.58 Quantization
Experimental setup
Results
Small Language Models
Small Vision Models
Discussion
Conclusion

Figures (5)

Figure 1: The BitLinear layer is the backbone of the BitNet b1.58 Reloaded architecture. It provides a drop-in replacement for linear layers (often referred to as feed-forward networks or multi-layer perceptrons) in any architecture. AbsMeasure denotes the mean or median of the absolute values of the weights. The factors $x_{scale}$ and $w_{scale}$ are the scaling factors for the input and the 16-bit weights, respectively, and are used in the dequantization. We employ a straight-through estimator for the backward computation of the gradients.
Figure 2: Scaling behaviour of 16-bit and 1.58-bit (mean and median) training for SLMs over 10 epochs (= 1,020 evaluations on the test set).
Figure 3: Hyperparameter tuning regarding weight decay and learning rate for SLMs over 10 epochs (= 1,020 evaluations on the test set).
Figure 4: The effect of weight decay (WD) on the training robustness for CIFAR100 over 10 epochs.
Figure 5: The effect of the learning rate (LR) on the training robustness for CIFAR100 over 10 epochs.
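
The BitLinear forward pass described in the Figure 1 caption can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not the authors' implementation: activations are quantized per vector with an absmax scale to signed 8-bit integers, weights are ternarized with the chosen AbsMeasure, and the integer result is rescaled by `w_scale * x_scale`. All function names, the `eps` constant, and the exact scale conventions are hypothetical. Only the forward pass is shown; during training, a straight-through estimator would treat the rounding as the identity when computing gradients.

```python
from statistics import mean, median

def ternarize(matrix, measure="mean", eps=1e-8):
    """Ternarize a weight matrix (list of rows) using mean or median of |w|."""
    flat_abs = [abs(w) for row in matrix for w in row]
    w_scale = mean(flat_abs) if measure == "mean" else median(flat_abs)
    w_q = [[max(-1, min(1, round(w / (w_scale + eps)))) for w in row]
           for row in matrix]
    return w_q, w_scale

def quantize_activations(x, bits=8, eps=1e-8):
    """Absmax quantization of one input vector to signed integers."""
    q_max = 2 ** (bits - 1) - 1      # 127 for 8 bits
    q_min = -(2 ** (bits - 1))       # -128 for 8 bits
    abs_max = max(abs(v) for v in x) + eps
    x_q = [max(q_min, min(q_max, round(v * q_max / abs_max))) for v in x]
    x_scale = abs_max / q_max        # so that x is roughly x_q * x_scale
    return x_q, x_scale

def bit_linear_forward(x, matrix, measure="mean"):
    """Integer matrix-vector product followed by dequantization."""
    w_q, w_scale = ternarize(matrix, measure)
    x_q, x_scale = quantize_activations(x)
    acc = [sum(w * v for w, v in zip(row, x_q)) for row in w_q]
    return [a * w_scale * x_scale for a in acc]

# Toy weights whose nonzero entries all have magnitude 0.5.
W = [[0.5, -0.5, 0.0], [0.0, 0.5, 0.5]]
x = [1.0, -2.0, 0.5]
y_median = bit_linear_forward(x, W, measure="median")
y_mean = bit_linear_forward(x, W, measure="mean")
print("median:", y_median)
print("mean  :", y_mean)
```

On this toy matrix the median scale equals the magnitude of the nonzero weights, so the median variant reproduces the full-precision product `[1.5, -0.75]` up to activation rounding, while the mean scale is smaller and shrinks the output. That is a property of the chosen example, not a claim about the paper's results.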
