Unlocking the Theory Behind Scaling 1-Bit Neural Networks

Majid Daliri, Zhao Song, Chiwun Yang

TL;DR

The paper proves that, even with weights restricted to $\{-1, +1\}$, the training dynamics align with kernel behavior as the network width grows. This guarantees that the 1-bit model converges to an arbitrarily small loss as width increases.

Abstract

Recently, 1-bit Large Language Models (LLMs) have emerged, showcasing an impressive combination of efficiency and performance that rivals traditional LLMs. Research by Wang et al. (2023); Ma et al. (2024) indicates that the performance of these 1-bit LLMs progressively improves as the number of parameters increases, hinting at the potential existence of a Scaling Law for 1-bit Neural Networks. In this paper, we present the first theoretical result that rigorously establishes this scaling law for 1-bit models. We prove that, despite the constraint of weights restricted to $\{-1, +1\}$, the dynamics of model training inevitably align with kernel behavior as the network width grows. This theoretical breakthrough guarantees convergence of the 1-bit model to an arbitrarily small loss as width increases. Furthermore, we introduce the concept of the generalization difference, defined as the gap between the outputs of 1-bit networks and their full-precision counterparts, and demonstrate that this difference maintains a negligible level as network width scales. Building on the work of Kaplan et al. (2020), we conclude by examining how the training loss scales as a power-law function of the model size, dataset size, and computational resources utilized for training. Our findings underscore the promising potential of scaling 1-bit neural networks, suggesting that int1 could become the standard in future neural network precision.
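
To make the quantities in the abstract concrete, the following is a minimal sketch of a gap between a full-precision network and a copy whose hidden weights are forced into $\{-1, +1\}$. The two-layer ReLU architecture, the post-hoc sign binarization, and the names `binarize`, `two_layer_relu`, and `output_gap` are all assumptions made for illustration; this is not the paper's training procedure, and the sketch is not expected to reproduce its bounds.

```python
import numpy as np

def binarize(w):
    """Map every weight to {-1, +1}; zeros are sent to +1 by convention."""
    return np.where(w >= 0, 1.0, -1.0)

def two_layer_relu(x, w, a):
    """f(x) = (1/sqrt(m)) * sum_r a_r * relu(<w_r, x>) for hidden width m."""
    m = w.shape[0]
    return np.maximum(x @ w.T, 0.0) @ a / np.sqrt(m)

def output_gap(x, w, a):
    """Mean absolute gap between full-precision and sign-binarized outputs."""
    full = two_layer_relu(x, w, a)
    one_bit = two_layer_relu(x, binarize(w), a)
    return float(np.mean(np.abs(full - one_bit)))

# Hypothetical sizes: n inputs of dimension d, hidden width m
rng = np.random.default_rng(0)
n, d, m = 32, 8, 256
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit-norm inputs
w = rng.standard_normal((m, d))                # full-precision hidden weights
a = rng.choice([-1.0, 1.0], size=m)            # fixed output-layer signs
gap = output_gap(x, w, a)
print(f"width={m}, mean |f_full - f_1bit| = {gap:.4f}")
```

Rerunning with different values of `m` gives a quick way to inspect how this naive gap behaves with width. Any resemblance to the paper's generalization difference depends on its actual construction, which is not given in this summary.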

Paper Structure

This paper contains 61 sections, 31 theorems, 144 equations, 3 figures.

Key Result

Lemma 4.1

Assume $\lambda_{\min}(H^*) > 0$. Let $\delta \in (0, 1)$ and define $D := \max\{\sqrt{\log(md/\delta)}, 1\}$. Let $R \leq O(\lambda \delta / (\kappa^2 n^2 d D))$. Then for any $t \ge 0$, with probability at least $1 - \delta$, we have:

Figures (3)

  • Figure 1: Verification experiment for the scaling law for $1$-bit neural networks. Minimum training loss as the number of parameters scales, for an MLP model learning the complicated functions $f_1, f_2, f_3, f_4, f_5$ and $f_6$, which are defined in Section \ref{sub:exp_scaling_law}.
  • Figure 2: This plot shows the difference between the predicted and actual values of the functions on the test dataset. We tested three complex functions, as seen in the images, and the performance of the 1-bit model is nearly identical to that of the standard 32-bit floating-point model.
  • Figure 3: This plot shows the $\ell_2$ difference between both the training and test points and the predicted points throughout the training phase for different model sizes and parameter counts. Each plot demonstrates how the error decreases as training progresses, highlighting the impact of model size on both training and test performance.
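
Figure 1 relates minimum training loss to parameter count. One common way to check such data for a power-law trend of the form $L(N) \approx c\,N^{-\alpha}$ is a least-squares line fit in log-log space. The sketch below does this on synthetic data; the function name `fit_power_law` and every number in it are hypothetical and are not results from the paper.

```python
import numpy as np

def fit_power_law(n_params, loss):
    """Fit loss ~ c * n_params**(-alpha) by linear regression in log-log space."""
    # log(loss) = log(c) - alpha * log(n_params), so the slope is -alpha
    slope, intercept = np.polyfit(np.log(n_params), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic (not measured) losses generated from a known power law
n_params = np.array([1e3, 1e4, 1e5, 1e6])
loss = 5.0 * n_params ** -0.3
c, alpha = fit_power_law(n_params, loss)
print(f"c = {c:.3f}, alpha = {alpha:.3f}")
```

On real measurements the points will not lie exactly on a line, so the residuals of the log-log fit are worth inspecting before reading much into the fitted exponent.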

Theorems & Definitions (88)

  • Lemma 4.1: NTK convergence and PD property during the training, informal version of Lemma \ref{lem:ntk_convergence}
  • proof : Proof of Lemma \ref{lem:ntk_convergence:informal}
  • Theorem 4.2: Training convergence guarantee, informal version of Theorem \ref{thm:convergence}
  • proof : Proof sketch of Theorem \ref{thm:convergence:informal}
  • Proposition 4.3: Scaling Law for $1$-Bit Neural Networks
  • proof : Proof of Proposition \ref{pro:scaling_law}
  • Lemma 5.1: Function difference at initialization, informal version of Lemma \ref{lem:bounding_diff}
  • proof : Proof sketch of Lemma \ref{lem:bounding_diff:informal}
  • Theorem 5.2: Training and generalization similarity, informal version of Theorem \ref{thm:training_similarity}
  • proof
  • ...and 78 more