Fast and Slow Gradient Approximation for Binary Neural Network Optimization

Xinquan Chen, Junqi Gao, Biqing Qi, Dong Li, Yiang Luo, Fangyuan Li, Pengfei Li

TL;DR

This work tackles the gradient estimation challenge in Binary Neural Networks caused by non-differentiable quantization. It introduces Historical Gradient Storage (HGS) to capture gradient history and a Fast and Slow Gradient Generation (FSG) framework with fast-net and slow-net, augmented by Layer Recognition Embeddings to produce layer-specific gradients, enabling more accurate optimization. The authors provide a convergence analysis and demonstrate through CIFAR-10/100 experiments that FSG yields faster convergence and lower loss than several baselines, outperforming state-of-the-art hypernetwork-based methods. The approach offers a practical path to improved training of quantized networks on resource-constrained devices, with code and reproducibility details provided.

Abstract

Binary Neural Networks (BNNs) have garnered significant attention due to their potential for deployment on edge devices. However, the quantization function is non-differentiable, so its derivative cannot be backpropagated, which makes BNNs hard to optimize. To address this issue, hypernetwork-based methods, which use neural networks to learn the gradients of non-differentiable quantization functions, have emerged as a promising approach because they adaptively learn to reduce estimation errors. However, existing hypernetwork-based methods typically rely solely on current gradient information and neglect the influence of historical gradients. This oversight can lead to accumulated gradient errors when gradient momentum is calculated during optimization. To incorporate historical gradient information, we design a Historical Gradient Storage (HGS) module, which models the historical gradient sequence to generate the first-order momentum required for optimization. To further enhance gradient generation in hypernetworks, we propose a Fast and Slow Gradient Generation (FSG) method. Additionally, to produce more precise gradients, we introduce Layer Recognition Embeddings (LRE) into the hypernetwork, facilitating the generation of layer-specific fine gradients. Extensive comparative experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate that our method achieves faster convergence and lower loss values, outperforming existing baselines. Code is available at http://github.com/two-tiger/FSG.
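To make the non-differentiability problem concrete, the following minimal sketch estimates the derivative of a sign quantizer by central finite differences. The function names `binarize` and `finite_difference` are hypothetical, and mapping zero to +1 is an assumed convention, not something the abstract specifies.

```python
def binarize(w: float) -> float:
    """Sign quantization: map a real-valued weight to +1 or -1.

    Assumption: zero is mapped to +1.
    """
    return 1.0 if w >= 0.0 else -1.0


def finite_difference(f, w: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of df/dw at w."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


# Away from zero the quantizer is flat, so the estimated derivative is zero
# and backpropagation would pass no signal to the underlying real weight.
for w in (-0.7, -0.2, 0.3, 0.9):
    print(w, binarize(w), finite_difference(binarize, w))

# At the jump the estimate grows like 1 / eps instead of settling on a value.
print(finite_difference(binarize, 0.0))
```

This is why a surrogate gradient has to be supplied in place of the true derivative; the hypernetwork-based methods described in the abstract learn that surrogate rather than fixing it by hand.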

Paper Structure

This paper contains 24 sections, 2 theorems, 38 equations, 2 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

(Convergence of FSG) Let FSG run for $t$ iterations with step size $\alpha=\frac{C}{\sqrt{t+1}}$, where $C, \Omega, \kappa, \Theta, \rho$ are positive constants, $\mathbf{x}^*$ is the optimal solution, and $\widehat{\mathbf{x}}_t=\sum_{k=0}^t\mathbf{x}_k/(t+1)$ is the average of the iterates.
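The two quantities named in the theorem, the step size $\alpha=C/\sqrt{t+1}$ and the averaged iterate $\widehat{\mathbf{x}}_t$, can be illustrated on a toy problem. The sketch below uses plain gradient descent on a one-dimensional quadratic with optimum $x^*=3$; it is not the FSG update itself, and the objective, `C = 1.0`, and the function names are assumptions chosen only for illustration.

```python
import math


def averaged_iterate(grad, x0: float, t: int, C: float = 1.0) -> float:
    """Run t gradient steps with fixed step size C / sqrt(t + 1) and
    return the average of the iterates x_0, ..., x_t."""
    alpha = C / math.sqrt(t + 1)
    x = x0
    total = x0  # running sum of x_0 .. x_k
    for _ in range(t):
        x = x - alpha * grad(x)
        total += x
    return total / (t + 1)


X_STAR = 3.0  # optimum of the toy objective below


def f(x: float) -> float:
    """Toy convex objective f(x) = 0.5 * (x - 3)^2 (an assumption)."""
    return 0.5 * (x - X_STAR) ** 2


def grad_f(x: float) -> float:
    return x - X_STAR


# The objective gap at the averaged iterate shrinks as t grows.
for t in (9, 99, 999):
    x_hat = averaged_iterate(grad_f, x0=0.0, t=t)
    print(t, x_hat, f(x_hat) - f(X_STAR))
```

Note that the step size depends on the total iteration budget $t$, so it is constant within a run and smaller for longer runs.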

Figures (2)

  • Figure 1: Fast and Slow Gradient Generation Illustration, using ResNet as an example. During backpropagation, the weight gradients from the previous iteration step are fed into the HGS and the fast-net. The fast-net uses an MLP to learn the scale of the weights, thereby obtaining the fast grad. The slow-net receives the historical gradient sequence from the HGS, prepends an LRE vector to the sequence, and then uses a Mamba block to generate the slow grad. Finally, the slow grad and fast grad are combined through a weighted sum to generate the final gradients, which replace the non-differentiable parts (indicated by the blue dashed arrows). The forward process is explained in the section "Training of FSG".
  • Figure 2: (a) Loss Curve of ResNet44 on CIFAR-10 Dataset with SGD Optimizer. (b) Loss Curve of ResNet44 on CIFAR-10 Dataset with Adam Optimizer. (c) Accuracy by Hyperparameter Beta. (d) Accuracy by Hyperparameter Length.
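The data flow described in the Figure 1 caption can be sketched as follows. This is only an illustration under strong assumptions: a fixed scalar scale stands in for the fast-net MLP, an exponential moving average stands in for the slow-net Mamba block, and the convex-combination form of the weighted sum is a guess. All names and parameters (`HistoricalGradientStorage`, `scale`, `decay`, `mix`) are hypothetical.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]


class HistoricalGradientStorage:
    """Fixed-length FIFO of past gradient vectors (hypothetical HGS stand-in)."""

    def __init__(self, length: int) -> None:
        self.buffer: Deque[Vector] = deque(maxlen=length)

    def push(self, grad: Vector) -> None:
        self.buffer.append(list(grad))

    def sequence(self) -> List[Vector]:
        return list(self.buffer)  # oldest first


def fast_grad(prev_grad: Vector, scale: float = 0.5) -> Vector:
    """Stand-in for fast-net: a fixed scalar scale, not a learned MLP."""
    return [scale * g for g in prev_grad]


def slow_grad(history: List[Vector], layer_embedding: Vector,
              decay: float = 0.9) -> Vector:
    """Stand-in for slow-net: an exponential moving average, not a Mamba block.

    The layer embedding sits at the front of the sequence, so it acts as the
    initial state that the historical gradients are blended into.
    """
    state = list(layer_embedding)
    for grad in history:
        state = [decay * s + (1.0 - decay) * g for s, g in zip(state, grad)]
    return state


def combine(fast: Vector, slow: Vector, mix: float = 0.5) -> Vector:
    """Weighted sum of the two estimates (assumed convex-combination form)."""
    return [mix * s + (1.0 - mix) * f for f, s in zip(fast, slow)]


hgs = HistoricalGradientStorage(length=3)
for g in ([1.0, -1.0], [0.5, -0.5], [0.25, 0.0], [0.0, 0.5]):
    hgs.push(g)

history = hgs.sequence()  # the oldest gradient has been dropped
final = combine(fast_grad(history[-1]),
                slow_grad(history, layer_embedding=[0.0, 0.0]))
print(history, final)
```

In this stand-in, the fast path reacts only to the most recent gradient while the slow path smooths over the stored sequence, which mirrors the division of labor the caption describes without reproducing the learned components.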

Theorems & Definitions (5)

  • Theorem 1
  • Remark
  • Lemma 2
  • proof
  • proof