OvSW: Overcoming Silent Weights for Accurate Binary Neural Networks

Jingyang Xiang; Zuohui Chen; Siqi Li; Qing Wu; Yong Liu

TL;DR

OvSW tackles silent weights in binary neural networks by proving that gradient updates are largely independent of latent weight distributions, which causes many weights to stop flipping signs during training. It introduces Adaptive Gradient Scaling (AGS) and Silence Awareness Decaying (SAD) to align gradient updates with weight distributions and to penalize persistently silent weights, respectively. Empirically, OvSW achieves state-of-the-art 1-bit accuracy on CIFAR10 and ImageNet1K across several architectures, including 61.6% top-1 on ResNet18 and 65.5% on ResNet34 on ImageNet1K, with further gains in two-step training. The approach is lightweight and compatible with existing BNN improvements, offering faster convergence and practical deployment on edge devices.
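
The independence claim can be illustrated with a minimal sketch. It assumes a single binary linear layer, a squared-error loss, and an identity straight-through estimator; these are illustrative choices and may differ from the estimator analyzed in the paper. The function name `latent_weight_grad` and all variables are hypothetical. Because the forward pass only sees the signs of the latent weights, rescaling the latent weights by any positive constant leaves the gradient unchanged.

```python
import numpy as np

def latent_weight_grad(latent_w, x, target):
    """Gradient of 0.5 * ||x @ sign(w) - target||^2 w.r.t. the latent weights.

    Assumption: identity straight-through estimator, i.e. the gradient
    computed for sign(w) is passed to the latent weights unchanged.
    """
    w_bin = np.where(latent_w >= 0, 1.0, -1.0)  # binarize: only signs are used
    residual = x @ w_bin - target               # forward pass and loss residual
    return x.T @ residual                       # dL/d(sign(w)), reused for latent w

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
target = rng.normal(size=(8, 3))
w = rng.normal(size=(4, 3))

# Same signs, very different latent magnitudes.
g_small = latent_weight_grad(0.01 * w, x, target)
g_large = latent_weight_grad(100.0 * w, x, target)
same_gradient = np.allclose(g_small, g_large)
print(same_gradient)
```

In this toy setting a latent weight far from zero receives an update of the same size as one close to zero, so the far one needs many more steps to cross zero and flip its sign.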

Abstract

Binary Neural Networks (BNNs) have proven highly effective for deploying deep neural networks on mobile and embedded platforms. Most existing works focus on minimizing quantization errors, improving representation ability, or designing gradient approximations to alleviate gradient mismatch in BNNs, while leaving weight sign flipping, a critical factor for achieving powerful BNNs, untouched. In this paper, we investigate the efficiency of weight sign updates in BNNs. We observe that, for vanilla BNNs, over 50% of the weights keep their signs unchanged during training, and these weights are not only distributed at the tails of the weight distribution but also universally present in the vicinity of zero. We refer to these weights as "silent weights"; they slow down convergence and lead to significant accuracy degradation. Theoretically, we reveal that this is due to the independence of the BNN gradient from the latent weight distribution. To address the issue, we propose Overcome Silent Weights (OvSW). OvSW first employs Adaptive Gradient Scaling (AGS) to establish a relationship between the gradient and the latent weight distribution, thereby improving the overall efficiency of weight sign updates. Additionally, we design Silence Awareness Decaying (SAD) to automatically identify "silent weights" by tracking the weight flipping state, and apply an additional penalty to "silent weights" to facilitate their flipping. By efficiently updating weight signs, our method achieves faster convergence and state-of-the-art performance on the CIFAR10 and ImageNet1K datasets with various architectures. For example, OvSW obtains 61.6% and 65.5% top-1 accuracy on ImageNet1K using binarized ResNet18 and ResNet34 architectures, respectively. Code is available at https://github.com/JingyangXiang/OvSW.
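
The notion of a weight that never changes sign during training can be made concrete with a small bookkeeping sketch. This is not the authors' SAD implementation: the class `FlipTracker` and its methods are hypothetical names, and the sketch only measures the per-step flip ratio and the fraction of never-flipped weights. It applies no penalty.

```python
import numpy as np

def binarize(latent_w):
    """Map latent weights to {-1, +1}; zero is treated as +1 (an assumption)."""
    return np.where(latent_w >= 0, 1, -1)

class FlipTracker:
    """Hypothetical helper that records which weights have ever flipped sign."""

    def __init__(self, latent_w):
        self.prev_sign = binarize(latent_w)
        self.ever_flipped = np.zeros(latent_w.shape, dtype=bool)

    def update(self, latent_w):
        """Call after each optimizer step; returns the flip ratio of that step."""
        sign = binarize(latent_w)
        flipped = sign != self.prev_sign
        self.ever_flipped |= flipped
        self.prev_sign = sign
        return float(flipped.mean())

    def silent_ratio(self):
        """Fraction of weights whose sign has never changed since tracking began."""
        return float((~self.ever_flipped).mean())

tracker = FlipTracker(np.array([0.5, -0.2, 0.1, -0.9]))
step1 = tracker.update(np.array([0.5, 0.3, 0.1, -0.9]))   # second weight flips
silent_after_1 = tracker.silent_ratio()
step2 = tracker.update(np.array([-0.1, 0.3, 0.1, -0.9]))  # first weight flips
silent_after_2 = tracker.silent_ratio()
print(step1, silent_after_1, step2, silent_after_2)
```

In a real training loop, `update` would be called once per layer after each optimizer step, and `silent_ratio` evaluated at the end of training would correspond to the "never update signs" percentages reported in the figure captions.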

Paper Structure (22 sections, 22 equations, 12 figures, 4 tables, 3 algorithms)

Introduction
Related Work
Background
Methodology
The Independence of the Gradient and Weight Distribution
Adaptive Gradient Scaling
Silence Awareness Decaying
Experiment
Results on CIFAR10
Results on ImageNet1K
Ablation Analysis
Loss Landscape Visualization
Deployment Efficiency
Conclusion
More Experimental Results
...and 7 more sections

Figures (12)

Figure 1: (a) and (b): Histogram of the initialized weight distribution (blue) and the weights that never update signs throughout training (orange). 54.07% and 2.03% represent the ratio of the corresponding orange area to the blue. (c) Top-1 Accuracy (solid lines) and Flip Ratio (dashed lines) for Vanilla (green) and OvSW (red). (a), (b) and the Flip Ratio in (c) are for layer4.conv2.weight.
Figure 2: Forward and backward computation graph for binary convolutional operation with quantization aware training.
Figure 3: Top-1 accuracy (mean$\pm$std) of binarized ResNet18 w.r.t. different values of $\lambda$ (a), $\sigma$ (b) and epoch (c) on CIFAR100.
Figure 4: 3D visualization of the loss surfaces of ResNet18 on CIFAR100, which is used to enable comparisons of sharpness/flatness of different methods.
Figure 5: Histogram of the initialized weight distribution (blue) and the weights that never update signs throughout training (orange) for Vanilla BNNs. 37.02%, 46.02% and 40.44% represent the ratio of the corresponding orange area to the blue.
...and 7 more figures
