BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration

Yuzong Chen, Jian Meng, Jae-sun Seo, Mohamed S. Abdelfattah

TL;DR

This work introduces Bi-directional Bit-level Sparsity (BBS), a symmetry-aware approach that prunes either zero- or one-bits in bit-serial DNN computations to achieve at least 50% bit sparsity, preserving all 8-bit quantization levels and enabling efficient binary pruning. Coupled with a hardware co-design, BitVert, the method achieves substantial model compression and up to 3.03× speedup with 2.44× energy savings across seven DNN benchmarks, while maintaining negligible accuracy loss. The authors show that balancing bit sparsity across weight groups mitigates load imbalance, and they demonstrate favorable scalability and applicability to transformer and language models, including Llama-3-8B, outperforming value-sparsity and PTQ baselines. Overall, BBS presents a practical, data-free compression pathway that leverages bit-level redundancy for efficient bit-serial acceleration in DNNs and LLMs.

Abstract

Bit-level sparsity methods skip ineffectual zero-bit operations and are typically applicable within bit-serial deep learning accelerators. This type of sparsity at the bit level is especially interesting because it is both orthogonal to and compatible with other deep neural network (DNN) efficiency methods such as quantization and pruning. In this work, we improve the practicality and efficiency of bit-level sparsity through a novel algorithmic bit-pruning, averaging, and compression method, and a co-designed efficient bit-serial hardware accelerator. On the algorithmic side, we introduce bidirectional bit sparsity (BBS). The key insight of BBS is that we can leverage bit sparsity in a symmetrical way to prune either zero-bits or one-bits. This significantly improves the load balance of bit-serial computing and guarantees the level of sparsity to be more than 50%. On top of BBS, we further propose two bit-level binary pruning methods that require no retraining and can be seamlessly applied to quantized DNNs. Combining binary pruning with a new tensor encoding scheme, BBS can both skip computation and reduce the memory footprint associated with bi-directional sparse bit columns. On the hardware side, we demonstrate the potential of BBS through BitVert, a bit-serial architecture with an efficient PE design to accelerate DNNs with low overhead, exploiting our proposed binary pruning. Evaluation on seven representative DNN models shows that our approach achieves: (1) on average 1.66$\times$ reduction in model size with negligible accuracy loss of < 0.5%; (2) up to 3.03$\times$ speedup and 2.44$\times$ energy saving compared to prior DNN accelerators.
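
The symmetry behind the key insight can be illustrated with a small sketch. It is not the authors' implementation or the BitVert datapath; it only assumes two's-complement INT8 weights and relies on the identity that, for a bit column $b$ and activations $a$, $\sum_i b_i a_i = \sum_i a_i - \sum_i (1 - b_i) a_i$. When a column holds more ones than zeros, the zeros can be processed instead, so no column ever requires more than half of its bits to be touched. The function names `bit_columns` and `bbs_dot` are hypothetical.

```python
import numpy as np

def bit_columns(weights, bits=8):
    """Return a (bits, n) array of 0/1 bit columns, LSB first (two's complement)."""
    w = np.asarray(weights, dtype=np.int64) & ((1 << bits) - 1)
    return np.array([(w >> k) & 1 for k in range(bits)], dtype=np.int64)

def bbs_dot(weights, acts, bits=8):
    """Bit-serial dot product that processes the minority bit value per column.

    Returns (dot_product, number_of_bits_actually_processed).
    Illustrative assumption: one weight group is treated as one bit vector.
    """
    acts = np.asarray(acts, dtype=np.int64)
    cols = bit_columns(weights, bits)
    n = cols.shape[1]
    act_sum = int(acts.sum())  # shared term, needed only for inverted columns
    total = 0
    processed = 0
    for k in range(bits):
        col = cols[k]
        ones = int(col.sum())
        if 2 * ones > n:
            # Mostly ones: accumulate over the zero positions and subtract
            # from the activation sum (the "one-bit skipping" direction).
            partial = act_sum - int(acts[col == 0].sum())
            processed += n - ones
        else:
            # Mostly zeros: conventional zero-bit skipping.
            partial = int(acts[col == 1].sum())
            processed += ones
        # In two's complement the most significant bit has negative weight.
        scale = -(1 << k) if k == bits - 1 else (1 << k)
        total += scale * partial
    return total, processed

if __name__ == "__main__":
    w = [-1, 127, -3, 5, 100, -128, 7, -2]
    a = [3, -4, 5, 6, -7, 8, 9, -10]
    result, processed = bbs_dot(w, a)
    print(result == int(np.dot(w, a)), processed, "of", 8 * len(w), "bits processed")
```

Because each column contributes at most half of its bits, the processed count is bounded by half of the total bit count regardless of the weight values, which is the load-balancing property the abstract refers to.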

Paper Structure

This paper contains 23 sections, 2 equations, 17 figures, 6 tables, 2 algorithms.

Figures (17)

  • Figure 1: Comparison of different model compression approaches. (a) Example of a 4-value group and the weight distribution of a ResNet-50 layer before and after PTQ. (b) Bit-sparsity enhancement by generating three zero bit columns using sign-magnitude format, achieving lower KL divergence than PTQ but still losing many quantization levels. (c) BBS generates three bi-directional sparse bit columns and is able to preserve all quantization levels of 8-bit precision, leading to much lower KL divergence.
  • Figure 2: High-level computation flow of (a) bit-parallel PE, (b) Pragmatic, (c) Bitlet, (d) BitWave.
  • Figure 3: Comparison of inherent weight value sparsity, bit sparsity and BBS (with a bit-vector size of 8) in INT8 DNNs.
  • Figure 4: Example of bit-level binary pruning with rounded column averaging to generate 4 sparse bit columns.
  • Figure 5: Example of bit-level binary pruning with zero-point shifting to generate 4 sparse bit columns.
  • ...and 12 more figures