BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration
Yuzong Chen, Jian Meng, Jae-sun Seo, Mohamed S. Abdelfattah
TL;DR
This work introduces Bi-directional Bit-level Sparsity (BBS), a symmetry-aware approach that prunes either zero- or one-bits in bit-serial DNN computations to achieve at least 50% bit sparsity, preserving all 8-bit quantization levels and enabling efficient binary pruning. Coupled with a hardware co-design, BitVert, the method achieves substantial model compression and up to 3.03× speedup with 2.44× energy savings across seven DNN benchmarks, while maintaining negligible accuracy loss. The authors show that balancing bit sparsity across weight groups mitigates load imbalance, and they demonstrate favorable scalability and applicability to transformer and language models, including Llama-3-8B, outperforming value-sparsity and PTQ baselines. Overall, BBS presents a practical, data-free compression pathway that leverages bit-level redundancy for efficient bit-serial acceleration in DNNs and LLMs.
Abstract
Bit-level sparsity methods skip ineffectual zero-bit operations and are typically applicable within bit-serial deep learning accelerators. Sparsity at the bit level is especially interesting because it is orthogonal to, and compatible with, other deep neural network (DNN) efficiency methods such as quantization and pruning. In this work, we improve the practicality and efficiency of bit-level sparsity through a novel algorithmic bit-pruning, averaging, and compression method, and a co-designed efficient bit-serial hardware accelerator. On the algorithmic side, we introduce bi-directional bit sparsity (BBS). The key insight of BBS is that bit sparsity can be exploited symmetrically, by pruning either zero-bits or one-bits. This significantly improves the load balance of bit-serial computing and guarantees a sparsity level of more than 50%. On top of BBS, we further propose two bit-level binary pruning methods that require no retraining and can be seamlessly applied to quantized DNNs. By combining binary pruning with a new tensor encoding scheme, BBS can both skip computation and reduce the memory footprint associated with bi-directional sparse bit columns. On the hardware side, we demonstrate the potential of BBS through BitVert, a bit-serial architecture with an efficient PE design that accelerates DNNs with low overhead by exploiting our proposed binary pruning. Evaluation on seven representative DNN models shows that our approach achieves: (1) on average 1.66× reduction in model size with negligible accuracy loss of < 0.5%; (2) up to 3.03× speedup and 2.44× energy saving compared to prior DNN accelerators.
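The symmetry behind pruning either zero-bits or one-bits can be illustrated with a small sketch. It is not the authors' implementation or the BitVert datapath. It relies only on the identity sum(b_i * a_i) = sum(a_i) - sum((1 - b_i) * a_i) for a bit column b, and it assumes unsigned 8-bit weights and a single weight group to keep the arithmetic simple. The function names `bit_column` and `bbs_dot` are hypothetical.

```python
def bit_column(weights, k):
    """Return bit k (0 = LSB) of each weight; weights are assumed unsigned 8-bit."""
    return [(w >> k) & 1 for w in weights]


def bbs_dot(weights, activations, n_bits=8):
    """Bit-serial dot product that touches only the minority bits of each column.

    Returns (dot_product, effectual_ops), where effectual_ops counts the
    activation accumulations actually performed across all bit columns.
    """
    assert len(weights) == len(activations)
    act_sum = sum(activations)  # computed once, reused by every inverted column
    total = 0
    effectual_ops = 0
    for k in range(n_bits):
        col = bit_column(weights, k)
        ones = sum(col)
        if 2 * ones > len(col):
            # One-bits dominate: accumulate over the zero-bits instead and
            # subtract from the activation sum (the "other direction").
            skipped = sum(a for b, a in zip(col, activations) if b == 0)
            partial = act_sum - skipped
            effectual_ops += len(col) - ones
        else:
            # Zero-bits dominate (or tie): conventional zero-bit skipping.
            partial = sum(a for b, a in zip(col, activations) if b == 1)
            effectual_ops += ones
        total += partial * (1 << k)  # weight the column by its bit significance
    return total, effectual_ops


if __name__ == "__main__":
    w = [255, 254, 3, 129]   # illustrative unsigned 8-bit weights
    a = [1, -2, 3, 4]        # illustrative integer activations
    result, ops = bbs_dot(w, a)
    reference = sum(wi * ai for wi, ai in zip(w, a))
    print(result == reference, ops, "of", 8 * len(w), "bit positions processed")
```

Because each column processes whichever bit value is in the minority, the number of accumulations per column never exceeds half the group size in this sketch, which matches the intuition behind the sparsity guarantee and the improved load balance described above. Signed (two's complement) weights, binary pruning, and the tensor encoding scheme are not modeled here.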
