Coding for Computation: Efficient Compression of Neural Networks for Reconfigurable Hardware
Hans Rosenberger, Rodrigo Fischer, Johanna S. Fröhlich, Ali Bereyhi, Ralf R. Müller
TL;DR
The paper tackles the high computational cost of deep neural networks by shifting focus from weight storage to minimizing additions in matrix-vector multiplications during inference. It proposes a hardware-oriented compression pipeline that combines pruning via group Lasso regularization, weight sharing through affinity-propagation clustering, and Linear Computation Coding (LCC) to factorize large weight matrices into tall, bit-shift-friendly components. Across MNIST-scale MLPs and a ResNet-34 trained on TinyImageNet, the approach yields 2×–3× reductions in the number of additions while preserving comparable accuracy, with the fs-LCC variant particularly effective for pruned, larger networks. The result is a practical, FPGA-friendly compression framework that enhances inference efficiency on reconfigurable hardware without significantly sacrificing predictive performance.
Abstract
As state-of-the-art neural networks (NNs) continue to grow in size, their resource-efficient implementation becomes ever more important. In this paper, we introduce a compression scheme that reduces the number of computations required for NN inference on reconfigurable hardware such as FPGAs. This is achieved by combining pruning via regularized training, weight sharing, and linear computation coding (LCC). In contrast to common NN compression techniques, whose objective is to reduce the memory used for storing the weights of the NNs, our approach is optimized to reduce the number of additions required for inference in a hardware-friendly manner. The proposed scheme achieves competitive performance for simple multilayer perceptrons as well as for large-scale deep NNs such as ResNet-34.
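To make the cost metric concrete, the following minimal sketch counts the additions needed for a matrix-vector product when every weight is replaced by a short sum of signed powers of two, so that each product becomes a bit shift in hardware. It is not the paper's LCC algorithm: the greedy expansion, the function names `pow2_terms` and `shift_add_matvec`, and the parameters `max_terms` and `min_exp` are assumptions introduced purely for illustration.

```python
import numpy as np

def pow2_terms(w, max_terms=2, min_exp=-8):
    """Greedily approximate scalar w by up to max_terms signed powers of two.

    Hypothetical helper for illustration; returns a list of (sign, exponent).
    """
    terms = []
    residual = float(w)
    for _ in range(max_terms):
        # Stop once the residual is below the finest representable step.
        if abs(residual) < 2.0 ** (min_exp - 1):
            break
        exp = max(int(np.round(np.log2(abs(residual)))), min_exp)
        sign = 1 if residual > 0 else -1
        terms.append((sign, exp))
        residual -= sign * 2.0 ** exp
    return terms

def shift_add_matvec(W, x, max_terms=2):
    """Approximate W @ x with shifts and additions; return (y, n_additions)."""
    y = np.zeros(W.shape[0])
    n_add = 0
    for i in range(W.shape[0]):
        partials = []
        for j in range(W.shape[1]):
            for sign, exp in pow2_terms(W[i, j], max_terms):
                # Scaling by 2**exp stands in for a hardware bit shift.
                partials.append(sign * x[j] * 2.0 ** exp)
        y[i] = sum(partials)
        # Summing k partial terms costs k - 1 additions; zero weights cost none.
        n_add += max(len(partials) - 1, 0)
    return y, n_add

W = np.array([[0.75, 0.0], [0.5, -3.0]])
x = np.array([4.0, 1.0])
y, n_add = shift_add_matvec(W, x)

# Zeroing a whole column (as structured pruning would) removes its terms.
W_pruned = W.copy()
W_pruned[:, 1] = 0.0
_, n_add_pruned = shift_add_matvec(W_pruned, x)
print(y, n_add, n_add_pruned)
```

In this toy case the weights are exactly representable with two terms each, so the result matches `W @ x`. Zeroing the second column lowers the count from 3 additions to 1, which illustrates why structured pruning and addition-aware coding target the same cost.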
