StruM: Structured Mixed Precision for Efficient Deep Learning Hardware Codesign
Michael Wu, Arnab Raha, Deepak A. Mathaikutty, Martin Langhammer, Engin Tunali, Daksha Sharma
TL;DR
StruM introduces structured mixed precision by partitioning weights into blocks and assigning two precision levels within each block, enabling hardware-friendly inference without retraining. It proposes two quantization strategies, DLIQ and MIP2Q, and demonstrates their effectiveness on a FlexNN-based DPU with barrel-shifter–enabled PEs, achieving substantial power and area savings while preserving accuracy. The approach reduces PE power by 31–34% and PE-array area by 23–26%, with around 10% accelerator-level power savings, and maintains less than 1% accuracy loss across diverse CNNs on ImageNet. These results indicate significant practical impact for efficient DL inference in data centers and edge devices, through co-design of structured precision and hardware.
Abstract
In this paper, we propose StruM, a novel structured mixed-precision deep learning inference method, co-designed with its associated hardware accelerator (DPU), to address the escalating computational and memory demands of deep learning workloads in data centers and edge applications. Unlike traditional approaches, our method requires neither time-consuming retraining or fine-tuning nor access to specialized hardware. By leveraging the variance in weight magnitudes within layers, we quantize the values within each block to two different precision levels, reducing 8-bit integer weights to 4-bit values (a precision reduction of up to 50%) across various Convolutional Neural Networks (CNNs) with negligible loss in inference accuracy. To demonstrate the efficiency gains of mixed precision, we implement StruM on top of our in-house FlexNN DNN accelerator [1], which supports low- and mixed-precision execution. Experimental results show that the proposed StruM-based hardware architecture achieves a 31-34% reduction in processing element (PE) power consumption and a 10% reduction in power at the accelerator level. In addition, the statically configured StruM yields a 23-26% area reduction at the PE level and 2-3% area savings at the DPU level.
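As a rough illustration of the block-wise, two-level idea described above, the following minimal Python sketch splits a flat list of int8 weights into fixed-size blocks, keeps the largest-magnitude weights in each block at 8 bits, and reduces the rest to 4 bits. It is not the paper's DLIQ or MIP2Q strategy. The function names, the block size of 8, the choice to keep the largest magnitudes at high precision, and the uniform 4-bit step of 16 are all assumptions made for illustration.

```python
import math


def quantize_block(block, n_high):
    """Two-level quantization of one block of int8 weights (hypothetical helper).

    Assumption: the n_high largest-magnitude weights stay at 8 bits; the
    others are rounded to a signed 4-bit code with a step of 16.
    Returns (dequantized_values, mask), where mask[i] == 1 marks an 8-bit slot.
    """
    # Rank positions by descending magnitude; sorted() is stable, so ties
    # are broken by position.
    order = sorted(range(len(block)), key=lambda i: -abs(block[i]))
    high = set(order[:n_high])

    values, mask = [], []
    for i, w in enumerate(block):
        if i in high:
            values.append(w)  # kept at full 8-bit precision
            mask.append(1)
        else:
            # Round half up to the nearest 4-bit code, then clamp to [-8, 7].
            q = max(-8, min(7, math.floor(w / 16 + 0.5)))
            values.append(q * 16)  # value as seen after dequantization
            mask.append(0)
    return values, mask


def quantize_weights(weights, block_size=8, n_high=4):
    """Apply quantize_block to consecutive blocks of a flat weight list."""
    values, mask = [], []
    for start in range(0, len(weights), block_size):
        v, m = quantize_block(weights[start:start + block_size], n_high)
        values.extend(v)
        mask.extend(m)
    return values, mask


def average_bits(block_size=8, n_high=4):
    """Average payload bits per weight, ignoring the per-weight mask bit."""
    return (8 * n_high + 4 * (block_size - n_high)) / block_size


if __name__ == "__main__":
    vals, mask = quantize_weights([100, -3, 50, 7, -120, 20, 9, -60])
    print(vals)
    print(mask)
    print(average_bits())
```

With `block_size=8` and `n_high=4`, half of the weights in every block are stored at 4 bits, so the structure is identical from block to block. That regularity, rather than the particular rounding rule assumed here, is what makes the scheme hardware-friendly.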
