QoS-Nets: Adaptive Approximate Neural Network Inference
Elias Trommer, Bernd Waschneck, Akash Kumar
TL;DR
QoS-Nets enables runtime quality-of-service (QoS) scaling for neural networks by selecting a small, fixed subset of approximate multipliers from a large search space and retraining the network to maximize task performance. The approach has two stages: (i) the multiplier space is constrained to at most $n$ instances by applying $k$-means clustering to layer-wise accuracy preferences, and (ii) multiple operating points are derived by scaling these preferences, so that a system can trade accuracy against resource usage at runtime. A memory-efficient fine-tuning scheme then retrains only the BatchNormalization parameters for each operating point while keeping the weights fixed and shared, which brings accuracy close to that of full retraining at minimal parameter overhead. Experiments cover ResNet variants (CIFAR-10/100) and MobileNetV2 (TinyImageNet). On MobileNetV2, three operating points built from four multiplier instances save between 15.3% and 42.8% of multiplication power at a Top-5 accuracy loss of 0.3 to 2.33 percentage points, and together increase the parameter count by only 2.75%. The work presents a hardware-agnostic framework that unifies multiplier selection, multi-point operation, and lightweight retraining to enable graceful QoS scaling in neural accelerators.
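The two stages summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each layer's accuracy preference is a single scalar error tolerance, that each multiplier is described by one error value and one relative power value, and that a simple one-dimensional Lloyd's k-means is sufficient. All names (`LAYER_TOLERANCE`, `CATALOG`, `kmeans_1d`, `pick_multiplier`, `select_subset`, `assign`) and all numbers are hypothetical.

```python
# Hypothetical inputs: one error tolerance per layer and a multiplier catalog
# of (name, error, relative_power). None of these values come from the paper.
LAYER_TOLERANCE = [0.011, 0.013, 0.052, 0.056, 0.21, 0.22]
CATALOG = [
    ("exact", 0.0, 1.0),
    ("mulA", 0.01, 0.8),
    ("mulB", 0.05, 0.6),
    ("mulD", 0.1, 0.5),
    ("mulC", 0.2, 0.4),
]


def kmeans_1d(values, n, iters=100):
    """Lloyd's k-means on scalars with deterministic, evenly spaced seeds."""
    uniq = sorted(set(values))
    centroids = [uniq[i * (len(uniq) - 1) // max(n - 1, 1)] for i in range(n)]
    for _ in range(iters):
        clusters = [[] for _ in range(n)]
        for v in values:
            nearest = min(range(n), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def pick_multiplier(candidates, tolerance):
    """Lowest-power candidate within tolerance, else the most accurate one."""
    feasible = [m for m in candidates if m[1] <= tolerance]
    if not feasible:
        return min(candidates, key=lambda m: m[1])
    return min(feasible, key=lambda m: m[2])


def select_subset(layer_tol, catalog, n):
    """Stage (i): reduce the catalog to at most n multiplier instances."""
    subset = []
    for centroid in kmeans_1d(layer_tol, n):
        choice = pick_multiplier(catalog, centroid)
        if choice not in subset:
            subset.append(choice)
    return subset


def assign(layer_tol, subset, scale):
    """Stage (ii): scale the preferences and map each layer to the subset."""
    return [pick_multiplier(subset, t * scale)[0] for t in layer_tol]


SELECTED = select_subset(LAYER_TOLERANCE, CATALOG, n=3)
# Two illustrative operating points: a tighter one (0.5) and the baseline (1.0).
POINTS = {scale: assign(LAYER_TOLERANCE, SELECTED, scale) for scale in (0.5, 1.0)}
```

In this toy setup, both operating points draw from the same selected instances; only the layer-to-multiplier assignment changes, which is what makes switching at runtime cheap.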
Abstract
To vary the arithmetic resource consumption of neural network applications at runtime, this work proposes the flexible reuse of approximate multipliers for neural network layer computations. We introduce a search algorithm that chooses an appropriate subset of approximate multipliers of a user-defined size from a larger search space and enables retraining to maximize task performance. Unlike previous work, our approach can output more than a single, static assignment of approximate multiplier instances to layers. These different operating points allow a system to gradually adapt its Quality of Service (QoS) to changing environmental conditions by increasing or decreasing its accuracy and resource consumption. QoS-Nets achieves this by reassigning the selected approximate multiplier instances to layers at runtime. To combine multiple operating points with the use of retraining, we propose a fine-tuning scheme that shares the majority of parameters between operating points, with only a small number of additional parameters required per operating point. In our evaluation on MobileNetV2, QoS-Nets is used to select four approximate multiplier instances for three different operating points. These operating points result in power savings for multiplications between 15.3% and 42.8% at a Top-5 accuracy loss between 0.3 and 2.33 percentage points. Through our fine-tuning scheme, all three operating points together increase the model's parameter count by only 2.75%.
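The parameter sharing between operating points can be pictured as a data structure in which the weights are stored once and each operating point carries only its own multiplier assignment and a small private parameter set. The sketch below is a hypothetical layout, not the paper's code: the class names `OperatingPoint` and `SharedModel`, the function `parameter_overhead`, and the layer sizes are invented for illustration. It assumes the private parameters are per-channel scale and shift values (as in normalization layers), omits running statistics, and counts every parameter set beyond the first as overhead.

```python
from dataclasses import dataclass, field


@dataclass
class OperatingPoint:
    """State private to one operating point (hypothetical layout)."""
    multipliers: dict  # layer name -> multiplier name
    bn_params: dict    # layer name -> (gamma list, beta list)


@dataclass
class SharedModel:
    weights: dict                               # layer name -> weights, stored once
    points: dict = field(default_factory=dict)  # point name -> OperatingPoint
    active: str = ""

    def switch(self, name):
        """Change the operating point; the shared weights are left untouched."""
        if name not in self.points:
            raise KeyError(name)
        self.active = name

    def layer_config(self, layer):
        """Return what a layer needs under the currently active point."""
        point = self.points[self.active]
        gamma, beta = point.bn_params[layer]
        return self.weights[layer], point.multipliers[layer], gamma, beta


def parameter_overhead(model):
    """Extra parameters of all points beyond the first, relative to one point."""
    shared = sum(len(w) for w in model.weights.values())
    per_point = [
        sum(len(g) + len(b) for g, b in p.bn_params.values())
        for p in model.points.values()
    ]
    baseline = shared + per_point[0]
    return sum(per_point[1:]) / baseline


def make_point(mult_a, mult_b, channels=4):
    # Illustrative sizes only: two layers with four channels each.
    bn = {name: ([1.0] * channels, [0.0] * channels) for name in ("conv1", "conv2")}
    return OperatingPoint({"conv1": mult_a, "conv2": mult_b}, bn)


model = SharedModel(weights={"conv1": [0.0] * 92, "conv2": [0.0] * 92})
model.points = {
    "high": make_point("mulA", "mulA"),
    "mid": make_point("mulA", "mulB"),
    "low": make_point("mulB", "mulC"),
}
model.switch("low")
overhead = parameter_overhead(model)  # toy figure, unrelated to the reported 2.75%
```

Because the private sets scale with the number of channels rather than the number of weights, the relative overhead shrinks as layers grow, which is consistent with the small parameter increase reported above.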
