Concurrent Self-testing of Neural Networks Using Uncertainty Fingerprint
Soyed Tuhin Ahmed, Mehdi B. Tahoori
TL;DR
The paper addresses reliable online operation of neural networks (NNs) deployed on hardware accelerators (NN-HAs), targeting faults in the memory elements that store weights and activations. It introduces an uncertainty fingerprint, produced by a dedicated uncertainty head in a dual-head network, together with a two-stage training objective that aligns fault-free fingerprints around unity, so that faults can be detected online in a single pass through boundary checks. Compared with pause-and-test and other concurrent testing approaches, the method achieves high fault coverage, low false-positive rates, and minimal overhead, as demonstrated across multiple CNN architectures and datasets under various fault models. It offers a practical, scalable mechanism for concurrent self-testing in safety-critical NN-HAs, with possible extensions such as contrastive losses and deeper uncertainty heads to further improve robustness.
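The boundary check mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `calibrate_bounds` and `is_faulty`, the min/max calibration rule, and the `margin` parameter are all assumptions made for illustration, and the fingerprint values are made-up numbers clustered around one.

```python
import numpy as np

def calibrate_bounds(fault_free_fingerprints, margin=0.0):
    """Derive an acceptance interval [low, high] from fault-free fingerprints.

    Assumption: the interval is the observed min/max widened by `margin`.
    The paper may use a different calibration rule.
    """
    f = np.asarray(fault_free_fingerprints, dtype=float)
    return float(f.min() - margin), float(f.max() + margin)

def is_faulty(fingerprint, low, high):
    """Flag a fault when the online fingerprint leaves the calibrated interval."""
    return not (low <= fingerprint <= high)

# Hypothetical fault-free fingerprints, concentrated around unity.
clean = [0.97, 1.00, 1.02, 0.99, 1.03]
low, high = calibrate_bounds(clean, margin=0.01)

print(is_faulty(1.00, low, high))  # inside the interval
print(is_faulty(1.50, low, high))  # outside the interval
```

Because the check is a pair of comparisons on a single scalar, it adds almost nothing to the cost of an inference, which is consistent with the low-overhead claim. A wider `margin` trades fewer false positives for lower sensitivity to small deviations.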
Abstract
Neural networks (NNs) are increasingly used in always-on safety-critical applications deployed on hardware accelerators (NN-HAs) employing various memory technologies. Reliable continuous operation of NNs is essential for safety-critical applications. During online operation, NNs are susceptible to single and multiple permanent and soft errors due to factors such as radiation, aging, and thermal effects. Explicit NN-HA testing methods cannot detect transient faults during inference, are unsuitable for always-on applications, and require extensive test vector generation and storage. Therefore, in this paper, we propose the \emph{uncertainty fingerprint} approach, which represents the online fault status of the NN. Furthermore, we propose a dual-head NN topology specifically designed to produce the uncertainty fingerprint and the primary prediction of the NN in \emph{a single shot}. During online operation, by matching the uncertainty fingerprint, we can concurrently self-test NNs with up to $100\%$ coverage and a low false-positive rate while maintaining similar performance on the primary task. Compared to existing works, memory overhead is reduced by up to $243.7$ MB, the number of multiply-and-accumulate (MAC) operations is reduced by up to $10000\times$, and false-positive rates are reduced by up to $89\%$.
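To make the single-shot, dual-head idea concrete, here is a toy sketch in NumPy. It is an illustration under stated assumptions, not the architecture from the paper: the function `dual_head_forward`, the parameter names `W_backbone`, `W_pred`, and `W_unc`, the layer sizes, and the way the fingerprint is reduced to a scalar are all hypothetical. The weights are hand-picked so that the fault-free fingerprint equals one, standing in for what the training objective is described as achieving.

```python
import numpy as np

def dual_head_forward(x, params):
    """One forward pass that returns both the prediction and the fingerprint.

    Assumption: both heads read the same backbone features, so a corrupted
    backbone weight perturbs the fingerprint as well as the logits.
    """
    h = np.maximum(0.0, x @ params["W_backbone"])      # shared features (ReLU)
    logits = h @ params["W_pred"]                      # primary prediction head
    fingerprint = float(np.mean(h @ params["W_unc"]))  # hypothetical uncertainty head
    return logits, fingerprint

# Hand-picked toy weights (hypothetical) giving a fault-free fingerprint of 1.0.
params = {
    "W_backbone": np.eye(4),
    "W_pred": np.arange(12, dtype=float).reshape(4, 3),
    "W_unc": np.full((4, 1), 0.25),
}
x = np.ones(4)

logits_clean, fp_clean = dual_head_forward(x, params)

# Emulate a memory fault by corrupting one stored backbone weight.
faulty = {k: v.copy() for k, v in params.items()}
faulty["W_backbone"][0, 0] = 64.0
logits_faulty, fp_faulty = dual_head_forward(x, faulty)

print(fp_clean, fp_faulty)
```

In this toy setup the corrupted weight pushes the fingerprint far from one, so comparing it against bounds calibrated on fault-free data would flag the fault without a separate test pass or stored test vectors.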
