Stochastic Multivariate Universal-Radix Finite-State Machine: a Theoretically and Practically Elegant Nonlinear Function Approximator
Xincheng Feng, Guodong Shen, Jianhao Hu, Meng Li, Ngai Wong
TL;DR
This work addresses the hardware burden of nonlinear function computation in AI by introducing SMURF, a stochastic multivariate universal-radix FSM that uses stochastic computing to approximate multivariate nonlinear functions with low area and power. It derives the FSM's steady-state probabilities and a convex-optimization-based weight tuning procedure for univariate and multivariate targets, and demonstrates accurate approximations of functions such as Euclidean distance, the Hartley transform, and softmax, as well as their integration into a CNN. Across software and FPGA benchmarks, SMURF achieves accuracy comparable to conventional methods while requiring about $16.07\%$ of the area and $14.45\%$ of the power of a Taylor-series approximation, and about $2.22\%$ of the area of LUT-based schemes, highlighting strong potential for energy-efficient edge AI. The results substantiate SMURF as a versatile, hardware-friendly nonlinear function engine that can produce multiple outputs from a single architecture with configurable parameters.
Abstract
Nonlinearities are crucial for capturing complex input-output relationships, especially in deep neural networks. However, nonlinear functions often incur substantial hardware and compute overheads. Stochastic computing (SC) has emerged as a promising approach to this challenge, trading output precision for hardware simplicity. To this end, this paper proposes a first-of-its-kind stochastic multivariate universal-radix finite-state machine (SMURF) that harnesses SC to generate multivariate nonlinear functions with simple hardware and high accuracy. We present the finite-state machine (FSM) architecture for SMURF, as well as analytical derivations of the sampling gate coefficients for accurately approximating generic nonlinear functions. Experiments demonstrate the superiority of SMURF, which requires only 16.07% of the area and 14.45% of the power of a Taylor-series approximation, and merely 2.22% of the area of look-up table (LUT) schemes.
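To make the idea of an SC finite-state machine with sampling coefficients concrete, the following is a minimal sketch of the classic univariate case, not the SMURF architecture itself. It assumes a saturating up/down counter driven by a bitstream whose bits are 1 with probability x, and one output probability per state. The names `steady_state`, `expected_output`, `simulate`, and `weights` are hypothetical and chosen for illustration only.

```python
import random


def steady_state(x, n_states):
    """Stationary distribution of an n-state saturating up/down counter.

    Assumption: the counter moves up with probability x and down with
    probability 1 - x, saturating at both ends (a birth-death chain).
    """
    if x <= 0.0:
        return [1.0] + [0.0] * (n_states - 1)
    if x >= 1.0:
        return [0.0] * (n_states - 1) + [1.0]
    t = x / (1.0 - x)  # ratio between neighbouring state probabilities
    powers = [t ** i for i in range(n_states)]
    total = sum(powers)
    return [p / total for p in powers]


def expected_output(x, weights):
    """Expected output value: state probabilities weighted by the
    per-state output probabilities (hypothetical 'weights' in [0, 1])."""
    pi = steady_state(x, len(weights))
    return sum(p * w for p, w in zip(pi, weights))


def simulate(x, weights, length, seed=0):
    """Bit-level simulation: feed a random bitstream into the counter and
    emit a 1 with probability weights[state] at every clock cycle."""
    rng = random.Random(seed)
    n_states = len(weights)
    state = n_states // 2
    ones = 0
    for _ in range(length):
        if rng.random() < x:
            state = min(state + 1, n_states - 1)
        else:
            state = max(state - 1, 0)
        ones += int(rng.random() < weights[state])
    return ones / length


if __name__ == "__main__":
    w = [0.0, 0.0, 1.0, 1.0]  # threshold-like weights, illustrative only
    for x in (0.2, 0.5, 0.8):
        print(x, expected_output(x, w), simulate(x, w, 20000))
```

Under these assumptions, the expected output is linear in the per-state weights for any fixed x. Fitting the weights to a target function over a grid of x values is therefore a bounded least-squares problem, which is one way a convex weight-tuning step could be set up.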
