TsetlinKWS: A 65nm 16.58uW, 0.63mm2 State-Driven Convolutional Tsetlin Machine-Based Accelerator For Keyword Spotting
Baizhou Lin, Yuetong Fang, Renjing Xu, Rishad Shafik, Jagmohan Chauhan
TL;DR
The paper tackles the challenge of achieving competitive keyword spotting (KWS) performance on ultra-low-power edge devices using a Convolutional Tsetlin Machine (CTM). It introduces a hardware-algorithm co-design comprising an MFSC-SF feature extractor, an Optimized Grouped Block-Compressed Sparse Row (OG-BCSR) compression scheme, and a state-driven accelerator architecture tailored for sparse CTMs. The integrated TsetlinKWS system, implemented in 65 nm CMOS, delivers 87.35% accuracy on a 12-keyword task while consuming only 16.58 µW at 0.7 V and occupying 0.63 mm², with 907k logic operations per inference and an architecture that exploits both data reuse and model sparsity. The work demonstrates that CTMs can combine high energy efficiency with accuracy competitive with neural networks on edge speech tasks, and it outlines practical pathways to further memory and robustness optimizations. Overall, the framework advances ultra-low-power, on-device KWS by aligning algorithmic sparsity with hardware reuse and scheduling strategies for CTMs.
Abstract
The Tsetlin Machine (TM) has recently attracted attention as a low-power alternative to neural networks due to its simple and interpretable inference mechanisms. However, its performance on speech-related tasks remains limited. This paper proposes TsetlinKWS, the first algorithm-hardware co-design framework for the Convolutional Tsetlin Machine (CTM) on the 12-keyword spotting task. First, we introduce a novel Mel-Frequency Spectral Coefficient and Spectral Flux (MFSC-SF) feature extraction scheme together with spectral convolution, enabling the CTM to reach a competitive accuracy of 87.35% on the 12-keyword spotting task for the first time. Second, we develop an Optimized Grouped Block-Compressed Sparse Row (OG-BCSR) algorithm that achieves a 9.84$\times$ reduction in model size, significantly improving the storage efficiency of CTMs. Finally, we propose a state-driven architecture tailored for the CTM, which simultaneously exploits data reuse and sparsity to achieve high energy efficiency. The full system is evaluated in a 65 nm process technology, consuming 16.58 $\mu$W at 0.7 V with a compact 0.63 mm$^2$ core area. TsetlinKWS requires only 907k logic operations per inference, a 10$\times$ reduction compared to state-of-the-art KWS accelerators, positioning the CTM as a highly efficient candidate for ultra-low-power speech applications.
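To make the "logic operations" and sparsity claims more concrete, the following is a minimal sketch of standard CTM inference, not the paper's accelerator datapath or its OG-BCSR format. It assumes each clause is stored as a sparse list of included literal indices, that a clause is the AND of its included literals on one patch, and that the convolutional clause output is the OR over all patches. The function names, the toy patches, and the convention that an empty clause outputs 0 at inference are illustrative assumptions.

```python
from typing import List, Sequence


def literals(patch: Sequence[int]) -> List[int]:
    """Build the literal vector [x_0..x_{n-1}, NOT x_0..NOT x_{n-1}] for a binary patch."""
    return list(patch) + [1 - bit for bit in patch]


def clause_fires(include: Sequence[int], patch: Sequence[int]) -> bool:
    """AND over the included literals only; excluded literals are never read."""
    if not include:
        # Assumption: empty clauses output 0 at inference (a common TM convention).
        return False
    lits = literals(patch)
    return all(lits[i] == 1 for i in include)


def ctm_clause_output(include: Sequence[int], patches: Sequence[Sequence[int]]) -> int:
    """Convolutional clause output: OR of the clause evaluated on every patch."""
    return int(any(clause_fires(include, p) for p in patches))


def class_sum(
    clauses: Sequence[Sequence[int]],
    polarities: Sequence[int],
    patches: Sequence[Sequence[int]],
) -> int:
    """Vote count for one class: positive clauses add 1, negative clauses subtract 1."""
    return sum(
        pol * ctm_clause_output(inc, patches)
        for inc, pol in zip(clauses, polarities)
    )


# Toy input: two 3-bit patches, so literal indices 0..2 are x_i and 3..5 are NOT x_i.
patches = [[1, 0, 1], [0, 1, 1]]
clauses = [[0, 2], [1, 3], [4]]  # sparse include lists (hypothetical values)
polarities = [+1, +1, -1]

print(class_sum(clauses, polarities, patches))
```

Because only included literals are visited, the work per clause scales with the number of includes rather than with the full literal count, which is the property a sparsity-aware storage format and accelerator can exploit.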
