TsetlinKWS: A 65nm 16.58uW, 0.63mm2 State-Driven Convolutional Tsetlin Machine-Based Accelerator For Keyword Spotting

Baizhou Lin, Yuetong Fang, Renjing Xu, Rishad Shafik, Jagmohan Chauhan

TL;DR

The paper tackles the challenge of achieving competitive keyword spotting (KWS) performance on ultra-low-power edge devices using a Convolutional Tsetlin Machine (CTM). It introduces a hardware-algorithm co-design with three parts: an MFSC-SF feature extractor, an Optimized Grouped Block-Compressed Sparse Row (OG-BCSR) compression scheme, and a state-driven accelerator architecture tailored for sparse CTMs. The integrated TsetlinKWS system is implemented in 65 nm CMOS. It delivers 87.35% accuracy on a 12-keyword task while consuming only 16.58 µW at 0.7 V and occupying 0.63 mm². It also needs 907k logic operations per inference and achieves substantially higher sparsity utilization. The work shows that CTMs can reach competitive energy efficiency with near-neural-network accuracy on edge speech tasks. It also outlines practical pathways to further memory and robustness optimizations. Overall, the framework advances ultra-low-power, on-device KWS by aligning algorithmic sparsity with the hardware's data-reuse and scheduling strategies for CTMs.
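
The MFSC-SF extractor pairs Mel-frequency spectral coefficients (MFSC, a term commonly used for log mel-band energies, i.e., MFCC without the final DCT) with spectral flux. This summary gives neither the paper's exact flux formula nor its booleanization. The sketch below therefore assumes the half-wave rectified frame-to-frame difference, one common flux variant. It also assumes a hypothetical thermometer encoding that turns each value into Boolean bits a Tsetlin Machine can consume. All function names and thresholds are illustrative and are not the TsetlinKWS pipeline.

```python
from typing import List

Frame = List[float]  # one frame of MFSC values (e.g. log mel-band energies)


def spectral_flux(frames: List[Frame], rectify: bool = True) -> List[float]:
    """Per-frame spectral flux: summed change of each band vs. the previous frame.

    With rectify=True only energy increases are counted (half-wave rectified
    flux, one common variant). Frame 0 has no predecessor and gets 0.0.
    """
    flux = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diffs = [c - p for p, c in zip(prev, cur)]
        if rectify:
            diffs = [max(0.0, d) for d in diffs]
        flux.append(sum(diffs))
    return flux


def thermometer(value: float, thresholds: List[float]) -> List[int]:
    """Hypothetical thermometer encoding: bit i is 1 if value exceeds thresholds[i]."""
    return [int(value > t) for t in thresholds]


def frame_features(frames: List[Frame], thresholds: List[float]) -> List[List[int]]:
    """Boolean feature vector per frame: encoded MFSC bands followed by encoded flux."""
    flux = spectral_flux(frames)
    out = []
    for frame, f in zip(frames, flux):
        bits: List[int] = []
        for v in frame + [f]:
            bits.extend(thermometer(v, thresholds))
        out.append(bits)
    return out
```

For example, `frame_features([[1.0, 2.0], [3.0, 1.0]], [1.5])` yields one Boolean row per frame. Each row holds the two band values and the flux, each encoded against a single threshold.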

Abstract

The Tsetlin Machine (TM) has recently attracted attention as a low-power alternative to neural networks due to its simple and interpretable inference mechanism. However, its performance on speech-related tasks remains limited. This paper proposes TsetlinKWS, the first algorithm-hardware co-design framework for the Convolutional Tsetlin Machine (CTM) on the 12-keyword spotting task. First, we introduce a novel Mel-Frequency Spectral Coefficient and Spectral Flux (MFSC-SF) feature extraction scheme together with spectral convolution. This enables the CTM to reach, for the first time, a competitive accuracy of 87.35% on the 12-keyword spotting task. Second, we develop an Optimized Grouped Block-Compressed Sparse Row (OG-BCSR) algorithm that reduces model size by 9.84×, significantly improving storage efficiency for CTMs. Finally, we propose a state-driven architecture tailored for the CTM, which simultaneously exploits data reuse and sparsity to achieve high energy efficiency. The full system is evaluated in a 65 nm process technology, consuming 16.58 µW at 0.7 V with a compact 0.63 mm² core area. TsetlinKWS requires only 907k logic operations per inference, a 10× reduction compared to state-of-the-art KWS accelerators. This positions the CTM as a highly efficient candidate for ultra-low-power speech applications.
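
As its name suggests, OG-BCSR builds on block-compressed sparse row (BCSR) storage. Its grouping and optimization steps are not described in this summary. The sketch below therefore shows only plain BCSR, applied to a hypothetical Boolean include mask (rows = clauses, columns = literals). It stores only the blocks that contain at least one included literal, along with block-column indices and block-row pointers. The index widths in `bcsr_bits` are illustrative parameters, not values from the paper.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def to_bcsr(mat: Matrix, br: int, bc: int) -> Tuple[List[int], List[int], List[Matrix]]:
    """Plain BCSR: keep only br x bc blocks that contain a 1.

    Blocks of block-row bi are blocks[row_ptr[bi]:row_ptr[bi + 1]], located
    at block columns col_idx[row_ptr[bi]:row_ptr[bi + 1]].
    """
    rows, cols = len(mat), len(mat[0])
    if rows % br or cols % bc:
        raise ValueError("matrix dims must be multiples of the block size")
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(rows // br):
        for bj in range(cols // bc):
            block = [mat[bi * br + i][bj * bc:(bj + 1) * bc] for i in range(br)]
            if any(any(r) for r in block):  # skip all-zero blocks
                col_idx.append(bj)
                blocks.append(block)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def from_bcsr(row_ptr, col_idx, blocks, rows, cols, br, bc) -> Matrix:
    """Rebuild the dense mask (used to check that compression is lossless)."""
    mat = [[0] * cols for _ in range(rows)]
    for bi in range(len(row_ptr) - 1):
        for p in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[p]
            for i in range(br):
                for j in range(bc):
                    mat[bi * br + i][bj * bc + j] = blocks[p][i][j]
    return mat


def bcsr_bits(row_ptr, col_idx, blocks, br, bc, idx_bits, ptr_bits) -> int:
    """Storage estimate: 1 bit per stored mask entry plus index overhead.

    idx_bits and ptr_bits are hypothetical index widths chosen by the caller.
    """
    return len(blocks) * br * bc + len(col_idx) * idx_bits + len(row_ptr) * ptr_bits
```

The compression ratio (dense `rows * cols` bits divided by `bcsr_bits`) depends on block size, index widths, and how clustered the included literals are. A very sparse mask drops most blocks, while scattered nonzeros make per-block overhead dominate.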

Paper Structure

This paper contains 26 sections, 17 figures, 1 table.

Figures (17)

  • Figure 1: NN-based KWS accelerator categories. (a) Audio signal. (b) Batch-based accelerators, represented by convolutional neural networks [liu20233, lin202316, liu202022nm, bernardo2020ultratrail], perform inference in a multi-frame manner because their neurons are stateless. (c) Frame-based accelerators, represented by recurrent neural networks [chong20212, frenkel2022reckon, chen2024deltakws, giraldo2020vocell], process only one frame of features per inference.
  • Figure 2: The structure of the Tsetlin Machine. $X_k$ refers to the Boolean input. The included and excluded Tsetlin automata (TAs) determine the matching pattern of clause $C_i^j$.
  • Figure 3: Two common energy-efficient edge accelerator architectures. (a) Data-driven architecture: efficient for data reuse but inflexible for sparse acceleration. (b) Event-driven architecture: efficient for sparse acceleration but limited in data reuse.
  • Figure 4: The system architecture of TsetlinKWS.
  • Figure 5: Feature extraction process: (a) Traditional MFCC flow. (b) The proposed MFSC-SF flow, which adds an overlap-difference step to compute spectral flux and removes some of the traditional flow's sub-modules.
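
Figure 2 describes how included and excluded TAs define a clause. The following is a minimal sketch of standard TM and CTM inference under these assumptions:

- Literals are the inputs $X_k$ and their negations.
- A clause outputs the AND of its included literals.
- A CTM clause fires if it matches at any patch position.
- Positive-polarity clauses vote +1 and negative-polarity clauses vote −1.

The clause representation (a list of included-literal indices), the function names, and the convention that an empty clause outputs 0 are hypothetical illustrations, not the TsetlinKWS datapath. The loop reads only included literals, which shows why a sparse include mask reduces the work per inference.

```python
from typing import List, Sequence, Tuple

Clause = List[int]  # indices of included literals (TAs in the "include" state)


def literal(x: Sequence[int], k: int) -> int:
    """Literal k: x[k] for k < n, and NOT x[k - n] for k >= n (n = len(x))."""
    n = len(x)
    return x[k] if k < n else 1 - x[k - n]


def eval_clause(x: Sequence[int], clause: Clause) -> int:
    """AND of the included literals only; excluded literals are never read.

    An empty clause outputs 0 here (a common inference convention, assumed).
    """
    if not clause:
        return 0
    for k in clause:
        if literal(x, k) == 0:
            return 0  # one false literal falsifies the conjunction
    return 1


def eval_conv_clause(patches: List[Sequence[int]], clause: Clause) -> int:
    """Convolutional TM: a clause fires if it matches at any patch position."""
    return int(any(eval_clause(p, clause) for p in patches))


def class_sum(patches, pos: List[Clause], neg: List[Clause]) -> int:
    """Positive-polarity clauses vote +1, negative-polarity clauses vote -1."""
    return (sum(eval_conv_clause(patches, c) for c in pos)
            - sum(eval_conv_clause(patches, c) for c in neg))


def predict(patches, classes: List[Tuple[List[Clause], List[Clause]]]) -> int:
    """Return the index of the class with the largest vote sum (first on ties)."""
    sums = [class_sum(patches, pos, neg) for pos, neg in classes]
    return max(range(len(sums)), key=sums.__getitem__)
```

With three inputs per patch, clause `[0, 4]` encodes $X_0 \land \lnot X_1$: index 4 maps to the negation of input 1.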