Runtime Tunable Tsetlin Machines for Edge Inference on eFPGAs

Tousif Rahman, Gang Mao, Bob Pattison, Sidharth Maheshwari, Marcos Sartori, Adrian Wheeldon, Rishad Shafik, Alex Yakovlev

TL;DR

This work addresses edge deployment of machine learning on resource-constrained eFPGAs by leveraging a bitwise, highly compressed Tsetlin Machine (TM). It proposes a runtime-tunable accelerator that uses Include-action compression to minimize LUT/FF usage and fit models in BRAM, enabling on-field recalibration without resynthesis. The design supports multiple configurations (base, single-core, multi-core) and header-based runtime reconfiguration via AXI streaming memory, achieving substantial energy savings and adaptability compared with MCU-based software and prior FPGA approaches like MATADOR. The results demonstrate practical edge deployments for small datasets (e.g., MNIST, CIFAR-2, KWS) with strong energy efficiency and the flexibility to recalibrate in the field when data drift occurs.

Abstract

Embedded Field-Programmable Gate Arrays (eFPGAs) allow for the design of hardware accelerators for edge Machine Learning (ML) applications at a lower power budget than traditional FPGA platforms. However, the limited eFPGA logic and memory significantly constrain compute capabilities and model size. As such, ML application deployment on eFPGAs is in direct contrast with the most recent FPGA approaches, which develop architecture-specific implementations and maximize throughput over resource frugality. This paper focuses on the opposite side of this trade-off: the proposed eFPGA accelerator prioritizes minimal resource usage and flexibility for on-field recalibration over throughput. This allows for runtime changes in model size, architecture, and input data dimensionality without offline resynthesis. This is made possible through the use of a bitwise compressed inference architecture of the Tsetlin Machine (TM) algorithm. TM compute does not require any multiplication operations, being limited to bitwise AND, OR, NOT, summations, and additions. Additionally, TM model compression allows the entire model to fit within the on-chip block RAM of the eFPGA. The paper uses this accelerator to propose a strategy for runtime model tuning in the field. The proposed approach uses 2.5x fewer Look-Up Tables (LUTs) and 3.38x fewer registers than the current most resource-frugal design and achieves up to 129x energy reduction compared with low-power microcontrollers running the same ML application.
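
To make the multiplication-free claim concrete, the following is a minimal software sketch of TM inference that uses only bitwise AND, OR, NOT (written as XOR with 1) and integer accumulation. It is not the accelerator's implementation: the function names, the `MODEL` layout, and the toy clauses are hypothetical, and the rule that a clause with no Include actions outputs 0 is an assumption taken from common TM software practice rather than from this paper.

```python
def to_literals(features):
    # Each Boolean feature x yields two literals: x and NOT x.
    return list(features) + [x ^ 1 for x in features]

def clause_output(include_mask, lits):
    # A clause is the AND of its included literals.
    # Assumption: a clause with no includes outputs 0 at inference time.
    if not any(include_mask):
        return 0
    out = 1
    for inc, lit in zip(include_mask, lits):
        # An excluded literal (inc == 0) contributes a constant 1 to the AND.
        out &= lit | (inc ^ 1)
    return out

def class_sum(clauses, lits):
    # Positive clauses vote +1, negative clauses vote -1; no multiplies needed.
    total = 0
    for positive, mask in clauses:
        if clause_output(mask, lits):
            total += 1 if positive else -1
    return total

def predict(model, features):
    # The predicted class is the one with the largest class sum.
    lits = to_literals(features)
    sums = [class_sum(clauses, lits) for clauses in model]
    return sums.index(max(sums))

# Hypothetical two-class model over two features.
# Literal order: [x0, x1, NOT x0, NOT x1].
MODEL = [
    [(True, [1, 0, 0, 1]), (False, [0, 1, 0, 0])],  # class 0
    [(True, [0, 1, 0, 0]), (False, [1, 0, 0, 0])],  # class 1
]

if __name__ == "__main__":
    print(predict(MODEL, [1, 0]), predict(MODEL, [0, 1]))
```

Every operation in the inner loop is a single-bit AND, OR, or NOT, and the only arithmetic is incrementing or decrementing an accumulator, which is why this style of compute maps onto LUTs without multiplier blocks.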

Paper Structure

This paper contains 6 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparing the proposed design (3480 LUTs configuration) to state-of-the-art accelerator automation flows targeting FPGAs. All accelerators were designed for MNIST. Each vertical line indicates the max LUTs of an off-the-shelf eFPGA platform. This work uses 2.5x fewer LUTs than the next closest work (MATADOR).
  • Figure 2: Core components of the Tsetlin Machine: Input conversion to Boolean literals, the Tsetlin Automata (TA) and Clause compute.
  • Figure 3: 1: The class sum compute in the original TM algorithm. 2: The impact of Includes and Excludes in the Clause Output computation, showing that Excludes become redundant during inference. 3: The traversal of a trained TM model when only considering included TAs. 4: The encoding instruction used to create a compressed TM model, adapted from the approach used by REDRESS.
  • Figure 4: Overview of the Proposed Accelerator (Base Version): 1: An incoming data stream to the accelerator; the header packet of the stream is used for configuration. 2: The bit-fields of the header when the data stream contains instructions (the TM model). 3: The bit-fields of the header when the data stream contains input Boolean features. 4: The instruction fetching and decoding process. 5: Selecting Boolean literals that match the TA Include actions. 6: Accumulators for the clause outputs and class sums when performing the compressed inference.
  • Figure 5: Timing diagrams of the programming, inference and execution cycle of an instruction (1). The instruction execution cycle (2) is a per-core process - it would be the same for a multi-core version of the accelerator.
  • ...and 4 more figures
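
The Figure 3 caption notes that Excludes become redundant during inference, so a trained model can be traversed using only its included TAs. The sketch below illustrates that idea in software by storing, for each clause, just the positions of its Include actions and checking that the result matches a dense evaluation on every input. The index-list format is a hypothetical stand-in; it does not reproduce the paper's instruction encoding or the REDRESS format, and the empty-clause-outputs-0 rule is an assumption.

```python
from itertools import product

def to_literals(features):
    # Literal order: all features first, then their negations.
    return list(features) + [x ^ 1 for x in features]

def compress(include_masks):
    # Keep only the literal indices whose TA action is Include.
    return [[i for i, inc in enumerate(mask) if inc] for mask in include_masks]

def clause_dense(mask, lits):
    # Reference evaluation that visits every literal, included or not.
    if not any(mask):
        return 0  # assumption: empty clauses output 0 at inference
    out = 1
    for inc, lit in zip(mask, lits):
        out &= lit | (inc ^ 1)
    return out

def clause_compressed(included, lits):
    # Include-only evaluation: excluded literals are never read.
    if not included:
        return 0
    out = 1
    for i in included:
        out &= lits[i]
    return out

def models_agree(include_masks, n_features):
    # Exhaustively compare dense and compressed clause outputs.
    packed = compress(include_masks)
    for features in product((0, 1), repeat=n_features):
        lits = to_literals(features)
        for mask, included in zip(include_masks, packed):
            if clause_dense(mask, lits) != clause_compressed(included, lits):
                return False
    return True

# Hypothetical clauses over three features.
# Literal order: [x0, x1, x2, NOT x0, NOT x1, NOT x2].
MASKS = [
    [1, 0, 0, 0, 0, 1],  # x0 AND NOT x2
    [0, 1, 0, 0, 0, 0],  # x1
    [0, 0, 0, 0, 0, 0],  # no includes
]

if __name__ == "__main__":
    print(compress(MASKS))
    print(models_agree(MASKS, 3))
```

Because trained TM clauses typically include only a small fraction of the available literals, an include-only representation of this kind stores far fewer entries than the dense mask, which is the property that lets a model fit in on-chip block RAM.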