A Novel Approach To Implementing Knowledge Distillation In Tsetlin Machines

Calvin Kinateder

TL;DR

The paper addresses the problem of keeping accuracy high in resource-constrained Tsetlin Machines (TMs) by introducing two knowledge distillation (KD) frameworks: Clause-Based KD (CKD) and Distribution-Enhanced KD (DKD). CKD feeds the teacher's clause outputs to the student as extra features and uses probabilistic downsampling to limit the resulting growth in training time. DKD combines clause initialization with soft-label-driven training, inspired by neural-network KD, to improve the student's accuracy without adding latency. Across MNIST, KMNIST, EMNIST, and IMDB, DKD generally outperforms both CKD and the baseline student, reaching accuracy close to, and sometimes above, the teacher's, with similar or reduced training and inference times. The work shows that knowledge distillation can yield practical gains for TMs in edge-oriented deployments, controlled by three tunable parameters (temperature $\tau$, balance $\alpha$, and weight-transfer $z$) and a clause-transfer initialization strategy. The contributions are a novel DKD pipeline, the IntelligentTransfer clause initialization, probabilistic clause downsampling, and soft-label-based training adapted to the TM setting, which together extend KD to propositional-logic models.

Abstract

The Tsetlin Machine (TM) is a propositional-logic-based model that uses conjunctive clauses to learn patterns from data. As with typical neural networks, the performance of a Tsetlin Machine depends largely on its parameter count: more parameters yield higher accuracy but slower execution. Knowledge distillation in neural networks transfers information from an already-trained teacher model to a smaller student model, increasing the student's accuracy without increasing its execution time. We propose a novel approach to implementing knowledge distillation in Tsetlin Machines that uses the teacher's output probability distribution for each sample to provide additional context to the student. Additionally, we propose a novel clause-transfer algorithm that weighs the importance of each clause in the teacher and initializes the student with only the most essential clauses. We find that our algorithm can significantly improve the student model's performance without negatively impacting latency in the tested domains of image recognition and text classification.
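To make the idea of "probability distributions of each output sample" concrete, the following is a minimal sketch of the standard neural-network KD recipe that the summary says the method is inspired by: the teacher's per-class scores are softened with a temperature and then blended with the hard label. It is not the paper's algorithm. The function names `soften` and `blended_target`, the use of a softmax over raw class scores, and the way `tau` and `alpha` enter are all assumptions made for illustration; how a student TM would consume such a target is specific to the paper and is not shown.

```python
import math

def soften(class_scores, tau):
    """Temperature-scaled softmax over per-class scores.

    Assumption: the teacher exposes one real-valued score per class
    (for a TM this could be its class sums). Larger tau gives a flatter
    distribution.
    """
    scaled = [s / tau for s in class_scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def blended_target(teacher_scores, true_label, tau=2.0, alpha=0.5):
    """Mix the teacher's softened distribution with the one-hot hard label.

    alpha = 1.0 uses only the teacher's soft labels;
    alpha = 0.0 falls back to the ordinary hard label.
    """
    soft = soften(teacher_scores, tau)
    return [
        alpha * p + (1.0 - alpha) * (1.0 if k == true_label else 0.0)
        for k, p in enumerate(soft)
    ]

# Hypothetical teacher scores for a 3-class sample whose true class is 0.
teacher_scores = [12.0, 3.0, -5.0]
target = blended_target(teacher_scores, true_label=0, tau=4.0, alpha=0.5)
```

In this sketch the blended target still sums to 1 and keeps the teacher's ranking of the wrong classes, which is the "additional context" that hard labels alone do not carry.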

Paper Structure

This paper contains 43 sections, 22 equations, 30 figures, 7 tables, and 4 algorithms.

Figures (30)

  • Figure 1: A visual representation of knowledge distillation.
  • Figure 2: Typical layout of a neural network.
  • Figure 3: A typical neural network layout with softmax.
  • Figure 4: Inference structure for a binary-output Tsetlin Machine.
  • Figure 5: TM learning dynamics with input ($x_1=0, x_2=1$) and output $y=1$.
  • ...and 25 more figures