Knowledge Distillation for Reservoir-based Classifier: Human Activity Recognition

Masaharu Kagiyama, Tsuyoshi Okita

TL;DR

This work tackles the energy cost of HAR on edge devices by introducing PatchEchoClassifier, a reservoir-based time-series classifier that combines a patch tokenizer with an Echo State Network (ESN) reservoir. The lightweight ESN student is trained via knowledge distillation from a high-capacity 1DMLP-Mixer teacher through the Mixer-Echo State Signal Distillation framework, which uses class and distillation tokens. The authors report that PatchEchoClassifier achieves above 80% accuracy while substantially reducing FLOPs, memory footprint, and energy metrics compared with CNN and transformer-based baselines, highlighting its suitability for real-time edge deployment. The study also discusses limitations, such as large Python library footprints, and outlines future directions, including reservoir enhancements and quantization, to further improve energy efficiency.

Abstract

This paper aims to develop an energy-efficient classifier for time-series data by introducing PatchEchoClassifier, a novel model that leverages a reservoir-based mechanism known as the Echo State Network (ESN). The model is designed for human activity recognition (HAR) using one-dimensional sensor signals and incorporates a tokenizer to extract patch-level representations. To train the model efficiently, we propose a knowledge distillation framework that transfers knowledge from a high-capacity MLP-Mixer teacher to the lightweight reservoir-based student model. Experimental evaluations on multiple HAR datasets demonstrate that our model achieves over 80 percent accuracy while significantly reducing computational cost. Notably, PatchEchoClassifier requires only about one-sixth of the floating-point operations (FLOPs) required by DeepConvLSTM, a widely used convolutional baseline. These results suggest that PatchEchoClassifier is a promising solution for real-time and energy-efficient human activity recognition in edge computing environments.

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An overview of the proposed distillation method, Mixer-Echo State Signal Distillation. The input signals are fed into both the student and teacher models. Within the student model, after passing through the tokenizer layer, class and distillation tokens are added to the token set. These two tokens are then fed into separate heads to obtain the outputs. The output from the class token is used to compute the classification loss $L_{CE}$ along with the ground truth labels of the sensor data. The output from the distillation token is used to compute the distillation loss $L_{Dist}$ in conjunction with the output from the teacher model. The loss function of the proposed method is defined as a linear combination $L$ of $L_{CE}$ and $L_{Dist}$, and learning is performed to minimize this function.
  • Figure 2: The architecture of PatchEchoClassifier. As shown in Figure 1, the tokenizer layer divides continuous signals with multiple channels into fixed intervals. Then, the sensor data, along with the class token and distillation token used in this method, are input into the Echo State Network (ESN). The ESN consists of a reservoir layer where matrix operations are performed. The graph of the reservoir layer represents the weight matrix of the ESN, and the number of nodes is determined by the size of the ESN. The two types of tokens used for distillation are then fed into their respective heads to obtain the outputs. During distillation training, the outputs from each head are used with separate loss functions. However, during inference, the class prediction is obtained by averaging the outputs from each head.
  • Figure 3: The architecture of PatchMixerClassifier. In the patch embedding layer, a continuous signal with multiple channels is segmented and position embedding is applied. Then, the sensor data, along with the class token and distillation token used in this method, are input into the Mixer layers. In this experiment, the number of Mixer layers was set to 8. The two types of tokens used for distillation are then fed into their respective heads to obtain the outputs. During distillation training, the outputs from each head are used with separate loss functions. However, during inference, the class prediction is obtained by averaging the outputs from each head.
  • Figure 4: For each model, a line is plotted with accuracy on the vertical axis and the normalized value of the Energy Efficiency Score (EES) on the horizontal axis. Each point is labeled with the corresponding model name. The EES calculation is based on the balanced setting with $(\alpha, \beta, \gamma) = (1/3, 1/3, 1/3)$.
  • Figure 5: For each model, a line is plotted with accuracy on the vertical axis and the normalized value of the Energy Efficiency Score (EES) on the horizontal axis. Each point is labeled with the corresponding model name. The EES calculation is based on the power-saving setting with $(\alpha, \beta, \gamma) = (0.7, 0.2, 0.1)$.
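
The pipeline described in the captions of Figures 1 and 2 (patch tokenizer, ESN reservoir, class and distillation tokens with separate heads, and a loss that linearly combines $L_{CE}$ and $L_{Dist}$) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the patch length, reservoir size, spectral radius, and leak rate; appending the two tokens at the end of the sequence; a KL-divergence form for $L_{Dist}$; the weight `lam`; and the random stand-in for the teacher logits. The captions state only that the total loss is a linear combination of the two terms and that inference averages the two head outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize(x, patch_len):
    """Split a (T, C) signal into non-overlapping patches of shape (T // patch_len, patch_len * C)."""
    T, C = x.shape
    n = T // patch_len
    return x[: n * patch_len].reshape(n, patch_len * C)

def make_reservoir(in_dim, size, spectral_radius=0.9):
    """Fixed random input and recurrent weights; the recurrent matrix is rescaled to the given spectral radius."""
    W_in = rng.uniform(-0.5, 0.5, (size, in_dim))
    W = rng.uniform(-0.5, 0.5, (size, size))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(tokens, W_in, W, leak=0.3):
    """Leaky-integrator ESN update applied token by token; returns one state per token."""
    h = np.zeros(W.shape[0])
    states = []
    for u in tokens:
        h = (1 - leak) * h + leak * np.tanh(W_in @ u + W @ h)
        states.append(h)
    return np.stack(states)

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def combined_loss(cls_logits, dist_logits, label, teacher_logits, lam=0.5):
    """L = (1 - lam) * L_CE + lam * L_Dist, with L_Dist = KL(teacher || student) (assumed form)."""
    l_ce = -log_softmax(cls_logits)[label]
    log_p_t = log_softmax(teacher_logits)
    log_p_d = log_softmax(dist_logits)
    l_dist = np.sum(np.exp(log_p_t) * (log_p_t - log_p_d))
    return (1 - lam) * l_ce + lam * l_dist

# Toy dimensions (hypothetical, not taken from the paper).
T, C, patch_len, n_classes, size = 128, 3, 16, 6, 64
x = rng.standard_normal((T, C))                  # one window of sensor data
tokens = tokenize(x, patch_len)                  # (8, 48)

# Class and distillation tokens; learnable in practice, random here.
cls_tok = 0.02 * rng.standard_normal(tokens.shape[1])
dist_tok = 0.02 * rng.standard_normal(tokens.shape[1])
seq = np.vstack([tokens, cls_tok, dist_tok])     # (10, 48)

W_in, W = make_reservoir(seq.shape[1], size)
states = run_reservoir(seq, W_in, W)             # (10, 64)

# Separate linear heads read the states at the two token positions.
H_cls = 0.1 * rng.standard_normal((n_classes, size))
H_dist = 0.1 * rng.standard_normal((n_classes, size))
cls_logits = H_cls @ states[-2]
dist_logits = H_dist @ states[-1]

teacher_logits = rng.standard_normal(n_classes)  # stand-in for the Mixer teacher output
label = 2
loss = combined_loss(cls_logits, dist_logits, label, teacher_logits)

# Inference: average the two head outputs (probabilities here, an assumption).
probs = (np.exp(log_softmax(cls_logits)) + np.exp(log_softmax(dist_logits))) / 2
pred = int(np.argmax(probs))
```

In this sketch the reservoir weights `W_in` and `W` stay fixed after initialization, which is the usual source of an ESN's low training cost. Only the heads (and, in a trainable setup, the tokens and tokenizer) would receive gradient updates.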