WaveFormer: A Lightweight Transformer Model for sEMG-based Gesture Recognition

Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li

TL;DR

WaveFormer addresses the challenge of accurate sEMG-based gesture recognition on resource-constrained devices by integrating a learnable WaveletConv front-end with RoPE-attention in a compact Transformer (3.1M parameters). The model performs multiscale time–frequency analysis and efficient attention, achieving state-of-the-art results on multiple datasets (e.g., 95% on EPN612) and enabling real-time deployment with 6.75 ms latency on CPU using INT8 quantization. Key contributions include a trainable multilevel wavelet decomposition, a residual low-frequency path, and RoPE-based classification, which together provide robustness to session variability and electrode drift. The findings highlight the practical potential for prosthetic control and rehabilitation applications, offering a scalable, frequency-aware approach suitable for wearable devices.

Abstract

Human-machine interaction, particularly in prosthetic and robotic control, has seen progress with gesture recognition via surface electromyographic (sEMG) signals. However, classifying similar gestures that produce nearly identical muscle signals remains a challenge, often reducing classification accuracy. Traditional deep learning models for sEMG gesture recognition are large and computationally expensive, limiting their deployment on resource-constrained embedded systems. In this work, we propose WaveFormer, a lightweight transformer-based architecture tailored for sEMG gesture recognition. Our model integrates time-domain and frequency-domain features through a novel learnable wavelet transform, enhancing feature extraction. In particular, the WaveletConv module, a multi-level wavelet decomposition layer with depthwise separable convolution, ensures both efficiency and compactness. With just 3.1 million parameters, WaveFormer achieves 95% classification accuracy on the EPN612 dataset, outperforming larger models. Furthermore, when profiled on a laptop equipped with an Intel CPU, the INT8-quantized model runs in real time with an inference latency of 6.75 ms.

Paper Structure

This paper contains 23 sections, 7 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of WaveFormer. The raw sEMG input $\mathbf{X}\in\mathbb{R}^{C\times T}$ is first segmented into time-domain patches via a learnable patch projector. These patches are then fed into a WaveletConv module that applies multi-level wavelet decompositions and reconstructions to extract rich, multi-scale features. Next, the resulting representations are passed into a Transformer encoder equipped with RoPEAttention to capture global temporal correlations. Finally, a classification head produces likelihood scores for each hand gesture class, which are used to predict the corresponding gesture.
  • Figure 2: An example WaveletConv pipeline with two levels of wavelet decomposition and reconstruction. Beginning with an input feature map, a discrete wavelet transform (DWT) splits the signal into four sub-bands (LL, LH, HL, HH) at each level. Each sub-band is processed by learnable depthwise convolutions for frequency-specific feature refinement, with optional dropout applied to high-frequency components. The low-frequency (LL) component undergoes recursive decomposition across multiple levels, while high-frequency components are preserved at each scale. After processing all sub-bands, inverse wavelet transforms (IWT) progressively reconstruct the feature map by combining sub-bands from all decomposition levels. A residual connection ensures preservation of essential baseline information. The resulting wavelet-enhanced features are then flattened and fed into the Transformer encoder with RoPEAttention for gesture classification.
  • Figure 3: Ablation study comparing the full WaveFormer model with variants removing the WaveletConv module or rotary embedding across four downstream sEMG datasets.
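
The decomposition and reconstruction flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a fixed orthonormal Haar wavelet (the paper's transform is learnable), replaces the depthwise convolutions with a hypothetical identity `process` placeholder, and omits dropout. The function names and the LH/HL labeling convention are illustrative choices.

```python
import numpy as np

def haar_dwt2(x):
    """One level of an orthonormal 2D Haar DWT (height and width must be even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2              # low-pass in both directions
    lh = (a + b - c - d) / 2              # difference between rows
    hl = (a - b + c - d) / 2              # difference between columns
    hh = (a - b - c + d) / 2              # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the 2x2 blocks from the four sub-bands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def wavelet_decompose(x, levels=2):
    """Recursively decompose the LL band, keeping high-frequency bands per level."""
    highs = []
    ll = x
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        highs.append((lh, hl, hh))
    return ll, highs

def wavelet_reconstruct(ll, highs):
    """Apply inverse transforms from the coarsest level back to the input size."""
    for lh, hl, hh in reversed(highs):
        ll = haar_idwt2(ll, lh, hl, hh)
    return ll

def wavelet_block(x, levels=2, process=lambda band: band):
    """Decompose, process each sub-band, reconstruct, and add a residual path.

    `process` is a hypothetical stand-in for the learnable per-band convolution.
    """
    ll, highs = wavelet_decompose(x, levels)
    ll = process(ll)
    highs = [tuple(process(band) for band in level) for level in highs]
    return x + wavelet_reconstruct(ll, highs)

# Usage: a two-level decomposition of an 8x8 feature map.
feature_map = np.arange(64, dtype=float).reshape(8, 8)
ll, highs = wavelet_decompose(feature_map, levels=2)
reconstructed = wavelet_reconstruct(ll, highs)
```

With the identity placeholder the reconstruction returns the input exactly, so any change in the output of `wavelet_block` comes only from what `process` does to each sub-band. That is the property that lets a learnable per-band filter refine individual frequency ranges without discarding the rest of the signal.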