Time Frequency Analysis of EMG Signal for Gesture Recognition using Fine grained Features

Parshuram N. Aarotale, Ajita Rattani

TL;DR

This work tackles EMG-based gesture recognition by targeting fine-grained time-frequency cues that conventional CNNs often miss. It introduces XMANet, a cross-layer mutual attention network that treats CNN layers at different depths as experts and exchanges attention among them, from shallow to deep, augmented by attention-based crops and a mutual learning schedule. STFT spectrograms and wavelet-based scalograms serve as the time-frequency inputs, and experiments on Grabmyo and FORS-EMG show consistent accuracy gains over strong CNN baselines across multiple backbones. The reported gains suggest strong potential for prosthetic control and human-machine interfaces; future work targets fairness and self-supervised signal representations.

Abstract

Electromyography (EMG)-based hand gesture recognition converts forearm muscle activity into control commands for prosthetics, rehabilitation, and human-computer interaction. This paper proposes a novel approach to EMG-based hand gesture recognition that uses fine-grained classification and presents XMANet, which unifies low-level local and high-level semantic cues through cross-layer mutual attention among shallow-to-deep CNN experts. Using stacked spectrograms and scalograms derived from the Short-Time Fourier Transform (STFT) and Wavelet Transform (WT), we benchmark XMANet against ResNet50, DenseNet121, MobileNetV3, and EfficientNetB0. On the Grabmyo dataset with STFT inputs, the proposed XMANet outperforms the baseline ResNet50, EfficientNetB0, MobileNetV3, and DenseNet121 models by approximately 1.72%, 4.38%, 5.10%, and 2.53%, respectively. With WT inputs, improvements of around 1.57%, 1.88%, 1.46%, and 2.05% are observed over the same baselines. Similarly, on the FORS-EMG dataset with STFT inputs, XMANet(ResNet50) improves on the baseline ResNet50 by about 5.04%, while XMANet(DenseNet121) and XMANet(MobileNetV3) yield improvements of approximately 4.11% and 2.81%, respectively. With WT inputs on the same dataset, XMANet achieves gains of around 4.26%, 9.36%, 5.72%, and 6.09% over the baseline ResNet50, DenseNet121, MobileNetV3, and EfficientNetB0 models, respectively. These results indicate that XMANet consistently improves performance across architectures and signal processing techniques, demonstrating the strong potential of fine-grained features for accurate and robust EMG classification.
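
To make the STFT input representation concrete, the following is a minimal sketch of how a per-channel log-magnitude spectrogram could be computed and stacked across channels. It is not the authors' pipeline: the function names `stft_spectrogram` and `stack_channels`, the Hann window, the window length, the hop size, and the sampling rate `fs = 2048` are all illustrative assumptions, and the signal is synthetic rather than real EMG.

```python
import numpy as np

def stft_spectrogram(signal, fs, win_len=256, hop=64):
    """Return (freqs, times, log-magnitude spectrogram) for a 1-D signal.

    win_len and hop are assumed values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)       # (n_frames, win_len // 2 + 1)
    magnitude = np.abs(spectrum).T               # (freq_bins, n_frames)
    log_mag = 20.0 * np.log10(magnitude + 1e-10) # dB scale, guarded against log(0)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    return freqs, times, log_mag

def stack_channels(channels, fs, **kwargs):
    """Stack per-channel spectrograms into a (channels, freq, time) array."""
    return np.stack([stft_spectrogram(ch, fs, **kwargs)[2] for ch in channels])

# Synthetic two-channel example (hypothetical data, 1 second long).
fs = 2048
t = np.arange(fs) / fs
ch1 = np.sin(2 * np.pi * 128 * t)
ch2 = np.sin(2 * np.pi * 256 * t)

freqs, times, spec = stft_spectrogram(ch1, fs)
peak_hz = freqs[np.argmax(spec.mean(axis=1))]    # dominant frequency of ch1
stacked = stack_channels([ch1, ch2], fs)
```

The stacked array has one time-frequency image per electrode channel, which is the general shape a CNN backbone would consume; a scalogram variant would replace the windowed FFT with a wavelet transform.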

Paper Structure

This paper contains 27 sections, 11 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Outline of the proposed method. A) Pre-processing; B) feature extraction and classification techniques for gesture recognition.
  • Figure 2: The XMANet method, illustrated with three experts $e_{1}$, $e_{2}$, $e_{3}$ on a 5-stage backbone CNN (e.g., ResNet50). The working of each expert and the concatenation of experts are depicted in different colors. Each expert receives feature maps from a specific layer as input and generates a categorical prediction along with an attention region, which other experts use for data augmentation. The architecture is trained in multiple steps within each iteration: the deepest expert ($e_{3}$) is trained first, followed by the shallower experts, and in the last step the concatenation of experts is trained to enhance overall performance.
  • Figure 3: A) Electrode positions [pradhan2022multi]; B) gesture list for the Grabmyo dataset [pradhan2022multi]; C) gesture list for the FORS-EMG dataset [rumman2024fors].
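
The per-iteration training order described in the Figure 2 caption can be sketched as a simple schedule. This is only an illustration of the ordering: the function name `training_steps` and the step labels are hypothetical, and the assumption that the shallower experts are visited from deeper to shallower is ours, since the caption states only that they follow the deepest expert.

```python
def training_steps(num_experts):
    """Return the order of optimisation steps within one training iteration.

    Deepest expert first, then the shallower experts (assumed here to go
    from deeper to shallower), and finally the concatenation of all experts.
    """
    steps = [f"e{k}" for k in range(num_experts, 0, -1)]
    steps.append("concat")
    return steps

schedule = training_steps(3)
print(schedule)  # ['e3', 'e2', 'e1', 'concat']
```

In an actual training loop, each label would correspond to one forward and backward pass using that expert's prediction, with attention regions from earlier steps available for cropping the input.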