ReactEMG: Stable, Low-Latency Intent Detection from sEMG via Masked Modeling

Runsheng Wang, Xinyue Zhu, Ava Chen, Jingxi Xu, Lauren Winterbottom, Dawn M. Nilsen, Joel Stein, Matei Ciocarlie

TL;DR

ReactEMG addresses the challenge of real-time, stable, zero-shot sEMG intent detection by reframing EMG-based control as streaming segmentation. It introduces a masked multimodal transformer that jointly encodes EMG signals and intent tokens, trained with a dynamic masking strategy to align muscle activations with user intents and produce per-timestep predictions. The method, coupled with a bounded look-ahead aggregation, achieves state-of-the-art zero-shot performance and favorable latency-stability trade-offs on both standard EMG datasets and the new ROAM-EMG collection, without subject-specific calibration. By pretraining on large public EMG datasets and fine-tuning on a diverse, transition-rich dataset, ReactEMG demonstrates robust, plug-and-play online control applicable to wearable robotics, rehabilitation devices, and prosthetics, with open-source code and data to foster further research.

Abstract

Surface electromyography (sEMG) signals show promise for effective human-machine interfaces, particularly in rehabilitation and prosthetics. However, challenges remain in developing systems that respond quickly to user intent, produce stable flicker-free output suitable for device control, and work across different subjects without time-consuming calibration. In this work, we propose a framework for EMG-based intent detection that addresses these challenges. We cast intent detection as per-timestep segmentation of continuous sEMG streams, assigning labels as gestures unfold in real time. We introduce a masked modeling training strategy that aligns muscle activations with their corresponding user intents, enabling rapid onset detection and stable tracking of ongoing gestures. In evaluations against baseline methods, using metrics that capture accuracy, latency and stability for device control, our approach achieves state-of-the-art performance in zero-shot conditions. These results demonstrate its potential for wearable robotics and next-generation prosthetic systems. Our project website, video, code, and dataset are available at: https://reactemg.github.io/
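
The per-timestep formulation means that overlapping sliding windows each emit logits for the same timestep, and these can be combined before a label is committed. The sketch below is one plausible reading of the bounded look-ahead aggregation, not the authors' implementation: it assumes a simple unweighted mean over every window that covers the target timestep and ends no later than a fixed look-ahead horizon. The function name `aggregate_logits` and its parameters are hypothetical.

```python
import numpy as np

def aggregate_logits(window_logits, window_ends, window_len, target_t, lookahead):
    """Average the logits that overlapping windows assign to one timestep.

    window_logits: list of arrays, each of shape (window_len, n_classes)
    window_ends:   list of ints, last timestep (inclusive) covered by each window
    target_t:      timestep whose label we want to commit
    lookahead:     maximum number of future timesteps we are willing to wait for
                   (assumed hyperparameter; it bounds the added latency)
    """
    collected = []
    for logits, end in zip(window_logits, window_ends):
        start = end - window_len + 1
        covers_target = start <= target_t <= end
        within_horizon = end <= target_t + lookahead
        if covers_target and within_horizon:
            # Row of this window's logits that corresponds to target_t.
            collected.append(logits[target_t - start])
    if not collected:
        return None  # no window inside the horizon has seen this timestep
    return np.mean(collected, axis=0)
```

A larger `lookahead` lets more windows vote on each timestep, which tends to smooth the output at the cost of waiting longer before the label is emitted; a value of 0 uses only windows that end exactly at the target timestep.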

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Design space for real-time EMG intent detection. Approaches that rely on full segment observability (top row) wait for an offset signal or a window long enough to cover the entire gesture before making a prediction. They typically assume each gesture is bounded by "Relax" or "Rest" states, or they segment the region that contains the non-rest gesture beforehand. Since dwell times vary, these methods introduce uncontrolled latency and do not match continuous use. Even dense variants (top-right) inherit this dependence on known boundaries. With partial observability and a single label (left column), supervision becomes inconsistent when a window straddles a gesture transition: different assignment rules label nearly identical inputs differently, which delays switches and causes flicker that then requires smoothing. Our formulation (bottom-right) treats intent detection as streaming segmentation: from the observed window it produces dense per-timestep predictions, enabling immediate prediction and unambiguous supervision at transitions.
  • Figure 2: Model overview. EMG signals are mapped into an embedding space via a learnable linear projection, while intent tokens use a lookup-based embedding. Both modalities undergo dynamic span masking and receive modality-specific plus shared positional encodings. They are then concatenated and processed by a transformer encoder, after which outputs are split into EMG and intent branches. The EMG branch is optimized via MSE on masked timesteps, and the intent branch via cross-entropy on masked tokens. Losses are added before backpropagation.
  • Figure 3: Online inference with look-ahead horizon and reduced inference frequency. Five representative overlapping sliding windows are depicted, ending at time steps $t,t{+}10,\dots,t{+}50$ respectively. The small grey vertical bars within each window indicate the per-timestep logits that make up the average.
  • Figure 4: Transition accuracy evaluation protocol. For each ground-truth transition from class $y_{\mathrm{old}}$ to $y_{\mathrm{new}}$, we define a reaction buffer (blue region) centered on the ground-truth transition. The model must predict $y_{\mathrm{new}}$ at least once within this buffer and must not predict any other class. The subsequent maintenance period (red region) extends from the end of the reaction buffer to the start of the next reaction buffer. During this interval, the model must predict $y_{\mathrm{new}}$ at all times. A transition is scored as correct only if both the reaction buffer and maintenance period conditions are satisfied.
  • Figure 5: Qualitative effect of fine-tuning on our dataset (ROAM-EMG). The public-data-only model produces brief, flickering gesture predictions that quickly return to rest (capturing onsets but not maintenance), whereas after fine-tuning the same architecture tracks both onset and sustained gestures and handles transitions that bypass rest.
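
The transition accuracy protocol in the Figure 4 caption can be sketched as a short scoring routine. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the buffer is taken to be symmetric with a hypothetical `half_buffer` width in timesteps, "any other class" is read as any class other than $y_{\mathrm{old}}$ or $y_{\mathrm{new}}$, and the last maintenance period is assumed to run to the end of the recording.

```python
import numpy as np

def transition_accuracy(y_true, y_pred, half_buffer):
    """Fraction of ground-truth transitions that are both reacted to and maintained."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    # k is the first timestep that carries the new ground-truth class.
    transitions = np.flatnonzero(y_true[1:] != y_true[:-1]) + 1
    if len(transitions) == 0:
        return float("nan")  # nothing to score

    correct = 0
    for i, k in enumerate(transitions):
        y_old, y_new = y_true[k - 1], y_true[k]
        buf_start = max(0, k - half_buffer)
        buf_end = min(n, k + half_buffer)  # exclusive
        # Maintenance ends where the next reaction buffer starts (assumption:
        # end of recording for the final transition).
        if i + 1 < len(transitions):
            maint_end = max(0, transitions[i + 1] - half_buffer)
        else:
            maint_end = n

        buf = y_pred[buf_start:buf_end]
        maint = y_pred[buf_end:maint_end]  # empty if adjacent buffers overlap

        reacted = np.any(buf == y_new) and np.all((buf == y_old) | (buf == y_new))
        maintained = np.all(maint == y_new)  # vacuously true for an empty slice
        correct += int(reacted and maintained)
    return correct / len(transitions)
```

Under this scoring, a single flickered timestep in the maintenance period fails the whole transition, which is what makes the metric sensitive to output stability rather than only to raw per-timestep accuracy.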