emg2pose: A Large and Diverse Benchmark for Surface Electromyographic Hand Pose Estimation

Sasha Salter, Richard Warren, Collin Schlager, Adrian Spurr, Shangchen Han, Rohin Bhasin, Yujun Cai, Peter Walkington, Anuoluwapo Bolarinwa, Robert Wang, Nathan Danielson, Josh Merel, Eftychios Pnevmatikakis, Jesse Marshall

TL;DR

The paper introduces emg2pose, the largest public dataset for wrist sEMG-based hand pose estimation, combining 193 participants, 370 hours, 29 gesture stages, 16-channel sEMG at 2 kHz, and 26-camera mocap ground truth. It defines two tasks (pose regression and tracking) along with held-out evaluation settings across unseen users, stages, and user-stage combinations, and provides three competitive baselines including a velocity-based model, vemg2pose. Results show vemg2pose achieving the strongest generalization performance, with analyses revealing how dataset scale, stage diversity, and anatomical variability influence accuracy. The benchmark enables systematic study of generalized sEMG-to-pose decoding and aims to accelerate robust, non-vision-based hand control for AR/VR and related applications. Overall, emg2pose establishes a platform for advancing biosignal-driven human-computer interfaces and highlights key directions for overcoming generalization challenges.

Abstract

Hands are the primary means through which humans interact with the world. Reliable and always-available hand pose inference could yield new and intuitive control schemes for human-computer interactions, particularly in virtual and augmented reality. Computer vision is effective but requires one or multiple cameras and can struggle with occlusions, limited field of view, and poor lighting. Wearable wrist-based surface electromyography (sEMG) presents a promising alternative as an always-available modality sensing muscle activities that drive hand motion. However, sEMG signals are strongly dependent on user anatomy and sensor placement, and existing sEMG models have required hundreds of users and device placements to effectively generalize. To facilitate progress on sEMG pose inference, we introduce the emg2pose benchmark, the largest publicly available dataset of high-quality hand pose labels and wrist sEMG recordings. emg2pose contains 2 kHz, 16-channel sEMG and pose labels from a 26-camera motion capture rig for 193 users, 370 hours, and 29 stages with diverse gestures, a scale comparable to vision-based hand pose datasets. We provide competitive baselines and challenging tasks evaluating real-world generalization scenarios: held-out users, sensor placements, and stages. emg2pose provides the machine learning community with a platform for exploring complex generalization problems, holding potential to significantly enhance the development of sEMG-based human-computer interactions.
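
The held-out evaluation settings mentioned above (unseen users, unseen stages, and unseen user-stage combinations) can be illustrated with a toy partition. This is a minimal sketch, not the benchmark's actual split logic: the function name `split_recordings`, the user and stage identifiers, and the choice of which ones are held out are all hypothetical.

```python
from itertools import product


def split_recordings(recordings, heldout_users, heldout_stages):
    """Partition (user, stage) pairs into a training set and three test sets.

    Hypothetical helper for illustration only; the real benchmark defines
    its own train/val/test assignment.
    """
    splits = {"train": [], "user": [], "stage": [], "user_stage": []}
    for user, stage in recordings:
        new_user = user in heldout_users
        new_stage = stage in heldout_stages
        if new_user and new_stage:
            # Neither the user nor the stage was seen during training.
            splits["user_stage"].append((user, stage))
        elif new_user:
            # Unseen user performing a stage that appears in training.
            splits["user"].append((user, stage))
        elif new_stage:
            # Training user performing a stage absent from training.
            splits["stage"].append((user, stage))
        else:
            splits["train"].append((user, stage))
    return splits


# Toy data: 4 made-up users x 3 made-up stages (not the real 193 x 29).
recordings = list(product(["u1", "u2", "u3", "u4"], ["s1", "s2", "s3"]))
splits = split_recordings(recordings, heldout_users={"u4"}, heldout_stages={"s3"})

for name, items in splits.items():
    print(name, len(items))
```

Keeping the three test sets disjoint makes it possible to attribute an accuracy drop to user variability, to stage novelty, or to both at once.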

Paper Structure

This paper contains 40 sections, 3 equations, 15 figures, 9 tables.

Figures (15)

  • Figure 1: We introduce the emg2pose dataset and benchmark to facilitate the development of pose estimation models from sEMG. Our vemg2pose model can estimate hand pose in real time (bottom) for held-out users wearing an sEMG wristband (top). See text for further details.
  • Figure 2: Dataset composition: a) sEMG-RD wristband and motion capture marker (white dots) setup. b) Dataset breakdown. i) Users are prompted to perform a sequence of movement types (gestures), such as counting up and down. sEMG and poses are recorded simultaneously. ii) Groups of specific gesture types comprise a stage, such as counting. Stages are partitioned into train/val/test splits (see the section on held-out settings). Our dataset consists of $29$ diverse stages. iii) Each of the $193$ users performs various stages, taking the wristband off and putting it back on between them. In total we record $370$ hours of data.
  • Figure 3: vemg2pose tracking performance breakdown by stage and generalization condition. Distributions are over users. Note the variability in performance across stages. Each box shows the median and interquartile range (IQR), and whiskers show the minimum and maximum values that are within 1.5 times the IQR of the lower and upper quartiles.
  • Figure 4: vemg2pose tracking results with/without occlusion (left) and physical interactions (right). Distributions are over users. See the supplementary section on challenging stages for more details.
  • Figure 5: Median-percentile held-out user and stage (Counting2). Top: motion capture; bottom: vemg2pose tracking predictions. Clips unroll evenly left-to-right over a $2$ second segment.
  • ...and 10 more figures
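
The box-plot convention described in the Figure 3 caption (median, IQR, and whiskers at the most extreme values within 1.5 times the IQR of the quartiles) can be sketched as follows. The helper name `box_stats` and the error values are made up for illustration, and the quartile interpolation method is an assumption, since the caption does not specify one.

```python
from statistics import quantiles


def box_stats(values):
    """Compute box-plot statistics under the 1.5 x IQR whisker convention."""
    # "inclusive" uses linear interpolation between data points (assumed here).
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme data points that lie inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Hypothetical per-user error values; 40.0 falls outside the upper fence.
errors = [10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 40.0]
stats = box_stats(errors)
print(stats)
```

Under this convention a single poorly decoded user (such as the 40.0 above) does not stretch the whisker, so the whisker length reflects the bulk of the user distribution.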