Re-evaluating Position and Velocity Decoding for Hand Pose Estimation with Surface Electromyography

Nima Hadidi, Johannes Lee, Ebrahim Feghhi, Michael Yuan, Jonathan C. Kao

TL;DR

Although position decoding models exhibit greater local jitter, a causal speed-adaptive filter preserves their accuracy advantage while yielding a more favorable smoothness-accuracy tradeoff than velocity decoding, establishing a new state of the art among published streaming-compatible models on this benchmark.

Abstract

Recent progress in real-time hand pose estimation from surface electromyography (sEMG) has been driven by the emg2pose benchmark, whose original baseline study concluded that velocity decoding outperforms position decoding in both reconstruction accuracy and trajectory smoothness. We revisit that conclusion under the original causal evaluation protocol. Using the same core architecture but a more stable training recipe, we show that position decoding models were previously underestimated because they are highly sensitive to a previously unswept decoder output scalar and can otherwise collapse into low-movement solutions. Once this scalar is tuned, position decoding outperforms velocity decoding on the Tracking task across all three emg2pose generalization conditions, consistent with greater robustness to error accumulation. On the Regression task, the gap between position and velocity decoding is much smaller; instead, the largest gains come from multi-task training with Tracking, suggesting that the Regression objective alone does not sufficiently constrain the learned dynamics. Although position decoding models exhibit greater local jitter, a causal speed-adaptive filter preserves their accuracy advantage while yielding a more favorable smoothness-accuracy tradeoff than velocity decoding. Altogether, our results revise the original emg2pose modeling conclusions and establish a new state of the art among published streaming-compatible models on this benchmark.
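
To make the error-accumulation argument concrete, the toy sketch below contrasts the two output parameterizations on a single synthetic joint angle. It is not the emg2pose model or the authors' code: the helper `integrate_velocity`, the sinusoidal trajectory, the frame interval `dt`, and the constant prediction bias are all assumptions chosen for illustration. The only idea taken from the text is that a velocity decoder's pose is obtained by integrating its outputs from a known starting pose, so a small per-step error compounds, whereas a position decoder's error does not.

```python
import numpy as np


def integrate_velocity(initial_pose, velocities, dt):
    """Causal rollout: pose[t] = initial_pose + dt * sum(velocities[: t + 1])."""
    return initial_pose + dt * np.cumsum(velocities, axis=0)


# Hypothetical setup: 500 frames of one joint angle, 10 ms apart.
T, dt = 500, 0.01
times = np.arange(1, T + 1) * dt
true_pose = np.sin(2 * np.pi * 0.5 * times)  # pose at t = dt, 2*dt, ...
# Finite-difference velocities that reproduce true_pose exactly from pose 0.
true_vel = np.diff(np.concatenate([[0.0], true_pose])) / dt

# Assumed (illustrative) constant bias added to each decoder's output.
bias = 0.02
vel_pred = true_vel + bias  # velocity decoder output
pos_pred = true_pose + bias  # position decoder output

vel_err = np.abs(integrate_velocity(0.0, vel_pred, dt) - true_pose)
pos_err = np.abs(pos_pred - true_pose)

print(f"velocity decoding error: first={vel_err[0]:.4f}, last={vel_err[-1]:.4f}")
print(f"position decoding error: first={pos_err[0]:.4f}, last={pos_err[-1]:.4f}")
```

In this toy setting the integrated velocity error starts near zero and grows linearly with the number of frames, while the position error stays at the size of the bias. Real decoders make correlated, input-dependent errors, so the sketch shows only the mechanism, not the magnitude of the effect reported in the paper.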

Paper Structure (16 sections, 11 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Validation trajectories reveal that inappropriate output scaling can trap position decoding models in low-movement solutions. Validation loss (top) and mean predicted speed (bottom) across epochs for position decoding models trained on Tracking (left) and Regression (right) with different decoder output scalars $s$. Thin lines show individual seeds. With the default scalar used by Salter et al. (2024) ($s=0.01$; red), training is often unstable and frequently converges to low-movement solutions, reflected in persistently low predicted speed. Increasing the scalar yields stable convergence for Tracking at $s=0.1$ and for Regression at $s=1$, indicating that position decoding performance is highly sensitive to output scaling.
  • Figure 2: Representative trajectory examples illustrate qualitative differences between model classes. Top row: on the Tracking task, position decoding models often appear more locally jagged than velocity decoding models. Bottom row: on the Regression task, multi-task position decoding models exhibit more coherent local dynamics than single-task position decoding models. Dashed black lines show ground-truth trajectories. Figure 3 quantifies these effects across sample windows and frequency bands.
  • Figure 3: Error dynamics across time and frequency reveal complementary effects of output parameterization and multi-task training. (a) Mean angular error as a function of time within sample windows for Tracking (solid) and Regression (dotted) under the user_stage, user, and stage generalization conditions. Tracking models begin from near-zero error and accumulate error over time, with position decoding models generally accumulating error more slowly than velocity decoding models. In contrast, Regression models generally improve over time as additional sEMG context becomes available, with multi-task models showing faster error reduction than single-task models. (b) Frequency-domain comparison of model residuals. For each model, we compute residuals relative to the ground-truth targets, take the absolute value of the FFT coefficients of those residuals, and then plot pairwise differences between models. Positive values indicate lower residual magnitude for the second model named in the comparison, while negative values indicate lower residual magnitude for the first. Top row: comparing velocity and position decoding models on Tracking shows that position decoding reduces low-frequency residuals but increases high-frequency residuals, consistent with improved robustness to drift but greater high-frequency jitter. Bottom row: comparing single-task and multi-task models on Regression shows that multi-task training reduces residuals broadly across frequencies.
  • Figure 4: A simple speed-adaptive causal filter improves smoothness while preserving accuracy. For each model, we sweep the filter parameter $\beta$ and plot accuracy against the mean speed of the resulting filtered trajectories under the user_stage, user, and stage generalization conditions. Star markers denote the unfiltered models. Top two rows: Tracking task results for angular error and landmark distance. Although unfiltered position decoding models have higher mean speed than velocity decoding models, filtering allows them to achieve lower error across the full observed range of mean speeds, effectively bypassing the apparent smoothness-accuracy tradeoff. Bottom two rows: Regression task results. Multi-task training substantially lowers mean speed relative to single-task training for both output parameterizations, and filtering further reduces speed while largely preserving accuracy. At matched smoothness levels, multi-task models consistently outperform single-task models, while position and velocity decoding models show near-parity.
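
The analyses described in the captions for Figures 3(b) and 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The captions specify only that residual spectra are the absolute FFT coefficients of (prediction minus target), that comparisons are pairwise differences in which positive values favor the second model, and that the filter is causal, speed-adaptive, and controlled by a parameter $\beta$. Everything else here is hypothetical: the function names, the use of a real FFT along the time axis, the definition of mean speed as mean absolute frame-to-frame change per second, and above all the filter's form, which is modeled on the widely used One Euro filter (an exponential smoother whose cutoff frequency rises with estimated speed) and may differ from the filter used in the paper.

```python
import numpy as np


def residual_spectrum(pred, target):
    """Magnitude of the FFT of the residual (pred - target) along time (axis 0)."""
    return np.abs(np.fft.rfft(np.asarray(pred) - np.asarray(target), axis=0))


def spectrum_difference(pred_first, pred_second, target):
    """First minus second: positive values mean the second model has lower residual."""
    return residual_spectrum(pred_first, target) - residual_spectrum(pred_second, target)


def mean_speed(traj, dt):
    """Assumed smoothness proxy: mean absolute frame-to-frame change per second."""
    return np.mean(np.abs(np.diff(traj, axis=0))) / dt


def _alpha(cutoff_hz, dt):
    """Smoothing factor of a first-order low-pass filter with the given cutoff."""
    tau = 1.0 / (2.0 * np.pi * cutoff_hz)
    return 1.0 / (1.0 + tau / dt)


def speed_adaptive_filter(x, dt, beta, min_cutoff=1.0, d_cutoff=1.0):
    """One Euro-style causal filter (assumed form): cutoff = min_cutoff + beta * |speed|.

    x has shape (T,) or (T, J); each output frame depends only on past and
    current input frames, so the filter can run in a streaming setting.
    """
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    speed_hat = np.zeros_like(x[0])  # low-passed speed estimate
    for t in range(1, len(x)):
        speed = (x[t] - x[t - 1]) / dt
        a_d = _alpha(d_cutoff, dt)
        speed_hat = a_d * speed + (1.0 - a_d) * speed_hat
        # Faster motion raises the cutoff, so less smoothing (and less lag) is applied.
        a = _alpha(min_cutoff + beta * np.abs(speed_hat), dt)
        y[t] = a * x[t] + (1.0 - a) * y[t - 1]
    return y
```

Sweeping `beta` and recording `mean_speed` of the filtered output alongside an accuracy metric would give one point per setting on a smoothness-accuracy curve of the kind the Figure 4 caption describes. How the spectra are aggregated across windows and joints is not stated in the captions and is left out of the sketch.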