
PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery

Elkhan Ismayilzada, Yufei Zhang, Zijun Cui

Abstract

Significant advances in reconstructing hands from images have delivered accurate single-frame estimates, yet these estimates often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance of the motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to integrate physics more effectively. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimates align with the physical plausibility of the motion in image-based estimates.
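
For concreteness, the Euler-Lagrange formulation the abstract alludes to can be written in the standard manipulator form; the notation below ($q_t$, $M$, $C$, $g$, $\tau_t$) is conventional shorthand for illustration, not the paper's own definitions:

$$M(q_t)\,\ddot{q}_t + C(q_t,\dot{q}_t)\,\dot{q}_t + g(q_t) = \tau_t,$$

where $q_t$ are the joint angles, $M$ the inertia matrix, $C$ the Coriolis/centrifugal terms, $g$ the gravity term, and $\tau_t$ the joint torques. The dynamic residual of an estimated motion $\hat{q}_{1:T}$,

$$r_t = M(\hat{q}_t)\,\ddot{\hat{q}}_t + C(\hat{q}_t,\dot{\hat{q}}_t)\,\dot{\hat{q}}_t + g(\hat{q}_t) - \tau_t,$$

would then be treated as a virtual observable, e.g. $r_t \sim \mathcal{N}(0,\sigma^2 I)$, rather than being constrained to vanish exactly. This is one plausible reading of the abstract; the paper's exact residual definition may differ.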

Paper Structure

This paper contains 23 sections, 28 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Refined motion estimates by PAD-Hand with dynamic variance. Top: Image-based estimates (left) are refined by our model (PAD-Hand) (right) to enforce temporal and physics consistency. Bottom: Joint-level (left) and mesh-level (right) variance maps concentrate on frames/regions where the image-based motion estimate is unreliable (highlighted in red), aligning high variance with poor motion estimates. The color bar shows normalized variance (low to high).
  • Figure 2: Overview of PAD-Hand. A sequence of images $\mathcal{I}_{1:T}$ is passed through an image-based pose estimator to obtain per-frame pose estimates $\theta_{1:T}$ and the average shape estimate $\beta_{avg}$. The pose estimates are then refined via a diffusion process to obtain temporally coherent motion. Simultaneously, we propagate the variance at each diffusion step, starting from a Dirac delta distribution at diffusion step $N$, to obtain per-frame dynamic variance estimates (see the sketch after this list). At each diffusion step, the backbone predicts the clean motion $\hat{x}_{1:T}$, which is supervised with the data-driven loss $\mathcal{L}_{data}$ and the physics-driven loss $\mathcal{L}_{EL}$ during training.
  • Figure 3: Backbone architecture. At diffusion step $n$, the current pose sequence ${x}^{n}_{1:T}$ and image-based estimates ${y}_{1:T}$ are converted to meshes and encoded by MeshCNN [chen2022mobrecon], while an MLP encodes $n$. A Transformer encoder-decoder fuses these features, and an LLLA head predicts the refined pose sequence $\hat{{x}}_{1:T}$ and its variance $\mathrm{Var}(\hat{{x}}_{1:T})$.
  • Figure 4: Refined motion estimates by PAD-Hand with dynamic variance on DexYCB. We visualize three representative sequences (I–III). In each block, row (a) compares the original image-based motion estimates to the trajectories refined by PAD-Hand, while row (b) shows the corresponding variance estimations in terms of joint-level and mesh-level dynamic variance. The red boxes highlight frames where the image-based estimates exhibit strong jitter.
  • Figure 5: Distribution of dynamic variances for PAD-Hand. Bar color encodes the mean Euler–Lagrange residual within each variance bin (blue is low, red is high). Higher variance bins coincide with larger residuals, indicating that the model’s uncertainty aligns with physics violations.
  • ...and 1 more figure
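
To make the variance pipeline described in Figures 2 and 3 concrete, below is a minimal, illustrative sketch of its two variance-related ingredients: the predictive variance of a linear last layer under a Laplace (Gaussian) posterior over its weights, and step-wise propagation of that variance starting from a Dirac delta at diffusion step $N$. Everything here (shapes, the toy covariance `Sigma`, and the additive propagation rule) is an assumption for illustration, not the authors' implementation.

```python
import torch

def lll_predictive_variance(phi, Sigma):
    """Per-output predictive variance of a linear last layer whose
    weights carry a Gaussian (Laplace-approximated) posterior.

    phi   : (..., F) last-layer features for each frame/joint
    Sigma : (F, F) posterior covariance of one output unit's weights
    Returns phi^T Sigma phi, with shape (...).
    """
    return torch.einsum('...i,ij,...j->...', phi, Sigma, phi)

# Toy setup: T frames, J joints, F-dimensional last-layer features.
T, J, F = 16, 21, 64
Sigma = 1e-3 * torch.eye(F)        # assumed posterior covariance (toy)
N = 50                             # number of diffusion steps

var = torch.zeros(T, J)            # Dirac delta at step N: zero variance
for n in reversed(range(N)):
    phi = torch.randn(T, J, F)     # stand-in for backbone features at step n
    # Accumulate the per-step LLLA variance; the paper's propagation
    # rule may differ (this additive update is an assumption).
    var = var + lll_predictive_variance(phi, Sigma)

print(var.shape)                   # (T, J): per-frame, per-joint variance
```

In the actual model, the features would come from the Transformer decoder at each diffusion step, and high entries of the resulting per-frame, per-joint variance map would flag where physics consistency weakens, as visualized in Figures 1 and 4.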