WristPP: A Wrist-Worn System for Hand Pose And Pressure Estimation

Ziheng Xi, Zihang Ao, Yitao Wang, Mingeze Gao, Wanmei Zhang, Jianjiang Feng, Jie Zhou

TL;DR

Across three user studies, WristPP delivers touchpad-level efficiency in mid-air pointing and robust multi-finger pressure control on an uninstrumented desktop, and it enables a higher success ratio and lower arm fatigue than head-mounted camera-based baselines.

Abstract

Accurate 3D hand pose and pressure sensing is essential for immersive human-computer interaction, yet simultaneously achieving both in mobile scenarios remains a significant challenge. We present WristPP, a camera-based wrist-worn system that estimates 3D hand pose and per-vertex pressure from a single wide-FOV RGB frame in real time. A Vision Transformer (ViT) backbone with joint-aligned tokens predicts Hand-VQVAE codebook indices for mesh recovery, while an extrinsics-conditioned branch jointly estimates per-vertex pressure. On a self-collected dataset of 133,000 frames (20 subjects; 48 on-plane and 28 mid-air gestures), WristPP attains a Mean Per-Joint Position Error (MPJPE) of 2.9 mm, Contact IoU of 0.712, Volumetric IoU of 0.618, and foreground pressure MAE of 10.4 g. Across three user studies, WristPP delivers touchpad-level efficiency in mid-air pointing and robust multi-finger pressure control on an uninstrumented desktop. In a real-world large-display Whac-A-Mole task, WristPP also enables a higher success ratio and lower arm fatigue than head-mounted camera-based baselines. These results position WristPP as an effective, mobile solution for versatile pose- and pressure-based interaction. Website: https://zhenqis123.github.io/WristPP/.
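
To make the reported metrics concrete, the following is a minimal sketch of how MPJPE, Contact IoU, and foreground pressure MAE are commonly computed. The exact definitions used in the paper are not given in this excerpt, so the contact threshold, the choice of "foreground" as the ground-truth contact vertices, and all function names here are assumptions for illustration. Volumetric IoU is omitted because its definition cannot be inferred from the abstract.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between corresponding 3D joints.

    The result is in the same unit as the inputs (e.g. mm).
    """
    assert len(pred_joints) == len(gt_joints) and gt_joints
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(gt_joints)


def contact_iou(pred_pressure, gt_pressure, threshold=0.0):
    """IoU between the predicted and ground-truth contact vertex sets.

    A vertex counts as "in contact" when its pressure exceeds `threshold`
    (an assumed, hypothetical criterion).
    """
    pred = {i for i, p in enumerate(pred_pressure) if p > threshold}
    gt = {i for i, p in enumerate(gt_pressure) if p > threshold}
    union = pred | gt
    # If neither map contains contact, treat the prediction as a perfect match.
    return len(pred & gt) / len(union) if union else 1.0


def foreground_mae(pred_pressure, gt_pressure, threshold=0.0):
    """Mean absolute pressure error over ground-truth contact vertices only."""
    fg = [i for i, p in enumerate(gt_pressure) if p > threshold]
    if not fg:
        return 0.0
    return sum(abs(pred_pressure[i] - gt_pressure[i]) for i in fg) / len(fg)


if __name__ == "__main__":
    # Toy values, not data from the paper.
    print(mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)]))
    print(contact_iou([0, 5, 10, 0], [0, 8, 0, 2]))
    print(foreground_mae([0, 5, 10, 0], [0, 8, 0, 2]))
```

Restricting the MAE to foreground vertices keeps the many zero-pressure vertices from dominating the average, which is one plausible reason to report a "foreground" variant.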

Paper Structure (97 sections, 31 equations, 26 figures, 12 tables)

Figures (26)

  • Figure 1: Hardware design of WristP2. (a) System modules: right—180° FOV ultra-wide RGB camera module (28 $\times$ 25 mm), left—Raspberry Pi Zero 2 W with a battery-backed micro-UPS; (b) Stowed configuration worn on the wrist, with the overall thickness reduced to $\approx$ 10 mm for comfortable long-term wear; (c) In-use configuration with the camera deployed via a 90° magnetic fold-out hinge to provide a wrist-centric view from below the hand (height $\approx$ 28 mm); (d, e) Future integrated concept toward a smartwatch-style form factor.
  • Figure 2: Collection Scene. (a) Visualization of the collection environment; (b) Planar interaction data collection; (c) Mid-air hand pose data collection; (d) The GUI of data collection software; (e) The 21 markers attached at the positions of the hand joint points and the 4 markers attached at the four corners of the Sensel Morph.
  • Figure 3: (a) Distribution of shape parameters $\beta$ across participants. (b) Canonical hand-local coordinate system defined from anatomical markers.
  • Figure 4: Annotation pipeline overview. The pipeline consists of two parallel stages to generate ground-truth data. Top (Hand & Pressure Optimization): We fuse multi-modal inputs—including third-person Kinect RGB images, motion-capture markers, and tactile pressure maps—to jointly optimize the hand mesh geometry $V_l$ and the per-vertex pressure field $P_v$. A differentiable rendering module minimizes the discrepancy between simulated and observed pressure/depth maps. Bottom (Wrist Camera Calibration): Using the reconstructed 3D hand as a reference, we estimate the wrist camera extrinsics $[R_{cam}|t_{cam}]$ relative to the hand-local frame. This is achieved by minimizing the 3D-2D reprojection loss between the projected 3D joints and 2D keypoints detected by RTMPose from the raw wrist-camera inputs.
  • Figure 5: WristP2 pipeline. A ViT-based backbone produces image features that are queried by two sets of learnable tokens (pose tokens and pressure tokens, each of length 21). The wrist-camera extrinsics are embedded and concatenated channel-wise to both token sets before cross-attention, yielding extrinsics-aware token features. The pooled token features are fed into task-specific heads: a codebook-index classification head (whose output is decoded by the Hand-VQVAE decoder), and a pressure branch with two heads for contact classification and pressure regression.
  • ...and 21 more figures
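
The wrist-camera calibration described in the Figure 4 caption minimizes a 3D-2D reprojection loss between projected 3D joints and detected 2D keypoints. The sketch below illustrates that kind of loss under simplifying assumptions that are not taken from the paper: it uses an ideal pinhole model with hypothetical intrinsics (fx, fy, cx, cy), whereas a 180° FOV lens would in practice need a fisheye projection model, and it assumes [R_cam|t_cam] maps hand-local coordinates into the camera frame. All function names are illustrative.

```python
def project(point_hand, R_cam, t_cam, intrinsics):
    """Project a hand-local 3D point to pixel coordinates (assumed pinhole model)."""
    fx, fy, cx, cy = intrinsics
    # Hand-local frame -> camera frame: X_c = R_cam @ X_h + t_cam
    xc = [
        sum(R_cam[r][k] * point_hand[k] for k in range(3)) + t_cam[r]
        for r in range(3)
    ]
    # Perspective division followed by the intrinsic mapping.
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def reprojection_loss(joints_3d, keypoints_2d, R_cam, t_cam, intrinsics):
    """Mean squared pixel distance between projected joints and 2D keypoints."""
    total = 0.0
    for joint, (u, v) in zip(joints_3d, keypoints_2d):
        pu, pv = project(joint, R_cam, t_cam, intrinsics)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(joints_3d)


if __name__ == "__main__":
    # Toy setup: identity rotation, camera 1 unit away along the z-axis.
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    t = [0.0, 0.0, 1.0]
    K = (100.0, 100.0, 50.0, 50.0)  # hypothetical fx, fy, cx, cy
    joints = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
    print(reprojection_loss(joints, [(50.0, 50.0), (63.0, 54.0)], R, t, K))
```

In an actual calibration, R_cam and t_cam would be the free variables of an optimizer that drives this loss down; the sketch only evaluates the loss for a fixed pose.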