WristPP: A Wrist-Worn System for Hand Pose And Pressure Estimation

Ziheng Xi; Zihang Ao; Yitao Wang; Mingeze Gao; Wanmei Zhang; Jianjiang Feng; Jie Zhou

TL;DR

Across three user studies, WristPP delivers touchpad-level efficiency in mid-air pointing and robust multi-finger pressure control on an uninstrumented desktop, and it enables a higher success ratio and lower arm fatigue than head-mounted camera-based baselines.

Abstract

Accurate 3D hand pose and pressure sensing is essential for immersive human-computer interaction, yet simultaneously achieving both in mobile scenarios remains a significant challenge. We present WristPP, a camera-based wrist-worn system that estimates 3D hand pose and per-vertex pressure from a single wide-FOV RGB frame in real time. A Vision Transformer (ViT) backbone with joint-aligned tokens predicts Hand-VQVAE codebook indices for mesh recovery, while an extrinsics-conditioned branch jointly estimates per-vertex pressure. On a self-collected dataset of 133,000 frames (20 subjects; 48 on-plane and 28 mid-air gestures), WristPP attains a Mean Per-Joint Position Error (MPJPE) of 2.9 mm, Contact IoU of 0.712, Volumetric IoU of 0.618, and foreground pressure MAE of 10.4 g. Across three user studies, WristPP delivers touchpad-level efficiency in mid-air pointing and robust multi-finger pressure control on an uninstrumented desktop. In a real-world large-display Whac-A-Mole task, WristPP also enables higher success ratio and lower arm fatigue than head-mounted camera-based baselines. These results position WristPP as an effective, mobile solution for versatile pose- and pressure-based interaction. Website: https://zhenqis123.github.io/WristPP/.
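The metrics quoted in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' evaluation code: the function names, the array shapes, the contact threshold, and the reading of "foreground" as vertices that are in contact in the ground truth are all assumptions made for illustration. Volumetric IoU is omitted because its definition is not given here.

```python
import numpy as np


def mpjpe_mm(pred_joints, gt_joints):
    """Mean per-joint position error for (N, J, 3) arrays given in millimetres."""
    # Euclidean distance per joint, averaged over all joints and frames.
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())


def contact_iou(pred_pressure, gt_pressure, threshold=0.0):
    """IoU of binary contact masks; threshold is a hypothetical parameter."""
    pred_contact = pred_pressure > threshold
    gt_contact = gt_pressure > threshold
    union = np.logical_or(pred_contact, gt_contact).sum()
    if union == 0:
        # Neither mask reports contact: treated here as perfect agreement.
        return 1.0
    intersection = np.logical_and(pred_contact, gt_contact).sum()
    return float(intersection / union)


def foreground_pressure_mae(pred_pressure, gt_pressure, threshold=0.0):
    """Mean absolute pressure error over ground-truth contact vertices (assumed definition)."""
    mask = gt_pressure > threshold
    if not mask.any():
        return 0.0
    return float(np.abs(pred_pressure[mask] - gt_pressure[mask]).mean())


if __name__ == "__main__":
    # Toy per-vertex pressures in grams for four vertices.
    pred = np.array([0.0, 5.0, 5.0, 0.0])
    gt = np.array([0.0, 5.0, 0.0, 5.0])
    print(contact_iou(pred, gt), foreground_pressure_mae(pred, gt))
```

Restricting the MAE to contact vertices keeps the many zero-pressure vertices from dominating the average, which is one plausible reason to report a foreground variant.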

Paper Structure (97 sections, 31 equations, 26 figures, 12 tables)

Introduction
Related Work
Environment-Mounted Sensing
Head-Mounted Sensing
Wrist-Proximal Sensing
Hardware Implementation
WristP2 Dataset
Data Capture Setup
Data Capture Environment
Lighting Diversity.
Background Control.
Surface Texture Diversity.
Participants
Preparatory Work
Planar Interaction Data Collection
...and 82 more sections

Figures (26)

Figure 1: Hardware design of WristP2. (a) System modules: right—180° FOV ultra-wide RGB camera module (28 $\times$ 25 mm), left—Raspberry Pi Zero 2 W with a battery-backed micro-UPS; (b) Stowed configuration worn on the wrist, with the overall thickness reduced to $\approx$ 10 mm for comfortable long-term wear; (c) In-use configuration with the camera deployed via a 90° magnetic fold-out hinge to provide a wrist-centric view from below the hand (height $\approx$ 28 mm); (d, e) Future integrated concept toward a smartwatch-style form factor.
Figure 2: Collection Scene. (a) Visualization of the collection environment; (b) Planar interaction data collection; (c) Mid-air hand pose data collection; (d) The GUI of the data collection software; (e) The 21 markers attached at the hand joint positions and the 4 markers attached at the four corners of the Sensel Morph.
Figure 3: (a) Distribution of shape parameters $\beta$ across participants. (b) Canonical hand-local coordinate system defined from anatomical markers.
Figure 4: Annotation pipeline overview. The pipeline consists of two parallel stages to generate ground-truth data. Top (Hand & Pressure Optimization): We fuse multi-modal inputs—including third-person Kinect RGB images, motion-capture markers, and tactile pressure maps—to jointly optimize the hand mesh geometry $V_l$ and the per-vertex pressure field $P_v$. A differentiable rendering module minimizes the discrepancy between simulated and observed pressure/depth maps. Bottom (Wrist Camera Calibration): Using the reconstructed 3D hand as a reference, we estimate the wrist camera extrinsics $[R_{cam}|t_{cam}]$ relative to the hand-local frame. This is achieved by minimizing the 3D-2D reprojection loss between the projected 3D joints and 2D keypoints detected by RTMPose from the raw wrist-camera inputs.
Figure 5: WristP2 pipeline. A ViT-based backbone produces image features that are queried by two sets of learnable tokens (pose tokens and pressure tokens, both of length 21). The wrist-camera extrinsics are embedded and concatenated channel-wise to both token sets before cross-attention, yielding extrinsics-aware token features. The pooled token features are fed into task-specific heads: a codebook-index classification head (whose output is decoded by the Hand-VQVAE decoder), and a pressure branch with two heads for contact classification and pressure regression.
...and 21 more figures
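
The wrist-camera calibration step described in the Figure 4 caption can be written as a reprojection objective. The form below is a sketch under assumed notation, not the paper's own equation: $X_j$ denotes the $j$-th reconstructed 3D joint in the hand-local frame, $x_j$ the corresponding 2D keypoint detected by RTMPose, and $\pi(\cdot)$ the projection model of the wide-FOV camera, whose exact form is not specified here. The squared $\ell_2$ penalty and the hand-local-to-camera direction of the transform are also assumptions.

$$
[R_{cam} \mid t_{cam}]^{*} = \arg\min_{R_{cam},\, t_{cam}} \sum_{j=1}^{21} \left\lVert \pi\!\left(R_{cam} X_j + t_{cam}\right) - x_j \right\rVert_2^2
$$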
