WatchHand: Enabling Continuous Hand Pose Tracking On Off-the-Shelf Smartwatches

Jiwan Kim, Chi-Jung Lee, Hohurn Jung, Tianhong Catherine Yu, Ruidong Zhang, Ian Oakley, Cheng Zhang

TL;DR

WatchHand is the first continuous 3D hand pose tracking system implemented on off-the-shelf smartwatches using only their built-in speaker and microphone. By eliminating additional hardware, it lowers the barrier to smartwatch-based hand tracking while enabling robust, always-available interactions on millions of existing devices.

Abstract

Tracking hand poses on wrist-wearables enables rich, expressive interactions, yet remains unavailable on commercial smartwatches, as prior implementations rely on external sensors or custom hardware, limiting their real-world applicability. To address this, we present WatchHand, the first continuous 3D hand pose tracking system implemented on off-the-shelf smartwatches using only their built-in speaker and microphone. WatchHand emits inaudible frequency-modulated continuous waves and captures their reflections from the hand. These acoustic signals are processed by a deep-learning model that estimates 3D hand poses for 20 finger joints. We evaluate WatchHand across diverse real-world conditions -- multiple smartwatch models, wearing-hands, body postures, noise conditions, pose-variation protocols -- and achieve a mean per-joint position error of 7.87 mm in cross-session tests with device remounting. Although performance drops for unseen users or gestures, the model adapts effectively with lightweight fine-tuning on small amounts of data. Overall, WatchHand lowers the barrier to smartwatch-based hand tracking by eliminating additional hardware while enabling robust, always-available interactions on millions of existing devices.
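
The headline metric, mean per-joint position error, is the average Euclidean distance between predicted and ground-truth joint positions. The following is a minimal sketch of that computation, assuming each pose is a list of (x, y, z) coordinates in millimetres. The function name `mpjpe` and the sample values are illustrative and are not taken from the paper.

```python
import math


def mpjpe(pred, truth):
    """Mean per-joint position error, in the same units as the inputs.

    pred, truth: equal-length sequences of (x, y, z) joint positions.
    """
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and the same length")
    # Average the per-joint Euclidean distances.
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


# Hypothetical data: 20 joints, every prediction offset by (3, 4, 0) mm.
truth = [(float(i), 0.0, 0.0) for i in range(20)]
pred = [(x + 3.0, y + 4.0, z) for x, y, z in truth]

print(f"{mpjpe(pred, truth):.2f} mm")  # each joint is off by a 3-4-5 offset
```

Because the metric averages over joints, a large error on one fingertip can be masked by accurate estimates elsewhere, which is worth keeping in mind when reading a single summary figure such as 7.87 mm.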

Paper Structure (63 sections, 1 equation, 12 figures, 1 table)

Figures (12)

  • Figure 1: (A) Three different COTS smartwatches we evaluated and the physical locations of their built-in speaker and microphone, and (B) system evaluation setup: A smartwatch is worn on a prop hand while a stepper motor–driven linear stage moves a flat plate back and forth within typical finger movement ranges (10–15 cm). To evaluate different angles, the prop hand was rotated to various orientations (e.g., 0°, ±30°, ±60°, ±90°, and perpendicular) relative to the direction of plate motion.
  • Figure 2: Visual examples of echo profile calibration across three commercial smartwatches. Sliding-window cross-correlation peak correction removes peak misalignment (red triangles), while periodic drift calibration mitigates repeating noise artifacts (red dashed boxes). Hand pose transition timings are marked (yellow triangles). Together, these calibration steps produce cleaner and more temporally stable echo profiles during dynamic hand pose transitions.
  • Figure 3: Captured original and differential echo profiles (A) across different COTS smartwatches (Galaxy, Xiaomi, and Pixel) and (B) at varying angles between the plate's motion and the line toward the hand, during repetitive back-and-forth movements toward and away from the smartwatch.
  • Figure 4: Processed original and differential echo profiles (see the section on acoustic preprocessing) and bandpass-filtered IMU sensor data in the 32–100 Hz range (see the section on IMU preprocessing) captured from a single user using a COTS smartwatch (Galaxy Watch 7) during different hand postures.
  • Figure 5: (A) MediaPipe-based 3D hand joint annotation used as ground truth, showing 20 labeled landmarks from the root to each fingertip. (B) Study setup in which the laptop’s front camera faces the participant’s palm to capture ground truth without optical occlusion during hand pose demonstrations.
  • ...and 7 more figures
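
The captions above refer to original and differential echo profiles obtained by cross-correlation. As a rough illustration of the general idea, the sketch below correlates a simulated received frame with the transmitted chirp and subtracts consecutive profiles. Every parameter here is an assumption made for the example (48 kHz sampling, an 18 to 21 kHz sweep, 480-sample frames, the simulated path delays and gains, and all function names); the paper's actual signal design and its calibration steps are not reproduced.

```python
import math

FS = 48_000               # assumed sample rate (Hz)
F0, F1 = 18_000, 21_000   # assumed sweep band (Hz)
N = 480                   # assumed samples per chirp frame (10 ms)


def chirp(n=N, fs=FS, f0=F0, f1=F1):
    """One linear up-chirp of n samples."""
    rate = (f1 - f0) / (n / fs)  # sweep rate in Hz per second
    out = []
    for i in range(n):
        t = i / fs
        out.append(math.sin(2 * math.pi * (f0 * t + 0.5 * rate * t * t)))
    return out


def simulate_frame(tx, paths):
    """Sum of circularly delayed, scaled copies of tx.

    paths: list of (delay_in_samples, gain) pairs, one per reflection path.
    """
    n = len(tx)
    return [sum(g * tx[(i - d) % n] for d, g in paths) for i in range(n)]


def echo_profile(rx, tx, max_lag):
    """Circular cross-correlation of rx with tx for lags 0..max_lag-1."""
    n = len(tx)
    return [sum(rx[(i + lag) % n] * tx[i] for i in range(n))
            for lag in range(max_lag)]


def differential(prev, curr):
    """Frame-to-frame change; content identical in both frames cancels."""
    return [c - p for p, c in zip(prev, curr)]


def peak_lag(profile):
    return max(range(len(profile)), key=profile.__getitem__)


def lag_to_mm(lag, fs=FS, speed_mm_s=343_000.0):
    """Round-trip lag to one-way distance, assuming sound at 343 m/s."""
    return lag * speed_mm_s / (2 * fs)


tx = chirp()
static = (2, 1.0)  # hypothetical direct speaker-to-microphone path
prof_a = echo_profile(simulate_frame(tx, [static, (20, 0.3)]), tx, 64)
prof_b = echo_profile(simulate_frame(tx, [static, (24, 0.3)]), tx, 64)
diff = differential(prof_a, prof_b)  # non-zero only because the reflector moved

single = echo_profile(simulate_frame(tx, [(20, 0.3)]), tx, 64)
print("reflector lag:", peak_lag(single))  # peak falls at the simulated delay (20)
```

Under these assumptions one lag step corresponds to roughly 3.6 mm of one-way distance, and the differential profile emphasises paths that changed between frames over static ones such as the direct speaker-to-microphone path.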