V-Hands: Touchscreen-based Hand Tracking for Remote Whiteboard Interaction

Xinshuang Liu, Yizhong Zhang, Xin Tong

TL;DR

A deep neural network is developed to identify hands and infer hand joint positions from capacitive frames, and then recover 3D hand poses from the hand-joint positions via a constrained inverse kinematic solver.

Abstract

In whiteboard-based remote communication, the seamless integration of drawn content and hand-screen interactions is essential for an immersive user experience. Previous methods either require bulky device setups for capturing hand gestures or fail to accurately track hand poses from capacitive images. In this paper, we present a real-time method for precisely tracking the 3D poses of both hands from capacitive video frames. To this end, we develop a deep neural network to identify hands and infer hand joint positions from capacitive frames, and then recover 3D hand poses from the hand-joint positions via a constrained inverse kinematic solver. Additionally, we design a device setup for capturing high-quality hand-screen interaction data and use it to obtain a more accurate dataset of synchronized capacitive video and hand poses. Our method improves the accuracy and stability of 3D hand tracking from capacitive frames while maintaining a compact device setup for remote communication. We validate our design choices, show superior performance on 3D hand pose tracking, and demonstrate the effectiveness of our method in whiteboard-based remote communication. Our code, model, and dataset are available at https://V-Hands.github.io.
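
The two-stage, per-frame flow described in the abstract can be sketched roughly as follows. This is a minimal illustration of the data flow only, not the authors' implementation: `estimate_joints`, `solve_ik`, and `track` are hypothetical names, the moving-average state update merely stands in for the learned network, the identity function stands in for the constrained solver, and the count of 21 joints per hand is an assumption based on a common convention rather than a figure stated here.

```python
import numpy as np

NUM_HANDS, NUM_JOINTS = 2, 21  # assumption: 21 joints per hand (not stated in the abstract)

def estimate_joints(frame, state):
    """Hypothetical stand-in for the joint position estimator.

    Maps the current capacitive frame and the previous state latent
    to 3D joint positions and an updated state latent.
    """
    # Placeholder state update: exponential moving average of past frames.
    new_state = 0.9 * state + 0.1 * frame
    # Placeholder prediction: zeros with shape (hands, joints, xyz).
    joints = np.zeros((NUM_HANDS, NUM_JOINTS, 3))
    return joints, new_state

def solve_ik(joints):
    """Hypothetical stand-in for the constrained inverse kinematic solver."""
    # Identity placeholder; a real solver would fit joint angles under constraints.
    return joints.copy()

def track(frames):
    """Process frames in order, carrying the state latent from frame to frame."""
    state = np.zeros_like(frames[0])
    poses = []
    for frame in frames:
        joints, state = estimate_joints(frame, state)
        poses.append(solve_ik(joints))
    return poses, state
```

The point of the sketch is the recurrence: each frame is processed once, and temporal context reaches the estimator only through the state latent passed forward.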

Paper Structure

This paper contains 37 sections, 11 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: An overview of our method. Given the current capacitive frame $I_{t}$ and a state latent $S_{t-1}$ aggregated from previous frames, our joint position estimator infers the 3D joint positions $J_{t}$ of the two hands in the current frame and subsequently updates the state latent to $S_{t}$. Following this, a constrained inverse kinematic (IK) solver is employed to reconstruct the 3D hand pose from the 3D hand joint positions and subsequently transform the 3D hand meshes to the current pose.
  • Figure 2: The network architecture of the joint position estimator.
  • Figure 3: The device setup of our system for capturing ground-truth hand poses and corresponding capacitive images. (a) Our system consists of one capacitive touchscreen and nine RGB cameras, which are synchronized to capture RGB images of hands from different views along with the associated capacitive images. (b) The nine RGB images of two hands on the touchscreen. (c) The capacitive images of the two hands, captured by the touchscreen at the same time instant as the RGB images.
  • Figure 4: Depiction of the hand gestures in our dataset. During predefined gestures, participants perform hand movements within each category as shown in (a). For free hand movement, participants interchange freely among predefined categories and perform spontaneous hand gestures as shown in (b).
  • Figure 5: Visual comparison of the projected joint error $\mathbf{EPE_{xy}}$ between TouchPose and our method across various hand poses (best viewed on screen). By projecting the joints predicted by our method (red) onto the touch image and comparing them to TouchPose (blue), it is evident that the hand joints predicted by our method exhibit superior alignment with the touch image.
  • ...and 5 more figures
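
The projected joint error $\mathbf{EPE_{xy}}$ named in the Figure 5 caption is not defined in this excerpt. One plausible reading, offered here only as an assumption, is a mean end-point error computed after dropping the depth coordinate, so that only the joint positions projected onto the screen plane are compared. The function name `epe_xy` and the choice of the first two coordinates as the screen plane are illustrative.

```python
import numpy as np

def epe_xy(pred, gt):
    """Mean Euclidean distance between predicted and reference joints in the xy plane.

    Assumes the last axis holds (x, y, z) and that x, y span the screen plane.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Drop z so that only the projection onto the screen plane is compared.
    diff = pred[..., :2] - gt[..., :2]
    return float(np.linalg.norm(diff, axis=-1).mean())
```

Under this reading, a joint predicted at the right screen location but the wrong height contributes no error, which matches the caption's focus on alignment with the touch image.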