AirGlove: Exploring Egocentric 3D Hand Tracking and Appearance Generalization for Sensing Gloves

Wenhui Cui, Ziyi Kou, Chuan Qin, Ergys Ristani, Li Guan

TL;DR

The paper addresses the drop in accuracy that vision-based hand tracking suffers when moving from bare hands to sensing gloves, which it attributes to the appearance gap between the two. It introduces AirGlove, a framework that combines a Temporal-Aware Deep Visual Network with an Adversarial Appearance-Invariant Discriminator to learn appearance-agnostic glove representations, trained with an energy-based adversarial objective and alternating optimization. Using a dataset that spans multiple sensing gloves, the authors show that bare-hand models degrade substantially on gloves and that AirGlove generalizes to unseen glove designs, with significant performance gains, especially in low-data regimes. The approach offers a practical path to robust glove tracking without extensive glove-specific annotations, with strong implications for teleoperation and glove-enabled robotics.

Abstract

Sensing gloves have become important tools for teleoperation and robotic policy learning because they provide rich signals such as speed, acceleration, and tactile feedback. A common approach to tracking gloved hands is to use the sensor signals directly (e.g., angular velocity, gravity orientation) to estimate 3D hand poses. However, sensor-based tracking can be restrictive in practice, as its accuracy is often affected by sensor signal and calibration quality. Recent vision-based approaches have achieved strong performance on human hands via large-scale pre-training, but their performance on gloved hands with distinct visual appearances remains underexplored. In this work, we present the first systematic evaluation of vision-based hand tracking models on gloved hands under both zero-shot and fine-tuning setups. Our analysis shows that existing bare-hand models suffer substantial performance degradation on sensing gloves due to the large appearance gap between bare hands and glove designs. We therefore propose AirGlove, which leverages existing gloves to generalize the learned glove representations to new gloves with limited data. Experiments with multiple sensing gloves show that AirGlove effectively generalizes hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.
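
To make the evaluation setups in the abstract concrete, here is a minimal sketch of how data might be split when one glove design is treated as new. It is only an illustration under stated assumptions: the function `split_for_new_glove`, the `finetune_fraction` parameter, and the glove names are hypothetical and do not come from the paper, whose actual protocol may differ.

```python
import random
from collections import defaultdict


def split_for_new_glove(samples, target_glove, finetune_fraction=0.0, seed=0):
    """Hold out one glove design as the 'new' glove (hypothetical helper).

    samples: list of (glove_name, clip_id) pairs.
    finetune_fraction: share of the new glove's clips released for training;
        0.0 means the new glove is never seen during training.
    Returns (train, test), where test holds only clips of the new glove.
    """
    by_glove = defaultdict(list)
    for glove, clip in samples:
        by_glove[glove].append((glove, clip))
    if target_glove not in by_glove:
        raise ValueError(f"unknown glove: {target_glove}")

    # All clips from the existing (source) gloves are available for training.
    train = [s for g, items in by_glove.items() if g != target_glove for s in items]

    # Shuffle the new glove's clips reproducibly, then release a small share.
    target = list(by_glove[target_glove])
    random.Random(seed).shuffle(target)
    n_finetune = int(len(target) * finetune_fraction)
    train += target[:n_finetune]
    test = target[n_finetune:]
    return train, test


# Toy data: three assumed glove designs with ten clips each.
samples = [(g, i) for g in ("glove_a", "glove_b", "glove_c") for i in range(10)]
zero_train, zero_test = split_for_new_glove(samples, "glove_c")
few_train, few_test = split_for_new_glove(samples, "glove_c", finetune_fraction=0.2)
```

With `finetune_fraction=0.0` the held-out glove contributes nothing to training, which approximates an unseen-design setting; a small positive value approximates the limited-data fine-tuning setting.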

Paper Structure (8 sections, 3 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of Our Study. (Left) We quantitatively explore the potential degradation of vision-based hand tracking models on sensing gloves. (Right) We propose an appearance-invariant representation learning framework for glove generalization, which leverages adversarial learning on existing sensing glove data to enhance gloved hand tracking performance on unseen glove designs.
  • Figure 2: Overview of AirGlove. The temporal-aware encoder extracts visual representations from egocentric videos, followed by the 3D pose decoder for pose estimation. The adversarial appearance discriminator iteratively regulates the glove representations to derive appearance-invariant features.
  • Figure 3: Evaluation of AirGlove on sensing glove datasets. AirGlove achieves superior tracking performance compared to all baselines across different sensing gloves (row-wise) and evaluation metrics (column-wise). Best viewed in color.
  • Figure 4: Visualization for AirGlove. Compared to the baseline MEgATrack (red), AirGlove (yellow) generates predictions that are better aligned with the ground truth (green).
  • Figure 5: Glove classification results and t-SNE visualization of glove representations. When trained with $\mathcal{L}_\mathrm{adv}$ rather than without it, the model learns features that do not distinguish glove appearances, yielding pose representations without appearance bias.
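
The interplay between the pose decoder and the appearance discriminator in Figures 2 and 5 can be sketched with two opposing losses. This is a simplified stand-in, not the paper's formulation: it uses an ordinary cross-entropy loss for the discriminator and a negative-entropy "confusion" term for the encoder instead of the energy-based objective, whose exact form is not given here. The names `discriminator_loss`, `encoder_confusion_loss`, `encoder_loss`, `alternating_phases`, and the weight `adv_weight` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_loss(logits, glove_id):
    """Cross-entropy: the discriminator tries to name the glove design."""
    return -math.log(softmax(logits)[glove_id])


def encoder_confusion_loss(logits):
    """Negative entropy of the discriminator's prediction.

    It is lowest when the prediction is uniform, i.e. when the features
    carry no usable cue about which glove is being worn.
    """
    probs = softmax(logits)
    return sum(p * math.log(p) for p in probs if p > 0.0)


def encoder_loss(pose_error, logits, adv_weight=0.1):
    """Pose term plus a weighted adversarial term (the weight is an assumption)."""
    return pose_error + adv_weight * encoder_confusion_loss(logits)


def alternating_phases(num_steps):
    """Alternate which module is updated: discriminator first, then encoder."""
    return ["discriminator" if step % 2 == 0 else "encoder" for step in range(num_steps)]


# Discriminator logits over three assumed glove designs.
confident = [5.0, 0.0, 0.0]  # features still reveal the glove
uniform = [0.0, 0.0, 0.0]    # features reveal nothing about the glove
```

In this sketch the discriminator prefers `confident` logits for the correct glove, while the encoder's adversarial term prefers `uniform` ones; updating the two in alternation is one common way to push features toward appearance invariance.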