ThermoHands: A Benchmark for 3D Hand Pose Estimation from Egocentric Thermal Images

Fangqiang Ding, Yunzhou Zhu, Xiangyu Wen, Gaowen Liu, Chris Xiaoxuan Lu

TL;DR

This work introduces ThermoHands, the first benchmark for egocentric 3D hand pose estimation from thermal images. It features a multi-spectral, multi-view dataset with automated MANO-based ground-truth annotations, collected from 28 subjects across diverse scenarios. It also proposes TherFormer, a dual-transformer baseline that combines a mask-guided spatial transformer with a temporal transformer to capture spatio-temporal hand pose cues in thermal imagery. Experiments show an annotation accuracy of about 1 cm and demonstrate TherFormer’s leading performance on thermal data, highlighting thermal imaging’s robustness under challenging lighting and when hands are gloved. The dataset, code, and models are publicly released to support evaluation and further research into thermal-based 3D hand pose estimation, with potential impact on XR, HRI, and related domains.
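
The roughly 1 cm annotation accuracy quoted above is a distance between annotated and reference 3D joints. This summary does not state the exact metric, so the sketch below assumes it is the mean Euclidean distance per joint. The function name and the toy coordinates are hypothetical and are not taken from the paper.

```python
import math

def mean_joint_error(pred, ref):
    """Mean Euclidean distance between corresponding 3D joints.

    The result is in the same unit as the inputs (centimetres here).
    Assumed metric for illustration; the paper may define it differently.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)

# Hypothetical toy joints in centimetres (not data from the benchmark).
annotated = [(0.0, 0.0, 30.0), (1.0, 2.0, 31.0), (2.0, 4.0, 32.0)]
reference = [(0.0, 0.0, 31.0), (1.0, 2.0, 31.0), (2.0, 6.0, 32.0)]

error_cm = mean_joint_error(annotated, reference)
```

Here the three per-joint distances are 1 cm, 0 cm, and 2 cm, so the mean is 1 cm.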

Abstract

Designing egocentric 3D hand pose estimation systems that can perform reliably in complex, real-world scenarios is crucial for downstream applications. Previous approaches using RGB or NIR imagery struggle in challenging conditions: RGB methods are susceptible to lighting variations and obstructions like handwear, while NIR techniques can be disrupted by sunlight or interference from other NIR-equipped devices. To address these limitations, we present ThermoHands, the first benchmark focused on thermal image-based egocentric 3D hand pose estimation, demonstrating the potential of thermal imaging to achieve robust performance under these conditions. The benchmark includes a multi-view and multi-spectral dataset collected from 28 subjects performing hand-object and hand-virtual interactions under diverse scenarios, accurately annotated with 3D hand poses through an automated process. We introduce a new baseline method, TherFormer, utilizing dual transformer modules for effective egocentric 3D hand pose estimation in thermal imagery. Our experimental results highlight TherFormer's leading performance and affirm thermal imaging's effectiveness in enabling robust 3D hand pose estimation in adverse conditions.

Paper Structure (22 sections, 12 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Data capture setup with the customized head-mounted sensor platform (HMSP) and exocentric platform recording multi-view multi-spectral images of two-hand actions performed by participants.
  • Figure 2: Design of the head-mounted sensor platform and sensor alignment.
  • Figure 3: (a) Thermal calibration chessboard consisting of a black base board and multiple removable white cubes. Cooling the base board makes the pattern appear similarly in every modality, which allows automatic corner detection in (b) RGB, (c) NIR, and (d) thermal images.
  • Figure 4: Automatic annotation pipeline of 3D hand pose. We utilize the multi-view RGB and depth images as the input source and retrieve constraint information with off-the-shelf MediaPipe Hands [mediapipe_hands_2020] and SAM [kirillov2023segment]. Various error terms are formulated to optimize the MANO parameters.
  • Figure 5: Overall framework of TherFormer. Backbone features are passed to the mask-guided spatial transformer and the temporal transformer to enhance the spatial representation and temporal interaction. The resulting spatio-temporal embeddings are fed into the pose head, which regresses the 3D hand pose.
  • ...and 4 more figures
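
The Figure 4 caption describes optimizing MANO parameters against error terms built from multi-view cues. The caption does not list those terms. As one plausible example, the sketch below computes a 2D keypoint reprojection error under a pinhole camera model. The intrinsics, joint positions, and detections are hypothetical, and this is not the authors' formulation.

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera coordinates) to pixels."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_error(joints_cam, keypoints_2d, intrinsics):
    """Sum of squared pixel distances between projected 3D joints
    and detected 2D keypoints for a single view (assumed error term)."""
    fx, fy, cx, cy = intrinsics
    total = 0.0
    for joint, (u_det, v_det) in zip(joints_cam, keypoints_2d):
        u, v = project(joint, fx, fy, cx, cy)
        total += (u - u_det) ** 2 + (v - v_det) ** 2
    return total

# Hypothetical values: focal lengths and principal point in pixels,
# joints in metres, detections standing in for a 2D keypoint detector.
intrinsics = (500.0, 500.0, 320.0, 240.0)
joints = [(0.0, 0.0, 0.5), (0.1, 0.0, 0.5)]
detections = [(320.0, 240.0), (423.0, 244.0)]

loss = reprojection_error(joints, detections, intrinsics)
```

In a multi-view setup, a term like this would typically be summed over all calibrated views and combined with other terms before minimizing over the hand model's parameters.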