Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction

Jianping Jiang; Xinyu Zhou; Bingxuan Wang; Xiaoming Deng; Chao Xu; Boxin Shi

TL;DR

The paper addresses robust 3D hand mesh reconstruction under challenging illumination and motion by fusing asynchronous event streams with RGB frames. It introduces EvRGBHand, a transformer-based framework with EvImHandNet for spatial alignment, complementary fusion, and temporal attention, and EvRGBDegrader for challenging-scene generalization. The approach leverages MANO hand models and cross-modal supervision, trained on real EvRealHands and synthetic InterHand2.6M-derived data, and achieves superior accuracy and efficiency compared with RGB-only, event-only, and naive fusion baselines. The work demonstrates strong indoor-to-outdoor generalization, cross-camera adaptability to other event cameras, and notable reductions in computational cost, highlighting the practical viability of multi-modal hand tracking in varied real-world settings.

Abstract

Reliable hand mesh reconstruction (HMR) from commonly used color and depth sensors is challenging, especially in scenarios with varied illumination and fast motion. An event camera is a highly promising alternative thanks to its high dynamic range and dense temporal resolution, but it lacks the key texture appearance needed for hand mesh reconstruction. In this paper, we propose EvRGBHand -- the first approach for 3D hand mesh reconstruction with an event camera and an RGB camera compensating for each other. By fusing the two modalities across time, space, and information dimensions, EvRGBHand can tackle the overexposure and motion blur issues in RGB-based HMR and the foreground scarcity and background overflow issues in event-based HMR. We further propose EvRGBDegrader, which allows our model to generalize effectively to challenging scenes even when trained solely on standard scenes, thus reducing data acquisition costs. Experiments on real-world data demonstrate that EvRGBHand can effectively solve the challenging issues that arise when using either type of camera alone by retaining the merits of both, and that it shows potential for generalization to outdoor scenes and to another type of event camera.

Paper Structure (43 sections, 11 equations, 15 figures, 4 tables)

Introduction
Related work
RGB-based HMR
Event-based HMR
Event-image Fusion
Method
Preliminaries
EvImHandNet
Spatial alignment.
Complementary fusion.
Temporal attention.
EvRGBDegrader
Training
Datasets and metrics
Real-world data
...and 28 more sections

Figures (15)

Figure 1: Due to the differences in RGB camera and event camera imaging mechanisms, it is promising to make complementary use of both modalities of data to achieve robust hand mesh reconstruction and tackle their respective challenging issues listed at the top. The arrows between the first and second rows point to the compensated data domain using the data from their tails.
Figure 2: Overview of our pipeline. During training, we first generate various challenging-scene data from normal-scene sequences via EvRGBDegrader. We then spatially align the event and image features using the Deformable module with temporal motion clues. Once aligned, the features are fed to the complementary fusion module (detailed architecture in Figure 3) for scene-aware fusion, and then to the transformer encoder, which learns non-local correlations and maps them to the latent hand space. We then apply temporal attention to the context hand features to leverage the spatial-temporal consistency of hand motions. Finally, the mesh decoder maps the hand features to the 3D coordinates of hand vertices and joints. During evaluation, EvRGBDegrader is deactivated.
Figure 3: Detailed architecture of complementary fusion module.
Figure 4: Visualization of train-evaluation gap and EvRGBDegrader. For each triplet from left to right, we show original data, degraded data, real data with challenging issues.
Figure 5: t-SNE visualization of event and image descriptor vectors. Each descriptor vector has four dimensions: image sharpness, image brightness, and the means of positive-polarity and negative-polarity events.
...and 10 more figures
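
The four-dimensional descriptor named in the Figure 5 caption can be illustrated with a minimal sketch. The caption lists only the four quantities, so everything else here is an assumption: the function names `laplacian_variance` and `scene_descriptor` are hypothetical, sharpness is taken to be the variance of a 4-neighbour Laplacian (a common proxy, not necessarily the authors' measure), brightness is the mean intensity, and the event terms are per-pixel means of positive and negative event-count maps.

```python
import numpy as np


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; assumed sharpness proxy."""
    g = gray.astype(np.float64)
    # Discrete Laplacian on interior pixels (no padding).
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(lap.var())


def scene_descriptor(gray, pos_events, neg_events):
    """Return [sharpness, brightness, mean positive events, mean negative events].

    gray:       2D grayscale image (H x W).
    pos_events: 2D map of positive-polarity event counts (assumed layout).
    neg_events: 2D map of negative-polarity event counts (assumed layout).
    """
    return np.array([
        laplacian_variance(gray),
        float(np.mean(gray)),
        float(np.mean(pos_events)),
        float(np.mean(neg_events)),
    ])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gray = rng.uniform(0.0, 1.0, size=(64, 64))   # synthetic image
    pos = rng.poisson(0.2, size=(64, 64))          # synthetic event counts
    neg = rng.poisson(0.1, size=(64, 64))
    print(scene_descriptor(gray, pos, neg).shape)
```

Descriptors computed this way for many frames could then be stacked into an array and passed to any t-SNE implementation to produce a plot of the kind the caption describes.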
