OmniHands: Towards Robust 4D Hand Mesh Recovery via A Versatile Transformer

Dixuan Lin, Yuxiang Zhang, Mengcheng Li, Yebin Liu, Wei Jing, Qi Yan, Qianying Wang, Hongwen Zhang

TL;DR

OmniHands tackles robust 4D hand mesh recovery across diverse input forms by introducing Relation-aware Two-Hand Tokenization (RAT), which encodes relative hand relationships into hand tokens, and 4D Interaction Reasoning (FIR), a 4D context-aware reasoning module that fuses spatial and temporal information to decode 3D MANO hand meshes and relative movements. The method unifies single-hand and two-hand inputs, as well as monocular, temporal, and multi-view data, through a transformer-based architecture with cross-hand tokenization and 4D interaction modeling. Comprehensive experiments on mixed datasets and in-the-wild scenarios demonstrate state-of-the-art accuracy and temporal stability, with ablations validating the contribution of RAT and FIR. While computationally intensive, OmniHands offers a versatile, calibration-free solution for interactive hand reconstruction applicable to AR/VR, HCI, and embodied AI contexts.

Abstract

In this paper, we introduce OmniHands, a universal approach to recovering interactive hand meshes and their relative movement from monocular or multi-view inputs. Our approach addresses two major limitations of previous methods: the lack of a unified solution for handling various hand image inputs, and the neglect of the positional relationship between two hands within images. To overcome these challenges, we develop a universal architecture with novel tokenization and contextual feature fusion strategies, capable of adapting to a variety of tasks. Specifically, we propose a Relation-aware Two-Hand Tokenization (RAT) method to embed positional relation information into the hand tokens. In this way, our network can handle both single-hand and two-hand inputs and explicitly leverage relative hand positions, facilitating the reconstruction of intricate hand interactions in real-world scenarios. As such tokenization indicates the relative relationship of two hands, it also supports more effective feature fusion. To this end, we further develop a 4D Interaction Reasoning (FIR) module to fuse hand tokens in 4D with attention and decode them into 3D hand meshes and relative temporal movements. The efficacy of our approach is validated on several benchmark datasets. The results on in-the-wild videos and real-world scenarios demonstrate the superior performance of our approach for interactive hand reconstruction. More video results can be found on the project page: https://OmniHand.github.io.
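
As a rough illustration of what "positional relation information" between two hands could look like, the sketch below encodes one hand's bounding box relative to the other's. It is only an illustrative assumption, not the RAT formulation from the paper: the function name `relative_box_encoding`, the (center, size) box format, the offset and log-ratio features, and the all-zero fallback for single-hand input are hypothetical choices made for this example.

```python
import math
from typing import Optional, Tuple

# Hypothetical box format: (center_x, center_y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def relative_box_encoding(
    box_a: Box, box_b: Optional[Box]
) -> Tuple[float, float, float, float]:
    """Describe box_b relative to box_a, normalized by box_a's size.

    Returns (dx, dy, log_w_ratio, log_h_ratio). If the other hand is
    absent (box_b is None), returns zeros; this fallback is an assumption
    for the sketch, not something stated in the source.
    """
    if box_b is None:
        return (0.0, 0.0, 0.0, 0.0)

    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    if min(aw, ah, bw, bh) <= 0:
        raise ValueError("box width and height must be positive")

    # Center offset of hand B, measured in units of hand A's box size.
    dx = (bx - ax) / aw
    dy = (by - ay) / ah
    # Log size ratios, so equal-sized boxes map to 0.
    log_w_ratio = math.log(bw / aw)
    log_h_ratio = math.log(bh / ah)
    return (dx, dy, log_w_ratio, log_h_ratio)


# Example: the second box sits one box-width to the right and is twice as wide.
left_hand: Box = (100.0, 200.0, 50.0, 50.0)
right_hand: Box = (150.0, 200.0, 100.0, 50.0)
encoding = relative_box_encoding(left_hand, right_hand)
single_hand_encoding = relative_box_encoding(left_hand, None)
```

In a tokenization scheme of this kind, such a vector would typically be projected and combined with each hand's image features, so that downstream attention layers can see where the other hand is; how OmniHands actually does this is described in the paper itself.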

Paper Structure (31 sections, 17 equations, 7 figures, 7 tables)

This paper contains 31 sections, 17 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The proposed method, OmniHands, can robustly recover interactive hand meshes and their relative movement from monocular inputs.
  • Figure 2: Overview of our OmniHands framework. OmniHands is a transformer-based network that takes various forms of input and estimates two-hand meshes with their relative positions. General modules are used to process the different forms of input data. Since the hand tokens of both hands undergo the same processing in FIR, the hand tokens $G^*, L^*$ in the diagram represent those of one hand for simplicity. The top left corner shows our training set, indicating the data volume of each dataset and whether it includes Interaction, Time sequences, Multi-view, and Real-World data.
  • Figure 3: Qualitative Comparison. We compare OmniHands with state-of-the-art methods on the in-the-wild datasets ARCTIC and RenderIH. ARCTIC provides images of two hands interacting with objects, and RenderIH provides interacting hands under different lighting conditions. Results are shown from both the camera perspective and an arbitrary perspective. They show that our method outperforms the state-of-the-art methods in complex real-world environments.
  • Figure 4: Qualitative Results on in-the-wild hard cases. We show our model's results on complex cross-hand interactions in realistic scenarios to demonstrate its performance. Each result is shown from multiple perspectives for closer inspection.
  • Figure 5: Qualitative results of OmniHands with multi-view inputs on InterHand2.6M and ARCTIC. In each row, we present pairs of input and output images from 4 different camera views. The images on the left side of each pair serve as the multi-view sequence input.
  • ...and 2 more figures