Multi-view Hand Reconstruction with a Point-Embedded Transformer

Lixin Yang, Licheng Zhong, Pengxiang Zhu, Xinyu Zhan, Junxiao Kong, Jian Xu, Cewu Lu

TL;DR

This paper tackles robust 3D hand mesh reconstruction from multi-view RGB inputs by introducing POEM, a generalizable framework that represents the hand with a fixed Basis Points Set embedded in the multi-view space and refines root-relative points via a Point-Embedded Transformer. The core ideas are: (i) embedding a static 3D basis-point cloud within the intersection of camera frustums to enable camera-configuration-invariant fusion, and (ii) a two-stage architecture that first estimates the absolute root via triangulation and then predicts root-relative hand geometry using a dedicated transformer conditioned on cross-view basis features. Key contributions include the Point-Embedded Intersection Sphere, Projective Aggregation for cross-view fusion, learnable query point representations, and the POEM_v2 generalizable training regime across five large multi-view datasets with randomized camera configurations. The proposed method demonstrates strong generalization to diverse real-world settings, supports both left and right hands, and achieves competitive or superior accuracy with efficient inference, offering a practical plug-and-play solution for multi-view hand motion capture. This work advances multi-view HMR by decoupling camera extrinsics from learning through a fixed, spatially informed basis-point representation and a specialized Transformer, enabling scalable, dataset-agnostic hand reconstruction for applications such as teleoperation and AR/VR interactions.
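The first stage described above recovers the absolute root position by triangulation. The following is a minimal sketch of the standard linear (DLT) triangulation of one 3D point from its 2D projections, assuming calibrated pinhole cameras; it is not taken from the POEM code base, and the function names, toy intrinsics, and camera poses are illustrative assumptions.

```python
import numpy as np

def make_projection(K, R, t):
    """Build a 3x4 projection matrix P = K [R | t] (world -> pixel)."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(P, X):
    """Project a 3D point X (3,) to pixel coordinates (2,)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate_dlt(proj_mats, points_2d):
    """Linear triangulation of a single point.

    proj_mats: (V, 3, 4) projection matrices, one per view.
    points_2d: (V, 2) observed pixel coordinates of the same point.
    Returns the least-squares 3D point (3,).
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Toy setup (assumed values): two cameras with shared intrinsics,
# the second shifted 0.2 m along the x-axis.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = make_projection(K, np.eye(3), np.zeros(3))
P2 = make_projection(K, np.eye(3), np.array([-0.2, 0.0, 0.0]))

root_gt = np.array([0.05, -0.02, 0.6])          # hypothetical hand root (metres)
obs = np.stack([project(P1, root_gt), project(P2, root_gt)])
root_est = triangulate_dlt(np.stack([P1, P2]), obs)
```

With noise-free observations the estimate matches the input point up to numerical precision; with real 2D detections the same solve returns a least-squares compromise across views.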

Abstract

This work introduces a novel and generalizable multi-view Hand Mesh Reconstruction (HMR) model, named POEM, designed for practical use in real-world hand motion capture scenarios. POEM advances on two fronts. First, concerning the modeling of the problem, we propose embedding a static set of basis points within the multi-view stereo space. A point represents a natural form of 3D information and serves as an ideal medium for fusing features across different views, given its varied projections across these views. Consequently, our method harnesses a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D basis points that 1) are embedded in the multi-view stereo space, 2) carry features from the multi-view images, and 3) encompass the hand. The second advance lies in the training strategy. We utilize a combination of five large-scale multi-view datasets and randomize the number, order, and poses of the cameras. By processing such a vast amount of data and a diverse array of camera configurations, our model demonstrates notable generalizability in real-world applications. As a result, POEM presents a highly practical, plug-and-play solution that enables user-friendly, cost-effective multi-view motion capture for both left and right hands. The model and source code are available at https://github.com/JubSteven/POEM-v2.
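The basis-point idea in the abstract can be illustrated with a small sketch: sample 3D points inside a ball, project them into every view, look up image features at the projected pixels, and fuse across views. Everything here is an assumption made for illustration: the ball center and radius are hand-picked rather than derived from the camera frustums, features are random arrays, and plain averaging stands in for the learned Projective Aggregation module described in the paper.

```python
import numpy as np

def sample_points_in_ball(n, center, radius, rng):
    """Uniformly sample n points inside a ball (illustrative basis points)."""
    d = rng.normal(size=(n, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)          # random directions
    r = radius * rng.uniform(size=(n, 1)) ** (1.0 / 3.0)   # uniform in volume
    return center + r * d

def project_points(K, R, t, pts):
    """Pinhole projection of world points (N, 3) to pixels (N, 2)."""
    cam = pts @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def sample_features(feat, uv):
    """Nearest-neighbour lookup in a feature map feat (H, W, C) at pixels uv (N, 2)."""
    h, w, _ = feat.shape
    x = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    return feat[y, x]

def fuse_views(points, cameras, feats):
    """Average per-view point features; a stand-in for a learned fusion module."""
    per_view = [sample_features(f, project_points(K, R, t, points))
                for (K, R, t), f in zip(cameras, feats)]
    return np.mean(per_view, axis=0)                        # (N, C)

rng = np.random.default_rng(0)
center = np.array([0.0, 0.0, 0.6])   # hypothetical sphere center (metres)
radius = 0.2                         # hypothetical sphere radius (metres)
basis = sample_points_in_ball(256, center, radius, rng)

# Two toy cameras looking at 32x32 feature maps with 8 channels (assumed sizes).
K = np.array([[40.0, 0.0, 16.0],
              [0.0, 40.0, 16.0],
              [0.0, 0.0, 1.0]])
cameras = [(K, np.eye(3), np.zeros(3)),
           (K, np.eye(3), np.array([-0.1, 0.0, 0.0]))]
feats = [rng.normal(size=(32, 32, 8)) for _ in cameras]

fused = fuse_views(basis, cameras, feats)
fused_reordered = fuse_views(basis, cameras[::-1], feats[::-1])
```

Because the points live in world space and the fusion is symmetric over views, the fused features do not depend on camera order, and adding or removing a view changes only the number of terms being averaged. That is the property the randomized camera number and order during training rely on.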

Paper Structure (23 sections, 17 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: The architecture of the POEM model. The first stage estimates the root position $\mathbf{R}$ and the second stage reconstructs the query points $\mathbf{X}$.
  • Figure 2: Illustration of the intersection sphere approximation.
  • Figure 3: Architecture of the projective aggregation module.
  • Figure 4: One layer of the Point-Embedded Transformer Decoder.
  • Figure 5: Architecture of the PE-MeshTR and FTL-MeshTR.
  • ...and 4 more figures