Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba

Haoye Dong, Aviral Chharia, Wenbo Gou, Francisco Vicente Carrasco, Fernando De la Torre

TL;DR

Hamba tackles robust single-view 3D hand reconstruction by integrating graph learning with state-space modeling in a graph-guided bidirectional scanning framework. The core innovations include the Graph-guided State Space (GSS) block and a Token Sampler that, together with a graph-guided bidirectional scan (GBS), learn spatial relations among hand joints with far fewer tokens than attention-based methods. Empirical results on FreiHAND, HO3D v2/v3, and in-the-wild benchmarks demonstrate state-of-the-art accuracy (e.g., PA-MPVPE ~5.3 mm and F@15mm ~0.992 on FreiHAND) and strong generalization, with ablations underscoring the value of each component. The approach offers a token-efficient, graph-aware alternative to transformers for high-fidelity 3D hand mesh reconstruction, with potential applicability to related 3D human pose tasks and a plan to release code and models.

Abstract

Reconstructing a 3D hand from a single RGB image is challenging because of articulated motion, self-occlusion, and interaction with objects. Existing SOTA methods employ attention-based transformers to learn 3D hand pose and shape, yet they fall short of robust and accurate performance, primarily because they model the spatial relations between joints inefficiently. To address this problem, we propose a novel graph-guided Mamba framework, named Hamba, which bridges graph learning and state space modeling. Our core idea is to reformulate Mamba's scanning into graph-guided bidirectional scanning for 3D reconstruction using a few effective tokens. This enables us to efficiently learn the spatial relationships between joints and thereby improve reconstruction performance. Specifically, we design a Graph-guided State Space (GSS) block that learns the graph-structured relations and spatial sequences of joints and uses 88.5% fewer tokens than attention-based methods. Additionally, we integrate the state space features and the global features using a fusion module. By utilizing the GSS block and the fusion module, Hamba effectively leverages the graph-guided state space features and jointly considers global and local features to improve performance. Experiments on several benchmarks and in-the-wild tests demonstrate that Hamba significantly outperforms existing SOTAs, achieving a PA-MPVPE of 5.3 mm and an F@15mm of 0.992 on FreiHAND. At the time of this paper's acceptance, Hamba held Rank 1 on two competition leaderboards for 3D hand reconstruction. Project Website: https://humansensinglab.github.io/Hamba/
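
The two FreiHAND metrics quoted above can be illustrated with a short NumPy sketch. It assumes that PA-MPVPE is the mean per-vertex Euclidean error after a similarity (Procrustes) alignment of the prediction to the ground truth, and that F@15mm is the harmonic mean of precision and recall between the two vertex sets at a 15 mm distance threshold. The function names and evaluation details are illustrative assumptions, not the benchmark's official evaluation code.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Similarity-align pred (N, 3) to gt (N, 3) using scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    fix = np.diag([1.0, 1.0, np.sign(np.linalg.det(u @ vt))])
    rot = u @ fix @ vt
    scale = (s * np.diag(fix)).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpvpe(pred, gt):
    """Mean per-vertex error after Procrustes alignment (same unit as the inputs)."""
    aligned = procrustes_align(pred, gt)
    return np.linalg.norm(aligned - gt, axis=1).mean()


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a nearest-neighbour distance threshold."""
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)  # (N_pred, N_gt)
    precision = (dist.min(axis=1) < threshold).mean()  # pred points close to some gt point
    recall = (dist.min(axis=0) < threshold).mean()     # gt points close to some pred point
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With vertices expressed in metres, `f_score(procrustes_align(pred, gt), gt, 0.015)` corresponds to the 15 mm threshold, and `1000 * pa_mpvpe(pred, gt)` gives the error in millimetres.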

Paper Structure

This paper contains 17 sections, 11 equations, 12 figures, 8 tables, and 1 algorithm.

Figures (12)

  • Figure 1: In-the-wild visual results of Hamba. Hamba performs well across varied in-the-wild scenarios, including hands interacting with objects or other hands, different skin tones, different viewing angles, challenging paintings, and vivid animations.
  • Figure 2: Motivation. Visual comparisons of different scanning flows. (a) Attention methods compute the correlation across all patches, which requires a very large number of tokens. (b) Bidirectional scans follow two paths, resulting in lower complexity. (c) The proposed graph-guided bidirectional scan (GBS) achieves effective state space modeling by leveraging graph learning with a few effective tokens (illustrated as scanning by two snakes: forward and backward scanning snakes).
  • Figure 3: Overview of Hamba's architecture. Given a hand image $I$, tokens are extracted via a trainable backbone model and downsampled. We design a graph-guided SSM as a decoder to regress hand parameters. The hand joints ($J_{\text{2D}}$) are regressed by the Joints Regressor (JR) and fed into the Token Sampler (TS) to sample tokens ($T_{\text{TS}}$). The joint spatial sequence tokens ($T_{\text{GSS}}$) are learned by the Graph-guided State Space (GSS) blocks. Inside each GSS block, the GCN takes $T_{\text{TS}}$ as input, and its output is concatenated with the mean down-sampled tokens. GSS leverages graph learning and state space modeling to capture the joint spatial relations and achieve robust 3D reconstruction.
  • Figure 4: The illustration of the proposed Graph-guided State Space (GSS) block.
  • Figure 5: Qualitative in-the-wild comparison of the proposed Hamba with SOTAs on HInt-EpicKitchensVISOR [damen2018scaling, pavlakos2024reconstructing]. None of the models (including Hamba) have been trained on HInt.
  • ...and 7 more figures
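
The data flow described in the captions of Figures 2 and 3 (Token Sampler, GCN, concatenation with a mean token, bidirectional scan) can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the 21-joint skeleton layout, bilinear sampling in the Token Sampler, the single normalized graph-convolution layer, and concatenation along the token axis are all guesses, every function name is hypothetical, and a fixed-decay linear recurrence stands in for Mamba's selective state space scan, whose parameters are input-dependent.

```python
import numpy as np

NUM_JOINTS = 21  # assumption: wrist plus 4 joints for each of 5 fingers


def hand_adjacency():
    """Adjacency of the assumed skeleton: joint 0 is the wrist, each finger is a chain."""
    adj = np.zeros((NUM_JOINTS, NUM_JOINTS))
    for finger in range(5):
        chain = [0] + [1 + 4 * finger + k for k in range(4)]
        for i, j in zip(chain[:-1], chain[1:]):
            adj[i, j] = adj[j, i] = 1.0
    return adj


def sample_tokens(feature_map, joints_2d):
    """Bilinearly sample one token per joint from an (H, W, C) map; joints are (x, y) in [0, 1]."""
    h, w, _ = feature_map.shape
    x = joints_2d[:, 0] * (w - 1)
    y = joints_2d[:, 1] * (h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    wx, wy = (x - x0)[:, None], (y - y0)[:, None]
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x0 + 1]
    bottom = (1 - wx) * feature_map[y0 + 1, x0] + wx * feature_map[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom


def gcn_layer(tokens, adj, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ tokens @ weight, 0.0)


def linear_scan(tokens, decay):
    """Stand-in recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t along the token axis."""
    state = np.zeros(tokens.shape[1])
    out = np.empty_like(tokens)
    for t, x in enumerate(tokens):
        state = decay * state + (1.0 - decay) * x
        out[t] = state
    return out


def bidirectional_scan(tokens, decay=0.5):
    """Sum of a forward scan and a backward scan over the same token sequence."""
    forward = linear_scan(tokens, decay)
    backward = linear_scan(tokens[::-1], decay)[::-1]
    return forward + backward


# Toy run with arbitrary sizes and random stand-ins for the backbone and JR outputs.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 6, 16))                 # (H, W, C) down-sampled tokens
joints_2d = rng.uniform(size=(NUM_JOINTS, 2))             # stand-in for J_2D
t_ts = sample_tokens(feature_map, joints_2d)              # one token per joint
t_graph = gcn_layer(t_ts, hand_adjacency(), rng.normal(size=(16, 16)))
mean_token = feature_map.reshape(-1, 16).mean(axis=0, keepdims=True)
t_gss = bidirectional_scan(np.concatenate([t_graph, mean_token], axis=0))
```

In this sketch the scan runs over one token per joint plus one mean token, rather than over every image patch, which is the sense in which a joint-guided scan needs only a few tokens.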