DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions

Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose, Yusuke Sekikawa, Nakamasa Inoue

TL;DR

DF-Mamba introduces Deformable State-Space Modeling (DSSM) to extend Mamba backbones with deformable, anchor-based sampling for robust 3D hand pose estimation under occlusion. The backbone combines convolutional, deformable state-space, and gated convolution blocks in a tribrid design to efficiently capture local features and global context, achieving state-of-the-art results across five datasets with competitive speed. Extensive ablations show that the deformable scan is essential and that the $K=3^2$ setting provides the best balance between flexibility and performance. The approach generalizes well to single- and two-hand, RGB- and depth-based tasks, including challenging hand-object and egocentric scenarios, suggesting broad applicability and potential for pre-training and extension to broader 3D pose estimation tasks.

Abstract

Modeling daily hand interactions is often hampered by severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and the inductive bias of such CNNs may not be optimal for 3D HPE due to their limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with inference speed comparable to that of ResNet-50. Our experiments involve extensive evaluations on five diverse datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.

Paper Structure

This paper contains 54 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Deformable scan for DF-Mamba. (a) Conventional sweep scan uses a fixed grid pattern in the state space equations. (b) Our deformable scan adaptively adjusts the scanning pattern with multiple anchors $\bm{a_{k}}$ by predicting offset vectors $\bm{o_{k,t}}$ dependent on input visual features.
  • Figure 2: Computational flow of SSM and DSSM. (a) As described in the preliminaries section, conventional SSM utilizes four matrices $\bm{\bar{A}}, \bm{\bar{B}}, \bm{C}, \bm{D}$ to compute the output $\bm{y}$ from an input $\bm{x}$ through an intermediate representation $\bm{h}$. (b) Our DSSM incorporates weights $\bm{b_{k}}$ and offsets $\bm{o_{k}}$ for deformable scan into SSM.
  • Figure 3: Block architectures. (a) Vanilla Mamba block using SSM (Gu and Dao, 2024). (b) Gated convolution block that omits the SSM layer from the Mamba block (e.g., MambaOut; Yu and Wang, 2025). (c) VSS block used in VMamba (Liu et al., 2024). (d) Our DF-Mamba block that replaces the SSM layer with the DSSM layer in the Mamba block. (e) General architecture inspired by the transformer architecture (Vaswani et al., 2017). Each subblock highlighted with a gray background acts as a token mixer.
  • Figure 4: DF-Mamba backbone architecture. By combining three types of blocks, DF-Mamba improves the accuracy of 3D HPE while maintaining computational complexity comparable to or even lower than that of ResNet-50.
  • Figure 5: Qualitative examples. Predicted hand joints are color-coded by finger, with ground truth shown in black. The 3D visualizations provide rotated views, where several joint errors are highlighted using colored ellipses. The top two rows show examples from InterHand2.6M, and the bottom two rows from AssemblyHands.
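
As a rough illustration of the conventional SSM flow named in the Figure 2(a) caption, the sketch below runs the standard discrete recurrence $\bm{h}_t = \bm{\bar{A}}\bm{h}_{t-1} + \bm{\bar{B}}x_t$, $y_t = \bm{C}\bm{h}_t + \bm{D}x_t$ over a 1-D sequence. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `ssm_scan` is hypothetical, the matrices are fixed rather than input-dependent as in Mamba's selective variant, and the deformable sampling with $\bm{b_{k}}$ and $\bm{o_{k}}$ from Figure 2(b) is not modeled because the captions do not specify it.

```python
import numpy as np


def ssm_scan(x, A_bar, B_bar, C, D):
    """Hypothetical helper: scan a scalar sequence with a linear SSM.

    Assumes fixed (non-selective) parameters:
      A_bar: (N, N) discretized state matrix
      B_bar: (N,)   discretized input vector
      C:     (N,)   output vector
      D:     scalar skip weight
    """
    h = np.zeros(A_bar.shape[0])      # intermediate representation h
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A_bar @ h + B_bar * x_t   # state update
        y[t] = C @ h + D * x_t        # readout plus skip connection
    return y


# Impulse input: the state decays by a factor of 0.5 per step.
x = np.array([1.0, 0.0, 0.0])
A_bar = 0.5 * np.eye(2)
B_bar = np.array([1.0, 1.0])
C = np.array([1.0, 0.0])
y = ssm_scan(x, A_bar, B_bar, C, D=0.0)
```

The scan visits tokens in a fixed order here, which corresponds to the sweep pattern of Figure 1(a); the deformable scan of Figure 1(b) instead adjusts where features are sampled based on predicted offsets.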