Pyramid Deep Fusion Network for Two-Hand Reconstruction from RGB-D Images

Jinwei Ren, Jianke Zhu

TL;DR

The paper addresses the challenge of recovering dense, real-world-scale two-hand meshes from RGB-D data, proposing an end-to-end framework that fuses RGB features from ResNet50 with depth-derived point-cloud features via a novel Pyramid Deep Fusion Network (PDFNet). The fused representation is decoded by a Chebyshev spectral Graph Convolutional Network to produce dense hand meshes in camera space, with center-map supervision and MANO-consistent outputs. Extensive ablations validate PDFNet's multi-scale fusion and the importance of adaptive feature transformation, center information, and robust depth integration, achieving state-of-the-art results on public two-hand datasets. This approach promises enhanced accuracy for AR/VR and HCI applications and provides a foundation for integrating temporal and multi-view cues in future work.
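The summary above does not spell out the decoder's layers, so the following is only a minimal sketch of a generic Chebyshev spectral graph filter, the kind of operation such a decoder builds on. It uses the standard recurrence T_0 = X, T_1 = L~X, T_k = 2 L~ T_{k-1} - T_{k-2}. The function names, the toy two-vertex graph, and the weights are all illustrative assumptions, not taken from the paper or its code.

```python
# Sketch of a Chebyshev spectral graph convolution in plain Python.
# All names, the toy graph, and the weights are hypothetical illustrations.

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def combine(a, b, alpha=1.0, beta=1.0):
    """Return alpha * a + beta * b element-wise."""
    return [[alpha * p + beta * q for p, q in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def cheb_graph_conv(x, l_scaled, weights):
    """Apply sum_k T_k(l_scaled) @ x @ weights[k].

    x        : N x F_in vertex features
    l_scaled : N x N scaled Laplacian, 2 L / lambda_max - I
    weights  : list of K matrices, each F_in x F_out
    """
    t_prev = x                                  # T_0 term
    out = matmul(t_prev, weights[0])
    if len(weights) > 1:
        t_curr = matmul(l_scaled, x)            # T_1 term
        out = combine(out, matmul(t_curr, weights[1]))
        for w in weights[2:]:
            # Chebyshev recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}
            t_next = combine(matmul(l_scaled, t_curr), t_prev, 2.0, -1.0)
            out = combine(out, matmul(t_next, w))
            t_prev, t_curr = t_curr, t_next
    return out

# Toy graph: two vertices joined by one edge. Its normalized Laplacian is
# [[1, -1], [-1, 1]] with lambda_max = 2, so the scaled Laplacian is L - I.
l_scaled = [[0.0, -1.0], [-1.0, 0.0]]
x = [[1.0], [2.0]]                              # one feature per vertex
weights = [[[1.0]], [[1.0]], [[1.0]]]           # K = 3, F_in = F_out = 1
y = cheb_graph_conv(x, l_scaled, weights)
```

With K filters, each output vertex aggregates information from vertices up to K-1 edges away, which is why stacking such layers suits coarse-to-fine mesh decoding.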

Abstract

Accurately recovering dense 3D meshes of both hands from monocular images is challenging because of occlusions and projection ambiguity. Most existing methods extract features from color images to estimate root-aligned hand meshes, neglecting the crucial depth and scale information of the real world. Because sensor measurements are noisy and limited in resolution, depth-based methods predict 3D keypoints rather than a dense mesh. These limitations motivate us to exploit the two complementary inputs to acquire dense hand meshes at real-world scale. In this work, we propose an end-to-end framework for recovering dense meshes of both hands that takes single-view RGB-D image pairs as input. The primary challenge lies in effectively using the two input modalities to mitigate the blurring effects in RGB images and the noise in depth images. Instead of directly treating depth maps as additional channels of the RGB images, we encode the depth information as an unordered point cloud to preserve more geometric detail. Specifically, our framework employs ResNet50 and PointNet++ to derive features from the RGB image and the point cloud, respectively. Additionally, we introduce a novel pyramid deep fusion network (PDFNet) to aggregate features at different scales, which proves more effective than previous fusion strategies. Furthermore, we employ a GCN-based decoder to process the fused features and recover the corresponding 3D pose and dense mesh. Through comprehensive ablation experiments, we demonstrate the effectiveness of the proposed fusion algorithm and outperform state-of-the-art approaches on publicly available datasets. To reproduce the results, we will make our source code and models publicly available at https://github.com/zijinxuxu/PDFNet.
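To make the idea of encoding depth as an unordered point cloud concrete, here is a minimal sketch of back-projecting a depth map through a pinhole camera model. The abstract does not describe the authors' preprocessing, so everything here is an assumption: the function name, the intrinsics `fx`, `fy`, `cx`, `cy`, the millimetre depth unit, and the rule that zero depth means a missing measurement.

```python
# Sketch: back-project a depth map into camera-space 3D points.
# Assumptions (not from the paper): pinhole intrinsics, raw depth in
# millimetres, and zero depth marking an invalid pixel.
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def depth_to_point_cloud(depth: Sequence[Sequence[float]],
                         fx: float, fy: float, cx: float, cy: float,
                         depth_scale: float = 0.001) -> List[Point]:
    """Return one (x, y, z) point in metres per valid depth pixel."""
    points: List[Point] = []
    for v, row in enumerate(depth):          # v indexes image rows
        for u, raw in enumerate(row):        # u indexes image columns
            if raw <= 0:
                continue                     # skip missing measurements
            z = raw * depth_scale            # raw sensor units -> metres
            x = (u - cx) * z / fx            # inverse pinhole projection
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Tiny 2 x 3 depth map with two invalid pixels in the first row.
depth = [
    [0, 500, 0],
    [500, 1000, 500],
]
cloud = depth_to_point_cloud(depth, fx=100.0, fy=100.0, cx=1.0, cy=0.5)
```

The resulting list has no grid structure, so a point-based encoder such as PointNet++ can consume metric coordinates directly, whereas stacking depth as a fourth image channel would leave the geometry implicit.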

Paper Structure (17 sections, 9 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Comparison between the Ours-full and Ours-RGB methods. Although the two results look very similar from the original projection view, the latter shows a large misalignment along the depth direction when viewed from a new perspective.
  • Figure 2: Overview of the proposed framework. Given an RGB-D image, we adopt ResNet50 He2016DeepRL and PointNet++ Qi2017PointNetDH as backbones to extract features (Section \ref{sec:encoder}) and decode the RGB features into center maps and masks using two simple decoders. The deep fusion module (Section \ref{sec:pdfnet}) fuses the RGB features with the point features. The GCN-based decoder (Section \ref{sec:decoder}) takes the fused global feature and outputs dense meshes of both hands in a coarse-to-fine manner. The whole pipeline is trained end to end.
  • Figure 3: Details of our proposed Pyramid Deep Fusion Network (PDFNet).
  • Figure 4: Visual comparison on the H2O Hampali2021KeypointTS dataset. We compare our results with DenseFusion Wang2019DenseFusion6O and IntagHand+D Li2022InteractingAG; ours are significantly better in hand-to-hand and hand-to-image alignment. We place the predicted mesh and the ground truth in the same coordinate system and color the predicted left and right hands red and green, respectively. The side view shows that an incorrect root-node depth prediction can lead to significant misalignment.
  • Figure 5: Visualization results on the H2O-3D Hampali2021KeypointTS test set. We compare our results with IntagHand+D Li2022InteractingAG and DenseFusion Wang2019DenseFusion6O; ours are significantly better in hand-to-image alignment.
  • ...and 1 more figures