BinaryHPE: 3D Human Pose and Shape Estimation via Binarization

Zhiteng Li, Yulun Zhang, Jing Lin, Haotong Qin, Jinjin Gu, Xin Yuan, Linghe Kong, Xiaokang Yang

TL;DR

3D human pose and shape estimation is accurate but resource-intensive. BinaryHPE is a binarized framework built on the BiDRN backbone, whose BiDRB blocks retain as much full-precision information as possible, together with a Binarized BoxNet that predicts face and hand bounding boxes; both reduce memory and compute substantially. The method outperforms existing state-of-the-art binarized approaches and achieves performance comparable to the full-precision Hand4Whole while using only 22.1% of the parameters and 14.8% of the operations. This makes whole-body 3D mesh recovery more practical on resource-limited edge devices, for applications such as AR/VR, sign language, and emotion recognition.

Abstract

3D human pose and shape estimation (HPE) aims to reconstruct the 3D human body, face, and hands from a single image. Although powerful deep learning models have achieved accurate estimation in this task, they require enormous memory and computational resources. Consequently, these methods can hardly be deployed on resource-limited edge devices. In this work, we propose BinaryHPE, a novel binarization method designed to estimate 3D body, face, and hand parameters efficiently. Specifically, we propose a novel binary backbone called Binarized Dual Residual Network (BiDRN), designed to retain as much full-precision information as possible. Furthermore, we propose the Binarized BoxNet, an efficient sub-network for predicting face and hand bounding boxes, which further reduces model redundancy. Comprehensive quantitative and qualitative experiments demonstrate the effectiveness of BinaryHPE, which improves significantly over state-of-the-art binarization algorithms. Moreover, BinaryHPE achieves performance comparable to the full-precision method Hand4Whole while using only 22.1% of the parameters and 14.8% of the operations. We will release all the code and pretrained models.
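
To make the term "binarization" concrete, the following is a minimal sketch of generic 1-bit weight binarization, in the style commonly used by binarized networks: weights are replaced by their signs plus one full-precision scale. It is not the BiDRN implementation, which the abstract does not specify; the function names `binarize`, `binary_dot`, and `storage_bits` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[List[int], float]:
    """Generic sign binarization with one per-tensor scale (an assumption,
    not the paper's exact scheme): w is approximated by alpha * sign(w)."""
    if not weights:
        raise ValueError("weights must be non-empty")
    # alpha is the mean absolute value of the full-precision weights.
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


def binary_dot(x_signs: List[int], w_signs: List[int], alpha: float) -> float:
    """Dot product of two +/-1 vectors, scaled by alpha.

    With +/-1 operands the sum of products equals (matches - mismatches),
    which is what XNOR plus popcount computes on bit-packed hardware.
    """
    if len(x_signs) != len(w_signs):
        raise ValueError("length mismatch")
    matches = sum(1 for a, b in zip(x_signs, w_signs) if a == b)
    return alpha * (2 * matches - len(w_signs))


def storage_bits(num_weights: int, bits_per_weight: int) -> int:
    """Raw storage needed for the weights alone."""
    return num_weights * bits_per_weight


signs, alpha = binarize([0.5, -1.5, 1.0, -1.0])
result = binary_dot([1, 1, -1, -1], signs, alpha)
```

A binarized weight needs 1 bit instead of the 32 bits of a single-precision float. The overall figures quoted in the abstract (22.1% of the parameters, 14.8% of the operations) are the authors' reported totals for the whole model and are not derived from this sketch.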

Paper Structure

This paper contains 11 sections, 12 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Comparison of full-precision Hand4Whole, BNN, and BinaryHPE. The second line is Parameters (M) / Operations (G).
  • Figure 2: Comparison between recent BNNs and BinaryHPE on EHF. BinaryHPE significantly reduces the All MPVPEs (the lower, the better) of BNN [hubara2016binarized], XNOR [rastegari2016xnor], DoReFa [zhou2016dorefa], Bi-Real [liu2018bi], ReActNet [liu2020reactnet], ReCU [xu2021recu], and FDA [xu2021learning] by 53.9, 77.2, 56.1, 43.4, 31.5, 24.4, and 53.5, respectively.
  • Figure 3: The overview pipeline of our binarized 3D human pose and shape estimation method. The body, hand, and face BiDRN serve as encoders to extract corresponding features. Binarized BoxNet predicts the face and hand regions based on the body features.
  • Figure 4: A Binarized Dual Residual Block (BiDRB) composed of both Local Convolution Residual (LCR) and Block Residual (BR).
  • Figure 5: Illustration of our Local Convolution (Base) Residual and four redesign modules, including (c) Down Scale Residual (DScR), (d) Fusion Up Residual (FUR), (e) Fusion Down Residual (FDR), and (f) Down Sample Residual (DSaR). The orange arrow denotes the full-precision information flow. For simplicity, batch normalization and Hardtanh pre-activation are omitted.
  • ...and 3 more figures
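
The captions of Figures 4 and 5 describe residual paths that carry full-precision information around binarized layers, with Hardtanh pre-activation. The sketch below illustrates that general idea on plain Python lists, assuming a toy elementwise layer in place of a binarized convolution and omitting batch normalization. The exact layout of BiDRB is not given in the captions, so `binary_layer`, `local_residual`, and `dual_residual_block` are hypothetical names and the structure is an assumption, not the authors' design.

```python
from typing import List, Sequence, Tuple


def hardtanh(x: Sequence[float]) -> List[float]:
    """Clamp each value to [-1, 1] (the pre-activation named in Figure 5)."""
    return [max(-1.0, min(1.0, v)) for v in x]


def sign(x: Sequence[float]) -> List[float]:
    """Map each value to +1.0 or -1.0 (zero maps to +1.0)."""
    return [1.0 if v >= 0 else -1.0 for v in x]


def binary_layer(x: Sequence[float], w_signs: Sequence[float], alpha: float) -> List[float]:
    """Toy stand-in for a binarized convolution (assumption): elementwise
    product of binarized activations and +/-1 weights, scaled by alpha."""
    xb = sign(hardtanh(x))
    return [alpha * a * w for a, w in zip(xb, w_signs)]


def local_residual(x: Sequence[float], w_signs: Sequence[float], alpha: float) -> List[float]:
    """Shortcut around one binarized layer: the full-precision input x
    bypasses the sign function and is added to the layer output."""
    y = binary_layer(x, w_signs, alpha)
    return [xi + yi for xi, yi in zip(x, y)]


def dual_residual_block(
    x: Sequence[float], layers: Sequence[Tuple[Sequence[float], float]]
) -> List[float]:
    """Stack layers that each have a local shortcut, then add a block-level
    shortcut from the block input to the block output."""
    h = list(x)
    for w_signs, alpha in layers:
        h = local_residual(h, w_signs, alpha)
    return [xi + hi for xi, hi in zip(x, h)]


out = dual_residual_block([0.3, -2.0], [([1.0, -1.0], 0.5)])
```

In this toy setting the binarized layer only sees the signs of its input, so without the shortcuts the magnitudes 0.3 and 2.0 would be lost; the two additions keep them in the output.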