Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video

Tao Tang, Hong Liu, Yingxuan You, Ti Wang, Wenhao Li

TL;DR

This work tackles 3D human mesh reconstruction from monocular video, where existing methods trade reconstruction accuracy against motion smoothness. It introduces DGTR, a Dual-branch Graph Transformer network with a Global Motion Attention (GMA) branch for long-range temporal modeling and a Local Details Refine (LDR) branch for capturing local detail, the latter built around a Modulated Graph Convolutional Network (MGCN). The GMA branch applies a lightweight transformer over a window of frames ($T=16$, $8$ attention heads, $d=512$) to capture global motion. The LDR branch combines 1D convolutions and MGCN within a transformer to enforce local constraints and extract fine details, using $f' = f + PE$, $m = \mathrm{Norm}(\mathrm{MGCN}(f'))$, and $Y = \mathrm{sigmoid}(D^{-1/2} \tilde{A} D^{-1/2} X (W \odot V))$. On standard datasets (3DPW, MPI-INF-3DHP, Human3.6M), DGTR achieves state-of-the-art reconstruction accuracy and competitive motion smoothness with fewer parameters and FLOPs, supporting its use in practical human-robot interaction applications.
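The graph convolution quoted above, $Y = \mathrm{sigmoid}(D^{-1/2} \tilde{A} D^{-1/2} X (W \odot V))$, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes the usual GCN convention $\tilde{A} = A + I$ with $D$ the degree matrix of $\tilde{A}$, and it assumes the modulation matrix $V$ has the same shape as the weight matrix $W$. The function names and the toy 3-node graph are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, where D is the degree matrix of A + I.

    Adding self-loops (A + I) is an assumption borrowed from standard GCNs.
    """
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]


def modulated_graph_conv(x, adj, w, v):
    """Compute Y = sigmoid(A_hat X (W * V)) with an element-wise product W * V.

    x: N x d_in node features, adj: N x N adjacency,
    w, v: d_in x d_out weight and modulation matrices (same shape, assumed).
    """
    a_hat = normalized_adjacency(adj)
    wv = [[wi * vi for wi, vi in zip(w_row, v_row)] for w_row, v_row in zip(w, v)]
    z = matmul(matmul(a_hat, x), wv)
    return [[1.0 / (1.0 + math.exp(-val)) for val in row] for row in z]


# Toy example: a 3-node chain graph with 2-dimensional node features.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[0.5, -0.5], [0.25, 0.75]]
v = [[1.0, 2.0], [0.5, 1.0]]
y = modulated_graph_conv(x, adj, w, v)
print(y)
```

In this sketch $V$ rescales each entry of $W$ before the usual neighbourhood aggregation; how $V$ is parameterised or shared across nodes in DGTR is not specified here.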

Abstract

Human Mesh Reconstruction (HMR) from monocular video plays an important role in human-robot interaction and collaboration. However, existing video-based human mesh reconstruction methods face a trade-off between accurate reconstruction and smooth motion. These methods design networks based on either RNNs or attention mechanisms to extract local temporal correlations or global temporal dependencies, but the lack of complementary long-term information and local details limits their performance. To address this problem, we propose a Dual-branch Graph Transformer network for 3D human mesh Reconstruction from video, named DGTR. DGTR employs a dual-branch network, consisting of a Global Motion Attention (GMA) branch and a Local Details Refine (LDR) branch, to extract long-term dependencies and crucial local information in parallel, which helps model both global human motion and local human details (e.g., local motion, tiny movements). Specifically, GMA uses a global transformer to model long-term human motion. LDR combines modulated graph convolutional networks with the transformer framework to aggregate local information from adjacent frames and extract crucial information about human details. Experiments demonstrate that DGTR outperforms state-of-the-art video-based methods in reconstruction accuracy while maintaining competitive motion smoothness. Moreover, DGTR uses fewer parameters and FLOPs, which validates the effectiveness and efficiency of the proposed method. Code is publicly available at https://github.com/TangTao-PKU/DGTR.

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of accuracy (MPJPE) against parameters (left) and FLOPs (right) for video-based methods. All methods are evaluated on the 3DPW dataset.
  • Figure 2: Overview of the dual-branch graph transformer network for 3D human mesh reconstruction from video (DGTR). Given a video sequence, ResNet is used to extract static features. The static features of all frames and of $3$ adjacent frames are fed separately into the GMA and LDR branches. GMA then extracts global human motion, and LDR captures the local details of human motion. Finally, DGTR adds the outputs of the GMA and LDR branches and feeds the sum to the SMPL parameter regressor to generate the human mesh.
  • Figure 3: Acceleration error and PA-MPJPE of DGTR for different numbers of input frames.
  • Figure 4: Qualitative comparison of MPS-Net and DGTR on the 3DPW dataset, including fast motion, self-occlusion, and object occlusion scenarios.
  • Figure 5: Qualitative results of DGTR on Internet videos, including complex backgrounds, motion blur, and multi-person scenarios.
  • ...and 1 more figure
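The data flow described in the Figure 2 caption (all frames to GMA, 3 adjacent frames to LDR, branch outputs added before the SMPL regressor) can be sketched as below. This is only an illustration of the routing. It assumes the 3 adjacent frames are centred on the middle frame of the window, which the caption does not state, and the function names `split_branch_inputs` and `fuse` are hypothetical.

```python
def split_branch_inputs(features, local_window=3):
    """Split per-frame static features into global and local branch inputs.

    features: list of per-frame feature vectors (one list of floats per frame).
    Returns (all frames, the `local_window` frames around the middle frame).
    Centring the local window on the middle frame is an assumption.
    """
    if not 1 <= local_window <= len(features):
        raise ValueError("local_window must be between 1 and the number of frames")
    mid = len(features) // 2
    start = mid - local_window // 2
    return features, features[start:start + local_window]


def fuse(global_out, local_out):
    """Element-wise sum of the two branch outputs (fusion by addition)."""
    if len(global_out) != len(local_out):
        raise ValueError("branch outputs must have the same dimension")
    return [g + l for g, l in zip(global_out, local_out)]


# Toy example: T = 16 frames, each with a 4-dimensional static feature.
T = 16
frames = [[float(t)] * 4 for t in range(T)]
global_in, local_in = split_branch_inputs(frames)
local_frame_ids = [int(f[0]) for f in local_in]

# Stand-ins for the branch outputs; the real branches are learned networks.
fused = fuse([0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0])
```

With $T=16$ this routing passes all 16 frames to the global branch and frames 7, 8, and 9 (0-indexed) to the local branch.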