Dual-Branch Graph Transformer Network for 3D Human Mesh Reconstruction from Video

Tao Tang; Hong Liu; Yingxuan You; Ti Wang; Wenhao Li

TL;DR

This work tackles monocular video-based 3D human mesh reconstruction, where existing methods trade reconstruction accuracy against motion smoothness. It introduces DGTR, a Dual-branch Graph Transformer with a Global Motion Attention (GMA) branch for long-range temporal modeling and a Local Details Refine (LDR) branch for local detail capture, integrated with a Modulated Graph Convolutional Network (MGCN). The GMA branch uses a lightweight transformer over a window of frames ($T=16$, 8 attention heads, $d=512$) to capture global motion. The LDR branch combines 1D convolutions and MGCN within a transformer to enforce local constraints and extract fine details, using the equations $f' = f + PE$, $m = \mathrm{Norm}(\mathrm{MGCN}(f'))$, and $Y = \mathrm{sigmoid}(D^{-1/2} \tilde{A} D^{-1/2} X (W \odot V))$. On the 3DPW, MPI-INF-3DHP, and Human3.6M datasets, DGTR achieves state-of-the-art reconstruction accuracy and competitive motion smoothness with fewer parameters and FLOPs, supporting its use in practical human-robot interaction applications.
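To make the graph-convolution formula in the summary concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It evaluates $Y = \mathrm{sigmoid}(D^{-1/2} \tilde{A} D^{-1/2} X (W \odot V))$ on a toy three-joint chain. Several details are assumptions, since the summary does not define them: $\tilde{A}$ is taken to be the adjacency matrix with self-loops added ($A + I$, the usual GCN convention), $D$ is its degree matrix, and the modulation matrix $V$ is assumed to have the same shape as the weight matrix $W$. The function names and toy values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}.

    Assumption: A-tilde is A with self-loops, and D is the degree
    matrix of A-tilde. The source text does not define either symbol.
    """
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j]
             for j in range(n)]
            for i in range(n)]


def modulated_graph_conv(x, adj, w, v):
    """Compute sigmoid(A_hat @ X @ (W * V)) with an element-wise W * V.

    x: N x d_in node features; w and v: d_in x d_out matrices
    (equal shapes for w and v are an assumption of this sketch).
    """
    a_hat = normalized_adjacency(adj)
    wv = [[w[i][j] * v[i][j] for j in range(len(w[0]))]
          for i in range(len(w))]
    z = matmul(matmul(a_hat, x), wv)
    return [[1.0 / (1.0 + math.exp(-val)) for val in row] for row in z]


# Toy example: three joints connected in a chain (0 - 1 - 2).
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
features = [[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]]   # N=3, d_in=2
weights = [[0.5, -0.2], [0.1, 0.3]]                 # d_in=2, d_out=2
modulation = [[1.0, 0.5], [2.0, 1.0]]               # same shape as weights
output = modulated_graph_conv(features, chain, weights, modulation)
```

Each output row mixes a joint's features with those of its neighbours through the normalized adjacency, and the element-wise product lets $V$ rescale individual weights of $W$ before the shared projection is applied.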

Abstract

Human Mesh Reconstruction (HMR) from monocular video plays an important role in human-robot interaction and collaboration. However, existing video-based human mesh reconstruction methods face a trade-off between accurate reconstruction and smooth motion. These methods build on either RNNs, which extract local temporal correlations, or attention mechanisms, which extract global temporal dependencies; because neither combines long-term information with local details, their performance is limited. To address this problem, we propose a Dual-branch Graph Transformer network for 3D human mesh Reconstruction from video, named DGTR. DGTR employs a dual-branch network, consisting of a Global Motion Attention (GMA) branch and a Local Details Refine (LDR) branch, to extract long-term dependencies and crucial local information in parallel, which helps model both global human motion and local human details (e.g., local motion, tiny movements). Specifically, GMA uses a global transformer to model long-term human motion. LDR combines modulated graph convolutional networks with the transformer framework to aggregate local information from adjacent frames and extract crucial information about human details. Experiments demonstrate that DGTR outperforms state-of-the-art video-based methods in reconstruction accuracy while maintaining competitive motion smoothness. Moreover, DGTR uses fewer parameters and FLOPs, which validates the effectiveness and efficiency of the proposed method. Code is publicly available at https://github.com/TangTao-PKU/DGTR.
