STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion

Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang

TL;DR

STAF tackles the challenge of 3D human mesh recovery from video by jointly exploiting spatial detail and temporal coherence. It introduces a spatio-temporal fusion framework with a Temporal Coherence Fusion Module (TCFM) and a Spatial Alignment Fusion Module (SAFM), augmented by an Average Pooling Module (APM) to reduce over-reliance on the target frame and improve sequence-wide smoothness. The method leverages multi-scale spatial features, mesh-projection cues, and adjacent-frame attention to produce accurate mesh parameters $\boldsymbol{\theta}$ and $\boldsymbol{\beta}$ for the target frame while maintaining temporal consistency, achieving a favorable balance between MPJPE/PA-MPJPE/PVE and acceleration error. Experiments on 3DPW, MPII3D, and Human3.6M demonstrate state-of-the-art performance with strong generalization and efficiency, aided by a two-stage training protocol and a compact model footprint.
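
The balance between precision (MPJPE) and smoothness (acceleration error) mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' evaluation code: it uses the commonly used definitions (mean Euclidean joint distance, and the mean difference of second-order finite differences over frames), omits protocol steps such as root alignment, and the array shapes and toy data are assumptions chosen only for illustration.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint.

    pred, gt: arrays of shape (T, J, 3) for T frames and J joints.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def accel_error(pred, gt):
    """Mean difference between predicted and ground-truth joint accelerations.

    Acceleration is approximated by the second-order finite difference
    over consecutive frames, so the result covers T - 2 frames.
    """
    accel_pred = pred[:-2] - 2 * pred[1:-1] + pred[2:]
    accel_gt = gt[:-2] - 2 * gt[1:-1] + gt[2:]
    return float(np.linalg.norm(accel_pred - accel_gt, axis=-1).mean())

# Toy data (hypothetical): 16 frames, 14 joints, 3D coordinates.
rng = np.random.default_rng(0)
gt = rng.normal(size=(16, 14, 3))

# A constant offset hurts precision but leaves the motion perfectly smooth.
offset_pred = gt + np.array([0.01, 0.0, 0.0])
# Per-frame jitter of similar magnitude also hurts smoothness.
jitter_pred = gt + rng.normal(scale=0.01, size=gt.shape)

print("offset:", mpjpe(offset_pred, gt), accel_error(offset_pred, gt))
print("jitter:", mpjpe(jitter_pred, gt), accel_error(jitter_pred, gt))
```

The two toy predictions show why both metrics are reported together: an estimate can be close to the ground truth in every frame and still jitter from frame to frame, which only the acceleration error exposes.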

Abstract

The recovery of 3D human meshes from monocular images has advanced significantly in recent years. However, existing models usually ignore spatial and temporal information, which can lead to misalignment between the mesh and the image and to temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence cues from human motion through an attention-based Temporal Coherence Fusion Module (TCFM). For spatial mesh-alignment evidence, we extract fine-grained local information by projecting the predicted mesh onto the feature maps. Based on these spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition, we propose an Average Pooling Module (APM) that allows the model to focus on the entire input sequence rather than just the target frame. This module can remarkably improve the smoothness of recovery results from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF. We achieve a state-of-the-art trade-off between precision and smoothness. Our code and more video results are on the project page https://yw0208.github.io/staf/

Paper Structure (26 sections, 11 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Comparison with the traditional video-based model MEVA. We choose MPJPE and acceleration error to measure the model’s performance in space and time. Thanks to our spatio-temporal fusion mechanism, our STAF surpasses MEVA in both metrics.
  • Figure 2: The difference between traditional video-based models and our STAF. STAF has an additional spatial encoder compared to traditional video-based models. As a result, STAF can obtain more comprehensive refined features and achieve higher recovery precision.
  • Figure 3: The overall framework of STAF. The input is $T$ images and the output is the reconstruction result for the target frame $I_{\lceil T/2 \rceil}$, which is marked with a red border. We employ a feature pyramid to retain multi-scale spatial information and use projection down-sampling to obtain fine-grained local information. To make full use of the spatio-temporal information, we also add an average pooling module, a temporal coherence fusion module, and a spatial alignment fusion module. The two fusion modules and the entire process of our method are each described in their own sections of the paper.
  • Figure 4: The structure of the temporal coherence fusion module (TCFM). With $T$ features as input, the module outputs $T$ temporally refined features, which we use to obtain the initial human meshes. Note that $\left\{ \Theta _{0,t} \right\} _{t=1}^{T}$ is set to the mean $\overline{\Theta }$, following HMR. The correlation matrix measures the coherence between frames by multiplying two feature matrices. It is a $T \times T$ matrix whose element in the $i$-th row and $j$-th column represents the coherence between the $i$-th frame and the $j$-th frame. Larger values indicate stronger coherence and are shown in brighter colors.
  • Figure 5: The structure of the spatial alignment fusion module (SAFM), illustrated with nine input features $\left\{ \boldsymbol{\phi }_{2,1}^{m},\,\boldsymbol{\phi }_{2,2}^{m},\,\cdots ,\,\boldsymbol{\phi }_{2,9}^{m} \right\}$. The features are first split into groups of three, and each group is integrated into one feature through the attention module. The three integrated features $\left\{ \boldsymbol{\phi }_{2,p}^{m},\,\boldsymbol{\phi }_{2,c}^{m},\,\boldsymbol{\phi }_{2,f}^{m} \right\}$ are then integrated again into one feature $\boldsymbol{\phi }_{2,ref}^{m}$, which we use to recover the 3D human mesh of the target frame.
  • ...and 4 more figures
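
The captions of Figures 4 and 5 describe two operations that a short sketch may make more concrete: a $T \times T$ correlation matrix obtained by multiplying two feature matrices, and a two-level fusion that reduces nine features to three and then to one. The NumPy code below is only an illustration under stated assumptions, not the authors' implementation. The captions do not specify the attention details, so the scaling, the softmax, the use of the same features on both sides of the product, and the choice of the middle feature as the query are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def correlation_matrix(feats):
    """Frame-to-frame coherence scores for features of shape (T, D).

    Entry (i, j) scores the coherence between frame i and frame j.
    Assumption: both factors are the same feature matrix, with scaled
    dot products and a row-wise softmax.
    """
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    return softmax(scores, axis=-1)

def attention_fuse(group):
    """Fuse a group of features, shape (N, D), into one feature of shape (D,).

    Assumption: the middle feature of the group acts as the query.
    """
    query = group[len(group) // 2]
    weights = softmax(group @ query / np.sqrt(group.shape[1]))
    return weights @ group

def hierarchical_fuse(feats, group_size=3):
    """Repeatedly fuse groups until one feature remains (e.g. 9 -> 3 -> 1)."""
    while len(feats) > 1:
        feats = np.stack([
            attention_fuse(feats[i:i + group_size])
            for i in range(0, len(feats), group_size)
        ])
    return feats[0]

# Toy input (hypothetical): nine per-frame features of dimension 8.
rng = np.random.default_rng(0)
feats = rng.normal(size=(9, 8))

corr = correlation_matrix(feats)   # shape (9, 9), each row sums to 1
fused = hierarchical_fuse(feats)   # shape (8,)
print(corr.shape, fused.shape)
```

With nine inputs, the first pass of `hierarchical_fuse` yields three intermediate features, which play the role of the three integrated features in Figure 5, and the second pass yields the single refined feature for the target frame.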