PostoMETRO: Pose Token Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery

Wendi Yang, Zihang Jiang, Shang Zhao, S. Kevin Zhou

TL;DR

This paper presents PostoMETRO (Pose token enhanced MEsh TRansfOrmer), which integrates 2D pose prior knowledge into transformers as tokens to improve the model's performance under occlusion.

Abstract

With the recent advancements in single-image-based human mesh recovery, there is growing interest in enhancing its performance in certain extreme scenarios, such as occlusion, while maintaining overall model accuracy. Although obtaining accurately annotated 3D human poses under occlusion is challenging, there is still a wealth of rich and precise 2D pose annotations that can be leveraged. However, existing works mostly focus on directly leveraging 2D pose coordinates to estimate 3D pose and mesh. In this paper, we present PostoMETRO ($\textbf{Pos}$e $\textbf{to}$ken enhanced $\textbf{ME}$sh $\textbf{TR}$ansf$\textbf{O}$rmer), which integrates an occlusion-resilient 2D pose representation into transformers in a token-wise manner. Using a specialized pose tokenizer, we efficiently condense 2D pose data into a compact sequence of pose tokens and feed them to the transformer together with the image tokens. This process not only ensures a rich depiction of texture from the image but also fosters a robust integration of pose and image information. Subsequently, these combined tokens are queried by vertex and joint tokens to decode the 3D coordinates of mesh vertices and human joints. Facilitated by the robust pose token representation and the effective combination, we are able to produce more precise 3D coordinates, even under extreme scenarios like occlusion. Experiments on both standard and occlusion-specific benchmarks demonstrate the effectiveness of PostoMETRO. Qualitative results further illustrate how 2D pose can help 3D reconstruction. Code will be made available.
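The querying step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, and the names `cross_attend`, `image_tokens`, `pose_tokens`, and `query_tokens` are hypothetical. It only shows query tokens (standing in for joint or vertex tokens) attending over a memory formed by concatenating image tokens with pose tokens.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, memory):
    # Toy single-head scaled dot-product attention (assumption: no learned
    # query/key/value projections, unlike a real transformer layer).
    d = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * k[i] for w, k in zip(weights, memory)) for i in range(d)])
    return out

# Hypothetical 4-dimensional tokens; real models use far larger dimensions.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
pose_tokens = [[0.0, 0.0, 1.0, 0.0]]

# Pose tokens and image tokens are combined into one sequence.
memory = image_tokens + pose_tokens

# Stand-ins for joint/vertex query tokens.
query_tokens = [[0.0, 0.0, 5.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
attended = cross_attend(query_tokens, memory)
```

In this sketch the first query aligns with the pose token and therefore draws most of its output from it, while the second draws mostly from the first image token. The actual model passes the combined tokens through transformer encoders before decoding.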

Paper Structure

This paper contains 20 sections, 5 equations, 10 figures, and 11 tables.

Figures (10)

  • Figure 1: (a) Traditional approach of utilizing 2D pose information by converting it into 3D pose and subsequently generating a 3D mesh. (b) Transforming 2D pose into a sequence of tokens and integrating it with image tokens to generate a 3D mesh.
  • Figure 2: Overall pipeline of PostoMETRO (a) and architecture of the transformer encoder-decoder (b). Before training the transformers, we learn a "pose tokenizer" in VQ-VAE fashion: a human image is tokenized into discrete pose tokens, which a decoder can reconstruct into a 2D pose using the learned codebook. While training the transformers, we freeze the pose tokenizer and the mesh upsampler, and concatenate the pose tokens with the camera token and image tokens as input to the transformer encoders. The transformer decoders take learnable joint tokens and vertex tokens as input. Following FastMETRO (cho2022FastMetro) and PointHMR (kim2023PointHMR), dimensionality reduction and an adjacency matrix are used for effectiveness. Note that the pose confidence for the 2D pose is omitted for simplicity.
  • Figure 3: Qualitative results on the 3DPW (rows 1-3) and OCHuman (rows 4-5) datasets. From left to right: (a) input image, (b) 2D pose decoded from pose tokens, (c) FastMETRO (cho2022FastMetro) results, (d) our results.
  • Figure 4: Per-joint occlusion sensitivity analysis of PostoMETRO with and without joint tokens.
  • Figure 5: Detailed architecture of MLP-Mixer-based layer (a) and MLPBlock (b).
  • ...and 5 more figures
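
The Figure 2 caption describes a VQ-VAE-style pose tokenizer whose discrete tokens are concatenated with the camera token and image tokens. A minimal sketch of that idea follows, assuming nearest-neighbour codebook lookup (the standard VQ-VAE quantization step). The codebook values, feature values, helper names (`quantize`, `lookup`), and the concatenation order are all hypothetical and are not taken from the paper.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    # Map each continuous feature to the index of its nearest codebook entry.
    indices = []
    for f in features:
        dists = [squared_distance(f, c) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def lookup(indices, codebook):
    # Turn discrete indices back into their codebook embeddings.
    return [codebook[i] for i in indices]

# Hypothetical 2-dimensional codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical continuous pose features produced by a tokenizer encoder.
features = [[0.9, 1.2], [-0.8, 0.7], [0.1, -0.1]]

pose_indices = quantize(features, codebook)   # discrete pose tokens
pose_tokens = lookup(pose_indices, codebook)  # their embeddings

# Assumed ordering: camera token, then image tokens, then pose tokens.
camera_token = [[0.0, 0.0]]
image_tokens = [[0.5, 0.5], [0.2, -0.3]]
encoder_input = camera_token + image_tokens + pose_tokens
```

Because the tokenizer is frozen while the transformers are trained, a lookup of this kind would act as a fixed preprocessing step in that stage.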