Disambiguating Monocular Reconstruction of 3D Clothed Human with Spatial-Temporal Transformer

Yong Deng, Baoxing Li, Xu Zhao

TL;DR

The Spatial-Temporal Transformer (STT) network is proposed for 3D clothed human reconstruction. It outperforms state-of-the-art methods and generalizes robustly, even under low-light outdoor conditions.

Abstract

Reconstructing 3D clothed humans from monocular camera data is highly challenging due to viewpoint limitations and image ambiguity. Implicit function-based approaches, combined with prior knowledge from parametric models, have made significant progress, but two notable problems remain. First, the back details of human models are ambiguous because the back is invisible from the camera viewpoint. The quality of the back details depends on the back normal map predicted by a convolutional neural network (CNN), but the CNN lacks the global information awareness needed to comprehend the back texture, which results in excessively smooth back details. Second, a single image suffers from local ambiguity due to lighting conditions and body movement, and implicit functions are highly sensitive to pixel variations in such ambiguous regions. To address these ambiguities, we propose the Spatial-Temporal Transformer (STT) network for 3D clothed human reconstruction. A spatial transformer extracts global information for normal map prediction; establishing global correlations helps the network comprehend the holistic texture and shape of the human body. To compensate for local ambiguity in images, a temporal transformer extracts temporal features from adjacent frames, which can improve the accuracy of the input features of the implicit network. Furthermore, to obtain more accurate temporal features, joint tokens are employed to establish local correspondences between frames. Experimental results on the Adobe and MonoPerfCap datasets show that our method outperforms state-of-the-art methods and maintains robust generalization even under low-light outdoor conditions.
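
The contrast the abstract draws between a CNN's local view and a transformer's global view can be illustrated with a toy example. The sketch below is not the STT network: it is a single-head scaled dot-product self-attention over a handful of made-up patch features, with identity query/key/value projections as a simplifying assumption. The function names `softmax` and `self_attention` and the `patches` values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections (no learned Q/K/V matrices), so the
    tokens themselves act as queries, keys, and values.
    tokens: list of N feature vectors, one per image patch.
    Returns (outputs, weights), where weights[i][j] is the contribution
    of patch j to the updated feature of patch i.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights

# Four made-up 2-D patch features (illustrative values only).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(patches)
```

Every entry of `weights` is strictly positive, so each output patch mixes information from all patches in the image, however far apart they are. A convolution with a small kernel would instead only combine neighbouring patches within one layer.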

Paper Structure

This paper contains 18 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Current implicit reconstruction methods have two problems: (a) Ambiguity in back details due to invisibility. Perceiving global information can help the normal network infer back details. (b) Ambiguity in human images due to lighting conditions and body movement. The corresponding areas in adjacent frames, however, may be less ambiguous.
  • Figure 2: Our method comprises two pivotal modules: (1) a Spatial Transformer (S-Trans) for normal map prediction and (2) a Temporal Transformer (T-Trans) for temporal information extraction from sequences of normal maps. Different procedures are employed for visible (orange box) and invisible (green box) points, with the primary distinction lying in the normal map prediction module. Two S-Trans with identical structures are employed to separately predict the front and back normal maps. The front-normal S-Trans uses the image as input, while the back-normal S-Trans uses the front normal map. The T-Trans uses the same module across both prediction processes. The Joint Tokens guide the network in learning the correspondence between adjacent frames, and the output joint positions serve as a supervisory signal. In addition, to enhance the network's learning in ambiguous areas, we introduce a random mask during the second training stage. Finally, the input of the implicit function consists of 2D features (normal map), 3D features (SDF), and temporal features (T-Trans).
  • Figure 3: The internal architecture of the transformer module. (a) shows the spatial transformer employed for normal prediction, and (b) illustrates the temporal transformer utilized for temporal feature extraction. The structures of the two encoders are identical, but their parameter settings are different. The spatial transformer uses a multi-layer perceptron to transform the encoded global features into a normal map. The temporal transformer uses a decoder to aggregate the features between frames and convert them into temporal features. Additionally, joint tokens are introduced in the temporal transformer. The output joint positions serve as supervised guidance to assist the network in learning the correspondence between frames.
  • Figure 4: Qualitative comparison with state-of-the-art methods on the Adobe dataset. The improved blurry image regions are outlined with a dashed line. Our method demonstrates improvements over PIFuHD and ICON in the reconstruction of details in blurry regions of the image and in preserving the structural integrity of the human body. ECON can also maintain the integrity of local structures, but its performance in details is inferior to our method. Additionally, as indicated by the boxed area, it is sensitive to the accuracy of parametric model estimation.
  • Figure 5: Qualitative comparison of the spatial transformer (S-Trans) for normal prediction with the 2D convolutional module from ICON (Conv) and the 2D convolutional module from TCR with replaced inputs (Conv-refine). Compared to convolutional networks, S-Trans's extraction of global information greatly enhances the prediction of details in invisible areas.
  • ...and 2 more figures
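
The Figure 2 caption states that the implicit function takes 2D features (normal map), 3D features (SDF), and temporal features (T-Trans) as input. A minimal sketch of how such an input might be assembled for one query point follows. It is illustrative only: it replaces the paper's encoder-decoder temporal transformer with a single attention step, omits the joint tokens, and assumes a per-point feature layout. The names `temporal_feature` and `implicit_input` and all numeric values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_feature(current, neighbours):
    """Aggregate adjacent-frame features by attending from the current frame.

    Assumption: one dot-product attention step stands in for the temporal
    transformer; joint tokens are not modelled here.
    current: feature vector of the query point in the current frame.
    neighbours: feature vectors of the corresponding point in adjacent frames.
    """
    scale = 1.0 / math.sqrt(len(current))
    scores = [scale * sum(a * b for a, b in zip(current, n)) for n in neighbours]
    weights = softmax(scores)
    return [
        sum(w * n[d] for w, n in zip(weights, neighbours))
        for d in range(len(current))
    ]

def implicit_input(normal_feat, sdf, temporal_feat):
    """Concatenate 2D, 3D, and temporal features into one input vector."""
    return list(normal_feat) + [sdf] + list(temporal_feat)

# Made-up values for a single query point.
normal_feat = [0.1, -0.3, 0.9]            # normal sampled at the projected pixel
sdf = 0.02                                # signed distance to the body prior
neighbours = [[0.1, -0.2, 0.95],          # same point, previous frame
              [0.0, -0.3, 0.90]]          # same point, next frame

t_feat = temporal_feature(normal_feat, neighbours)
x = implicit_input(normal_feat, sdf, t_feat)
```

Here `x` has seven entries: three from the normal map, one SDF value, and three temporal components. Because the attention weights form a convex combination, each temporal component lies between the corresponding values of the adjacent frames, which is one simple way a locally ambiguous pixel could borrow evidence from its neighbours in time.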