Human MotionFormer: Transferring Human Motions with Vision Transformers

Hongyu Liu; Xintong Han; Chengbin Jin; Lihui Qian; Huawei Wei; Zhe Lin; Faqiang Wang; Haoye Dong; Yibing Song; Jia Xu; Qifeng Chen

TL;DR

Human MotionFormer transfers the motion of a target person onto a static source person using a dual-encoder, single-decoder Vision Transformer that couples global matching via cross-attention with local refinement through CNNs. A two-branch decoder (warping and generation) learns complementary motion transfer, guided by a mutual learning loss that aligns the features of the two branches spatially. The model achieves state-of-the-art results on the YouTube videos and iPer datasets in a one-shot setting without per-person fine-tuning, which indicates that it generalizes across diverse poses and appearances. Combining global correspondence modeling with local detail preservation is what allows it to produce photorealistic results.

Abstract

Human motion transfer aims to transfer motions from a target dynamic person to a source static one for motion synthesis. An accurate matching between the source person and the target motion in both large and subtle motion changes is vital for improving the transferred motion quality. In this paper, we propose Human MotionFormer, a hierarchical ViT framework that leverages global and local perceptions to capture large and subtle motion matching, respectively. It consists of two ViT encoders that extract features from the inputs (i.e., a target motion image and a source human image) and a ViT decoder with several cascaded blocks for feature matching and motion transfer. In each block, we set the target motion feature as Query and the source person feature as Key and Value, and compute cross-attention maps to perform global feature matching. Further, we introduce a convolutional layer to improve the local perception after the global cross-attention computations. This matching process is implemented in both the warping and generation branches to guide the motion transfer. During training, we propose a mutual learning loss that enables co-supervision between the warping and generation branches for better motion representations. Experiments show that our Human MotionFormer sets a new state of the art both qualitatively and quantitatively. Project page: https://github.com/KumapowerLIU/Human-MotionFormer
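The global matching step described in the abstract can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention, in which target-pose tokens act as the Query and source-person tokens act as both Key and Value. This is not the authors' implementation: the function names, the toy feature values, and the omission of learned projections, multiple heads, and the subsequent convolutional layer are all simplifying assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query_tokens, source_tokens):
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    query_tokens:  target-pose features, shape (N_q, d), used as Query.
    source_tokens: source-person features, shape (N_s, d), used as Key and Value.
    Learned Q/K/V projections are omitted (assumption for brevity).
    """
    d = len(query_tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, attn_maps = [], []
    for q in query_tokens:
        # Similarity of this query token to every source token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in source_tokens]
        weights = softmax(scores)
        # Weighted sum of source tokens (the Values).
        out = [sum(w * v[j] for w, v in zip(weights, source_tokens))
               for j in range(d)]
        outputs.append(out)
        attn_maps.append(weights)
    return outputs, attn_maps

# Toy features: two target-pose tokens, three source tokens (made-up values).
target_pose_feats = [[1.0, 0.0], [0.0, 1.0]]
source_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
out, attn = cross_attention(target_pose_feats, source_feats)
```

Each row of `attn` is a distribution over source tokens, so every target-pose location can draw appearance information from anywhere in the source image. This is the sense in which the matching is global; the convolutional layer mentioned in the abstract would then refine the result locally.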

Paper Structure (19 sections, 14 equations, 12 figures, 4 tables, 1 algorithm)

This paper contains 19 sections, 14 equations, 12 figures, 4 tables, 1 algorithm.

Introduction
Related Works
Proposed Method
Transformer Encoder
Transformer Decoder
Decoder block.
Fusion block.
Mutual learning loss
Experiments
Qualitative Comparisons
Quantitative Comparisons
Ablation Study
Concluding Remarks
Appendix
Details of Attention Process and Encoder
...and 4 more sections

Figures (12)

Figure 1: Human motion transfer results. Target pose images are in the first row, and two source person images are in the first column. Our MotionFormer effectively synthesizes motion transferred results whether the poses in the above two images differ significantly or not.
Figure 2: Overview of our MotionFormer framework. We use two Transformer encoders to extract features of the source image $I_s$ and the target pose image $P_t$. These two features are hierarchically combined in one Transformer decoder where there are multiple decoder blocks. Finally, a fusion block synthesizes the output image by blending the warped and generated images.
Figure 3: Overview of our decoder and fusion blocks. There are warping and generation branches in these two blocks. In the decoder block, we build the global and local correspondence between the source image and target pose with Multi-Head Cross-Attention and a CNN, respectively. The fusion block predicts a mask to combine the outputs of the two branches at the pixel level.
Figure 4: Visual comparison of state-of-the-art approaches and our method on YouTube videos dataset. Our proposed framework generates images with the highest visual quality.
Figure 5: Visual comparison of state-of-the-art approaches and our method on iPer dataset. Our proposed framework generates images with the highest visual quality.
...and 7 more figures
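The pixel-level combination described in the captions of Figures 2 and 3 can be sketched as a convex blend of the two branch outputs under a predicted mask. The sketch below is a hedged illustration rather than the paper's code: the function name `blend`, the convention that a mask value of 1 selects the warped image, and the toy single-channel images are all assumptions.

```python
def blend(warped, generated, mask):
    """Blend two single-channel images pixel by pixel (hypothetical helper).

    Assumed convention: mask value 1.0 keeps the warped pixel,
    0.0 keeps the generated pixel, values in between mix the two.
    """
    return [
        [m * w + (1.0 - m) * g for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(warped, generated, mask)
    ]

# Toy 2x2 inputs (made-up values): the warped image is all ones,
# the generated image is all zeros, so the result equals the mask.
warped = [[1.0, 1.0], [1.0, 1.0]]
generated = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1.0, 0.0], [0.5, 0.25]]
fused = blend(warped, generated, mask)
```

Because the mask is applied per pixel, a blend of this form can keep warped source texture where the warping branch is reliable and fall back to the generated image elsewhere.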
