Cross-view and Cross-pose Completion for 3D Human Understanding

Matthieu Armando; Salma Galaaoui; Fabien Baradel; Thomas Lucas; Vincent Leroy; Romain Brégier; Philippe Weinzaepfel; Grégory Rogez

TL;DR

This paper tackles domain shift and label scarcity in human-centric vision by moving beyond ImageNet-style pre-training to self-supervised learning on human data. It builds on CroCo, a transformer-based cross-view completion framework, and extends it to cross-view and cross-pose masked image modeling on body and hand imagery: masked regions of one image are reconstructed from its visible regions and a second image. The method builds cross-view pairs from HUMBI, AIST and synthetic data, and cross-pose pairs from video datasets. It uses a human-focused masking strategy with a fixed ratio and the objective $L = L_{pose} + L_{view}$ to capture 3D structure and motion. The resulting models, CroCo-Body and CroCo-Hand, achieve state-of-the-art or competitive results on body/hand mesh recovery and related dense tasks, exhibit strong data efficiency, and extend to binocular settings. Pretrained models are released for downstream use.

Abstract

Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy of relying on general purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs, and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance for instance when fine-tuning for model-based and model-free human mesh recovery.
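The pair-based completion objective described above can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions, not the authors' implementation: the patch size, the 0.75 masking ratio, uniform random masking (standing in for the human-focused masking strategy), the mean-squared-error reconstruction loss, and the `copy_reference_model` placeholder are all hypothetical choices made for illustration. Only the overall structure (mask the first image, reconstruct it with help from the second, sum the cross-pose and cross-view terms as $L = L_{pose} + L_{view}$) comes from the summary.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, ratio, rng):
    """Boolean mask hiding a fixed fraction of patches (True = masked).
    Assumption: uniform sampling, not the paper's human-focused masking."""
    n_masked = int(round(ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=n_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def completion_loss(model, target_img, ref_img, ratio, rng):
    """Mask the first image and score its reconstruction given the second."""
    target = patchify(target_img)
    ref = patchify(ref_img)
    mask = random_mask(len(target), ratio, rng)
    visible = np.where(mask[:, None], 0.0, target)  # zero out masked patches
    pred = model(visible, mask, ref)
    return masked_mse(pred, target, mask)

def copy_reference_model(visible, mask, ref):
    """Hypothetical stand-in for the network: fill each masked patch with
    the co-located patch of the second image."""
    return np.where(mask[:, None], ref, visible)

rng = np.random.default_rng(0)
first = rng.random((64, 64, 3))
other_pose = rng.random((64, 64, 3))

# Toy "cross-view" pair: the second image is identical to the first,
# so copying from it reconstructs the masked patches exactly.
loss_view = completion_loss(copy_reference_model, first, first, 0.75, rng)
# Toy "cross-pose" pair: the second image is unrelated noise.
loss_pose = completion_loss(copy_reference_model, first, other_pose, 0.75, rng)

total_loss = loss_pose + loss_view  # L = L_pose + L_view
```

In this toy setup the cross-view term is zero because the second image carries exactly the missing content, while the cross-pose term stays positive. A real model would instead have to learn how to transfer appearance across viewpoints and poses.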

Paper Structure (18 sections, 2 equations, 13 figures, 6 tables)

This paper contains 18 sections, 2 equations, 13 figures, 6 tables.

Introduction
Related work
Method
Multi-view masked image modeling
Cross-view pair construction
Cross-pose pairs construction
Fine-tuning on downstream tasks
Experiments
Downstream tasks
Ablation studies
Comparison to the state of the art
Extension to binocular tasks
Conclusion
Human texture estimation
Training time
...and 3 more sections

Figures (13)

Figure 1: Human-centric pre-training. We pre-train a model for cross-view and cross-pose completion on body and hands image pairs (middle). This model serves as initialization for fine-tuning on several downstream tasks for both hands (left) and bodies (right). Our model, based on a generic transformer architecture, achieves competitive performance on these tasks without bells and whistles.
Figure 2: Examples of pre-training pairs taken from the different pre-training datasets. denote multi-view datasets, video datasets and synthetic data.
Figure 3: Comparison with other pre-training methods on different downstream tasks (a) or under different fine-tuning data regimes (b), i.e., when varying the number of annotated training samples from COCO$_{part}$ for fine-tuning on the body mesh recovery task from 10% to 100%. MAE-Body/Hand means that we pre-train MAE on the same data as CroCo-Body/Hand.
Figure 4: Impact of the number of pre-training epochs. CroCo-Body is initialized from CroCo while MAE is initialized from ImageNet.
Figure 5: Evaluation scores of various pre-trained models on the texture estimation task of TexFormer, at different fine-tuning stages. From left to right, we report SSIM$\uparrow$ (structural similarity index) and LPIPS$\downarrow$ metrics. All models return a single RGB texture.
...and 8 more figures
