Cross-view and Cross-pose Completion for 3D Human Understanding

Matthieu Armando, Salma Galaaoui, Fabien Baradel, Thomas Lucas, Vincent Leroy, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez

TL;DR

This paper tackles domain shift and label scarcity in human-centric vision by replacing ImageNet-style pre-training with self-supervised learning on human data. It adapts CroCo, a transformer-based pre-training framework, to perform cross-view and cross-pose masked image modeling on body and hand imagery: masked regions of one image are reconstructed with the help of a second image. The method builds cross-view pairs from HUMBI, AIST and synthetic data, and cross-pose pairs from video datasets. It uses a human-focused masking strategy with a fixed ratio and the objective $L = L_{pose} + L_{view}$ to capture 3D structure and motion. The resulting models, CroCo-Body and CroCo-Hand, achieve state-of-the-art or competitive results on body/hand mesh recovery and related dense tasks, are data-efficient when fine-tuned, and extend to binocular settings. Pretrained models are released for downstream use.

Abstract

Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy of relying on general purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs, and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance for instance when fine-tuning for model-based and model-free human mesh recovery.
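
The pair-based completion objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the patch size, the 0.75 mask ratio, the uniform random masking (the paper uses a human-focused strategy), the mean-squared-error loss, and the `copy_reference_predictor` stand-in for the transformer are all assumptions made for illustration, as are all function names.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, ratio, rng):
    """Boolean mask, True = masked. Uniform sampling is an assumption;
    the paper describes a human-focused masking strategy instead."""
    n_masked = int(round(ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=n_masked, replace=False)] = True
    return mask

def completion_loss(pred, target, mask):
    """Mean squared error computed on the masked patches only (assumed loss)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

def copy_reference_predictor(visible_target, reference, mask):
    """Hypothetical stand-in for the model: keep visible target patches and
    fill masked ones with the co-located reference patches."""
    pred = visible_target.copy()
    pred[mask] = reference[mask]
    return pred

def pair_loss(target_img, reference_img, patch, ratio, rng):
    """Loss for one (masked target, unmasked reference) image pair."""
    target = patchify(target_img, patch)
    reference = patchify(reference_img, patch)
    mask = random_mask(len(target), ratio, rng)
    visible = np.where(mask[:, None], 0.0, target)  # hide masked patches
    pred = copy_reference_predictor(visible, reference, mask)
    return completion_loss(pred, target, mask)

def total_loss(pose_pair, view_pair, patch=4, ratio=0.75, seed=0):
    """L = L_pose + L_view, with one cross-pose and one cross-view pair."""
    rng = np.random.default_rng(seed)
    l_pose = pair_loss(*pose_pair, patch, ratio, rng)
    l_view = pair_loss(*view_pair, patch, ratio, rng)
    return l_pose + l_view
```

In an actual training setup the predictor would be a learned network that encodes both images and decodes the masked patches; the sketch only shows how the mask restricts the loss and how the two pair types contribute to a single objective.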

Paper Structure (18 sections, 2 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Human-centric pre-training. We pre-train a model for cross-view and cross-pose completion on body and hands image pairs (middle). This model serves as initialization for fine-tuning on several downstream tasks for both hands (left) and bodies (right). Our model, based on a generic transformer architecture, achieves competitive performance on these tasks without bells and whistles.
  • Figure 2: Examples of pre-training pairs taken from the different pre-training datasets, which comprise multi-view datasets, video datasets and synthetic data.
  • Figure 3: Comparison with other pre-training methods on different downstream tasks (a) or under different fine-tuning data regimes (b), i.e., when varying the number of annotated training samples from COCO$_{part}$ for fine-tuning on the body mesh recovery task from 10% to 100%. MAE-Body/Hand means that we pre-train MAE on the same data as CroCo-Body/Hand.
  • Figure 4: Impact of the number of pre-training epochs. CroCo-Body is initialized from CroCo while MAE is initialized from ImageNet.
  • Figure 5: Evaluation scores of various pre-trained models on the texture estimation task of TexFormer, at different fine-tuning stages. From left to right, we report the SSIM$\uparrow$ (structural similarity index) and LPIPS$\downarrow$ metrics. All models return a single RGB texture.
  • ...and 8 more figures