Cross-view and Cross-pose Completion for 3D Human Understanding
Matthieu Armando, Salma Galaaoui, Fabien Baradel, Thomas Lucas, Vincent Leroy, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez
TL;DR
This paper tackles domain shift and label scarcity in human-centric vision by replacing ImageNet-style pre-training with self-supervised pre-training on human data. It pre-trains two transformer-based models, CroCo-Body and CroCo-Hand, with cross-view and cross-pose masked image modeling on body and hand imagery: part of one image is masked, and the model reconstructs the masked regions using a second image of the same person. Cross-view pairs are built from HUMBI, AIST and synthetic data, and cross-pose pairs from video datasets. Pre-training uses a human-focused masking strategy with a fixed masking ratio and the objective $L = L_{pose} + L_{view}$, so that the model captures both 3D structure and motion. CroCo-Body and CroCo-Hand achieve state-of-the-art or competitive results on body and hand mesh recovery and related dense tasks, are data-efficient, and extend to binocular settings. The pre-trained models are released for downstream use.
Abstract
Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy, which relies on general-purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked, and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train one model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance, for instance, when fine-tuned for model-based and model-free human mesh recovery.
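The pair-based completion objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: images are assumed to be already split into flattened patches (lists of floats), masking is uniform random rather than human-focused, the 0.75 ratio in the usage note is a placeholder, and `predict` stands in for the transformer. All function names are hypothetical.

```python
import random


def fixed_ratio_mask(num_patches, ratio, rng):
    """Return a boolean list with a fixed share of True (masked) entries."""
    n_masked = round(ratio * num_patches)
    masked = set(rng.sample(range(num_patches), n_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    errs = [(p - t) ** 2
            for pred_patch, target_patch, m in zip(pred, target, mask) if m
            for p, t in zip(pred_patch, target_patch)]
    return sum(errs) / len(errs) if errs else 0.0


def completion_loss(predict, target, reference, ratio, rng):
    """Mask the target image and score its reconstruction from the pair."""
    mask = fixed_ratio_mask(len(target), ratio, rng)
    # Masked patches are hidden from the predictor (None placeholder).
    visible = [None if m else patch for patch, m in zip(target, mask)]
    pred = predict(visible, reference)
    return masked_mse(pred, target, mask)


def pretraining_loss(predict, view_pair, pose_pair, ratio, rng):
    """Sum of the two terms, mirroring L = L_pose + L_view."""
    l_view = completion_loss(predict, view_pair[0], view_pair[1], ratio, rng)
    l_pose = completion_loss(predict, pose_pair[0], pose_pair[1], ratio, rng)
    return l_pose + l_view


def copy_reference(visible, reference):
    """Toy predictor (assumption): fill each hidden patch from the reference."""
    return [ref if vis is None else vis
            for vis, ref in zip(visible, reference)]
```

Calling `pretraining_loss(copy_reference, (img_a, img_b), (frame_t, frame_u), 0.75, random.Random(0))` with four equally sized patch lists returns the summed reconstruction error. The toy predictor only shows where the second image enters the computation; a learned model would have to account for the change of viewpoint or pose between the two images.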
