Multi Positive Contrastive Learning with Pose-Consistent Generated Images
Sho Inayoshi, Aji Resindra Widya, Satoshi Ozaki, Junji Otsuka, Takeshi Ohashi
TL;DR
GenPoCCL tackles the challenge of learning from generated data for human-centric perception by generating pose-consistent, appearance-variant images and applying a pose-aware multi-positive contrastive objective. It introduces a [POSE] token to disentangle pose-structure features from appearance, and uses a combined loss $\mathcal{L}=\mathcal{L}_{rec}+\gamma_1\mathcal{L}_{align}+\gamma_2\mathcal{L}_{mp}$ with $\gamma_1=\gamma_2=0.05$, trained on the pose-controlled GenCOCO and GenLUPerson datasets. Despite using under 1% of the synthetic data used by prior work, GenPoCCL achieves superior performance across 2D pose estimation, person ReID, text-to-image ReID, and pedestrian attribute recognition, validating the benefits of pose-consistent data and the [POSE] token. Ablation studies and qualitative analyses corroborate that learning pose-structure alignment yields robust representations, with stronger augmentations further bridging the gap between real and synthetic domains.
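As a minimal sketch of how the combined objective weights its terms, the snippet below evaluates $\mathcal{L}=\mathcal{L}_{rec}+\gamma_1\mathcal{L}_{align}+\gamma_2\mathcal{L}_{mp}$ for scalar loss values. The function name `combined_loss` and the example numbers are hypothetical and chosen only for illustration; only the formula and the default weights of 0.05 come from the summary above.

```python
def combined_loss(l_rec, l_align, l_mp, gamma1=0.05, gamma2=0.05):
    """Weighted sum L = L_rec + gamma1 * L_align + gamma2 * L_mp.

    The three inputs are assumed to be already-computed scalar losses;
    how each term is computed is not specified here.
    """
    return l_rec + gamma1 * l_align + gamma2 * l_mp


# Illustrative values only: with both weights at 0.05, the reconstruction
# term dominates and the other two terms act as small regularizers.
example = combined_loss(l_rec=1.0, l_align=2.0, l_mp=4.0)
```

With equal weights of 0.05, the alignment and multi-positive terms each contribute at one twentieth of their raw magnitude relative to the reconstruction term.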
Abstract
Model pre-training has become essential in various recognition tasks. Meanwhile, with the remarkable advances in image generation models, pre-training methods that use generated images have also emerged, given their ability to produce unlimited training data. However, while existing methods that use generated images excel in classification, they fall short in more practical tasks such as human pose estimation. In this paper, we demonstrate this shortcoming experimentally and propose generating visually distinct images that share identical human poses. We then propose a novel multi-positive contrastive learning method that exploits these generated images to learn structural features of the human body. We term the entire learning pipeline GenPoCCL. Despite using less than 1% of the data used by the current state-of-the-art method, GenPoCCL captures structural features of the human body more effectively, surpassing existing methods in a variety of human-centric perception tasks.
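To make the idea of multi-positive contrastive learning concrete, here is a minimal sketch in plain Python. It assumes the common formulation in which every other image generated from the same pose counts as a positive for a given anchor, and the loss is the cross-entropy between a softmax over cosine similarities and a uniform distribution over those positives. The exact form of the loss used in GenPoCCL is not given in this text, so the function name, the `pose_ids` grouping, and the temperature value are all illustrative assumptions.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def multi_positive_contrastive_loss(embeddings, pose_ids, temperature=0.1):
    """Generic multi-positive contrastive loss (illustrative, not the paper's exact loss).

    embeddings: list of feature vectors, one per image.
    pose_ids:   list of ints; images generated from the same pose share an id.
    """
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if pose_ids[j] == pose_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in others}
        # Numerically stable log of the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Cross-entropy against a uniform target over the positives.
        log_probs = [logits[j] - log_denom for j in positives]
        losses.append(-sum(log_probs) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy check: features that agree with the pose grouping give a lower loss
# than the same features under a grouping that contradicts them.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_consistent = multi_positive_contrastive_loss(features, [0, 0, 1, 1])
loss_inconsistent = multi_positive_contrastive_loss(features, [0, 1, 0, 1])
```

In this toy setup the loss is small when images sharing a pose id have similar embeddings and large when they do not, which is the pressure that would push a model toward pose-structure features rather than appearance.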
