Multi Positive Contrastive Learning with Pose-Consistent Generated Images

Sho Inayoshi, Aji Resindra Widya, Satoshi Ozaki, Junji Otsuka, Takeshi Ohashi

TL;DR

GenPoCCL addresses learning from generated data for human-centric perception by generating pose-consistent, appearance-varied images and applying a pose-aware multi-positive contrastive objective. It introduces a [POSE] token to disentangle pose-structure features from appearance, and it trains with the combined loss $\mathcal{L}=\mathcal{L}_{rec}+\gamma_1\mathcal{L}_{align}+\gamma_2\mathcal{L}_{mp}$ with $\gamma_1=\gamma_2=0.05$ on the pose-controlled GenCOCO and GenLUPerson datasets. Despite using less than 1% of the synthetic data used by prior work, GenPoCCL outperforms existing methods on 2D pose estimation, person ReID, text-to-image ReID, and pedestrian attribute recognition, which supports the value of pose-consistent data and the [POSE] token. Ablation studies and qualitative analyses indicate that learning pose-structure alignment yields robust representations, and that stronger augmentations further narrow the gap between real and synthetic domains.
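To make the weighting in the combined loss concrete, the following minimal sketch evaluates $\mathcal{L}$ for one set of per-term values. The function name `combined_loss` and the three input numbers are assumptions chosen for illustration only; they are not values reported by the paper. Only the formula and $\gamma_1=\gamma_2=0.05$ come from the summary above.

```python
def combined_loss(l_rec, l_align, l_mp, gamma1=0.05, gamma2=0.05):
    """Weighted sum L = L_rec + gamma1 * L_align + gamma2 * L_mp.

    gamma1 and gamma2 default to 0.05, the value stated in the summary.
    """
    return l_rec + gamma1 * l_align + gamma2 * l_mp


# Hypothetical per-term values, chosen only to show the arithmetic.
total = combined_loss(l_rec=0.80, l_align=2.0, l_mp=4.0)

# 0.80 + 0.05 * 2.0 + 0.05 * 4.0, i.e. about 1.1
print(f"total loss: {total:.4f}")
```

With weights this small, the reconstruction term dominates the total unless the alignment or contrastive terms are an order of magnitude larger, so the two auxiliary terms act as regularizers on top of masked reconstruction.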

Abstract

Model pre-training has become essential in various recognition tasks. Meanwhile, with the remarkable advances in image generation models, pre-training methods that use generated images have also emerged, given their ability to produce unlimited training data. However, while existing methods that use generated images excel at classification, they fall short in more practical tasks, such as human pose estimation. In this paper, we demonstrate this shortcoming experimentally and propose generating visually distinct images with identical human poses. We then propose a novel multi-positive contrastive learning method, which optimally utilizes the generated images to learn structural features of the human body. We term the entire learning pipeline GenPoCCL. Despite using less than 1% of the data used by the current state-of-the-art method, GenPoCCL captures structural features of the human body more effectively, surpassing existing methods in a variety of human-centric perception tasks.

Paper Structure

This paper contains 32 sections, 4 equations, 5 figures, 10 tables, and 2 algorithms.

Figures (5)

  • Figure 1: We compare our method to StableRep [tian2023stablerep] and SynCLR [tian2023synclr]. Both StableRep and SynCLR use a single prompt to generate semantically similar images, which are then treated as positive pairs for contrastive learning. Our GenPoCCL goes a step further: it generates images from the same prompt and the same human body pose condition and uses them as positive pairs for contrastive learning.
  • Figure 2: Overall pipeline of our method. Using Stable Diffusion [rombach2022stablediffusion] and T2I-Adapter [mou2023t2i], we generate pose-consistent images with varying appearances for contrastive learning from a single prompt and pose condition by altering the initial seed. In pre-training, selected human-part patches are masked following HAP [yuan2023hap]. We reconstruct images using a shared encoder-decoder, applying the reconstruction loss $\mathcal{L}_{rec}$. [CLS] tokens are aligned with HAP's alignment loss $\mathcal{L}_{align}$, and we introduce a [POSE] token with a multi-positive contrastive loss $\mathcal{L}_{mp}$ to refine pose and appearance learning.
  • Figure 3: Examples of generated images with the input human body pose condition. By varying the initial noise, we succeed in generating pose-consistent, appearance-varied images. We crop the images along with bounding-box labels.
  • Figure 4: StableRep trained on GenCC12M dataset.
  • Figure 5: GenPoCCL trained on GenCOCO dataset.
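
The captions for Figures 1 and 2 describe treating images generated from the same prompt and pose condition as mutual positives. A minimal sketch of one common multi-positive contrastive formulation is shown below: for each anchor, a softmax over similarities to all other samples is compared by cross-entropy against a target that is uniform over the anchor's positives (the style of loss used, for example, by StableRep). This is an illustration under stated assumptions, not the paper's exact $\mathcal{L}_{mp}$: the function name, the `pose_ids` grouping labels, the temperature of 0.1, and the toy embeddings are all hypothetical.

```python
import math


def multi_positive_contrastive_loss(embeddings, pose_ids, temperature=0.1):
    """Mean cross-entropy between a uniform-over-positives target and the
    softmax over cosine similarities to all other samples in the batch.

    embeddings: list of equal-length float vectors (e.g. [POSE] token outputs).
    pose_ids:   one label per embedding; equal labels mark samples generated
                from the same pose condition (assumed grouping, hypothetical).
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if pose_ids[j] == pose_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes no term

        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in others
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        peak = max(logits.values())
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))

        # Uniform target over positives: average negative log-probability.
        losses.append(-sum(logits[j] - log_z for j in positives) / len(positives))

    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: samples 0/1 point in one direction, samples 2/3 in another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = multi_positive_contrastive_loss(toy, pose_ids=[0, 0, 1, 1])
loss_mismatched = multi_positive_contrastive_loss(toy, pose_ids=[0, 1, 0, 1])
print(loss_matched < loss_mismatched)
```

In the toy batch, the loss is lower when the pose labels agree with the geometry of the embeddings than when they cut across it, which is the behaviour a pose-aware objective is meant to reward.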