PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data

ChangHee Yang, Hyeonseop Song, Seokhun Choi, Seungwoo Lee, Jaechul Kim, Hoseok Do

TL;DR

PoseSyn addresses the real-world generalization gap in 3D pose estimation by converting abundant in-the-wild 2D pose data into diverse 3D pose–image pairs without 3D annotations. It introduces an Error Extraction Module (EEM) to identify hard examples and a Motion Synthesis Module (MSM) that, guided by text and an initial pose representation, generates plausible motion sequences; a human animation model then renders realistic training images. Across three target pose estimators and multiple datasets, PoseSyn yields notable improvements (6–14% MPJPE and 5–9% PA-MPJPE on 3D benchmarks; 1–2% on 2D) by enriching training data with diverse appearances, backgrounds, and challenging poses. The approach is scalable and annotation-free, though currently limited to single-person scenarios; future work aims to enable multi-person motion synthesis and interactions.
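For readers unfamiliar with the metric quoted above, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints; PA-MPJPE is the same quantity measured after a Procrustes (rigid similarity) alignment of the prediction to the ground truth. The following is a minimal sketch, not the paper's evaluation code. The function names are hypothetical, and reading the quoted percentages as relative error reductions is an assumption.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between corresponding predicted and ground-truth joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline_err, new_err):
    """Percentage reduction in error relative to a baseline
    (assumed interpretation of the reported gains)."""
    return 100.0 * (baseline_err - new_err) / baseline_err


# Toy two-joint skeleton; coordinates are illustrative only.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(pred, gt)  # per-joint distances are 3 and 4
```

Under this reading, a drop from an MPJPE of 100.0 to 86.0 (in whatever unit the benchmark uses) would count as a 14% improvement.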

Abstract

Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose–image pairs. PoseSyn comprises two key components: the Error Extraction Module (EEM), which identifies challenging poses in the 2D pose datasets, and the Motion Synthesis Module (MSM), which synthesizes motion sequences around those challenging poses. By generating realistic 3D training data via a human animation model aligned with challenging poses and appearances, PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks covering varied backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.

Paper Structure

This paper contains 26 sections, 11 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: PoseSyn's Approach for Corner-Case Pose Generation. PoseSyn addresses challenging pose cases without direct, costly 3D annotations. Starting from a pseudo-labeled pose (blue circle) that is inaccurate compared to the unknown true 3D pose (red star), PoseSyn generates diverse motion sequences (green dotted lines) around the pseudo-labeled pose to produce new image–pose pairs (numbered 1–4). PoseSyn effectively bridges the gap between the inaccurate pseudo-label and the desired challenging pose, expanding the training distribution with hard examples.
  • Figure 2: Proposed Framework for 3D Pose Data Synthesis. Our framework begins by identifying challenging and non-challenging images in a 2D pose dataset using the Error Extraction Module (EEM). EEM isolates poses with high error from the Target Pose Estimator (TPE) as challenging data, which are then processed in the Motion Synthesis Module (MSM) to generate complex motion sequences. Subsequently, a motion-guided video generation model creates synthetic training data using non-challenging images as references, which then undergoes a filtering process to finally train the TPE.
  • Figure 3: EEM Results. The proposed EEM identifies (a) challenging and (b) non-challenging data from an in-the-wild 2D pose dataset. Challenging data includes intricate and dynamic poses, whereas non-challenging data consists of stationary, static poses, demonstrating the effectiveness of the EEM.
  • Figure 4: SMG Architecture. Proposed Semantic-guided Motion Generation (SMG) augments a mis-predicted pose into motion sequences. An initial motion representation $\mathcal{MR}_{\text{init}}$ is encoded and mapped into the codebook to produce initial motion indices $\mathcal{S}_{\mathcal{MR}}$. The transformer takes text embeddings $\mathbf{e}_{\text{text}}$ and $\mathcal{S}_{\mathcal{MR}}$ both as input and generates motion indices for motion sequence $\mathcal{M}_{\text{C}}$.
  • Figure 5: Qualitative Comparison with Baselines. 3DCrowdNet trained with only real data exhibits inaccurate 3D pose predictions with limited generalization capability. Baselines (i.e., PoseGen and Ours-N) show limited performance gains, but our approach achieves more accurate pose predictions across diverse real-world datasets with various backgrounds and occlusions, challenging poses, and multi-views. Red boxes highlight areas of incorrect predictions in models trained with real-only data and baseline methods.
  • ...and 11 more figures
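
The Figure 2 caption says the EEM isolates poses with high Target Pose Estimator (TPE) error as challenging data and keeps the remainder as non-challenging references. A minimal sketch of that split follows. It assumes the error is the mean 2D keypoint distance between the TPE prediction and the dataset's 2D annotation, and that a fixed threshold is used. The error measure, the threshold value, and all names here are illustrative assumptions, not details confirmed by the source.

```python
import math


def mean_2d_error(pred_kpts, gt_kpts):
    """Average pixel distance between predicted and annotated 2D keypoints
    (assumed error measure)."""
    return sum(math.dist(p, g) for p, g in zip(pred_kpts, gt_kpts)) / len(gt_kpts)


def split_by_error(samples, threshold):
    """Partition sample ids into (challenging, non_challenging) lists.

    Each sample is a dict with hypothetical keys 'id', 'pred' and 'gt'.
    Samples whose error exceeds the threshold count as challenging.
    """
    challenging, non_challenging = [], []
    for sample in samples:
        err = mean_2d_error(sample["pred"], sample["gt"])
        if err > threshold:
            challenging.append(sample["id"])
        else:
            non_challenging.append(sample["id"])
    return challenging, non_challenging


# Toy data: sample "a" is predicted exactly, sample "b" has one joint far off.
samples = [
    {"id": "a", "pred": [(0, 0), (10, 10)], "gt": [(0, 0), (10, 10)]},
    {"id": "b", "pred": [(0, 0), (0, 0)], "gt": [(30, 40), (0, 0)]},
]
hard, easy = split_by_error(samples, threshold=5.0)
```

In the pipeline the caption describes, the challenging set would feed the Motion Synthesis Module, while the non-challenging images would serve as appearance references for video generation.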