PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
ChangHee Yang, Hyeonseop Song, Seokhun Choi, Seungwoo Lee, Jaechul Kim, Hoseok Do
TL;DR
PoseSyn addresses the real-world generalization gap in 3D pose estimation by converting abundant in-the-wild 2D pose data into diverse 3D pose–image pairs without 3D annotations. It introduces an Error Extraction Module (EEM) to identify hard examples and a Motion Synthesis Module (MSM) that, guided by text and an initial pose representation, generates plausible motion sequences; a human animation model then renders realistic training images. Across three target pose estimators and multiple datasets, PoseSyn yields notable improvements (6–14% MPJPE and 5–9% PA-MPJPE on 3D benchmarks; 1–2% on 2D) by enriching training data with diverse appearances, backgrounds, and challenging poses. The approach is scalable and annotation-free, though currently limited to single-person scenarios; future work aims to enable multi-person motion synthesis and interactions.
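For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of MPJPE (mean per-joint position error), the average Euclidean distance between predicted and ground-truth 3D joints. PA-MPJPE is the same quantity computed after rigidly aligning the prediction to the ground truth with a Procrustes (similarity) transform, which is not implemented here. The function names are illustrative, and reading the reported percentages as relative error reductions is an assumption, not something the summary states.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred and gt are equal-length sequences of (x, y, z) joint coordinates.
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def relative_improvement(baseline_error, new_error):
    """Percent reduction in error relative to the baseline.

    Assumption: improvements are expressed as relative error reductions.
    """
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy example with two joints: one is off by 5 units, the other is exact.
gt_joints = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
pred_joints = [(3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
print(mpjpe(pred_joints, gt_joints))       # 2.5
print(relative_improvement(100.0, 86.0))   # 14.0
```

Under this reading, an estimator whose MPJPE drops from 100 mm to 86 mm would count as a 14% improvement; the numbers in the toy example are made up for illustration only.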
Abstract
Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose–image pairs. PoseSyn comprises two key components: an Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and a Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. By then generating realistic 3D training data via a human animation model aligned with the challenging poses and appearances, PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks covering varied backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.
