One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild
Dongqi Fan, Tao Chen, Mingjie Wang, Rui Ma, Qiang Tang, Zili Yi, Qian Wang, Liang Chang
TL;DR
Pose-Guided Person Image Synthesis (PGPIS) methods often overfit to studio-like data and struggle with in-the-wild samples because of the distribution gap between training and test data. This paper introduces OnePoseTrans, a one-shot, test-time tuning framework that adapts a pre-trained text-to-image model to a single source image. It uses a Visual Consistency Module to align face, text, and image embeddings, along with style and weight injections in the diffusion process. The approach customizes a model per image in approximately 48 seconds on an NVIDIA V100 GPU and generalizes better across in-the-wild domains, outperforming baselines on WPose and matching state-of-the-art results on challenging DeepFashion subsets. These findings point to a practical pathway for robust, per-person pose transfer in the wild that reduces reliance on large, diverse training datasets while maintaining high visual fidelity.
Abstract
Current Pose-Guided Person Image Synthesis (PGPIS) methods depend heavily on large amounts of labeled triplet data to train the generator in a supervised manner. However, they often falter when applied to in-the-wild samples, primarily due to the distribution gap between the training datasets and real-world test samples. While some researchers aim to enhance model generalizability through sophisticated training procedures, advanced architectures, or more diverse datasets, we adopt the test-time fine-tuning paradigm to customize a pre-trained Text2Image (T2I) model. Naively applying test-time tuning, though, results in inconsistencies in facial identities and appearance attributes. To address this, we introduce a Visual Consistency Module (VCM), which enhances appearance consistency by combining the face, text, and image embeddings. Our approach, named OnePoseTrans, requires only a single source image to generate high-quality pose transfer results, offering greater stability than state-of-the-art data-driven methods. For each test case, OnePoseTrans customizes a model in around 48 seconds on an NVIDIA V100 GPU.
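As a rough illustration of what "combining the face, text, and image embeddings" could look like, the sketch below projects a face embedding and an image embedding to the width of the text tokens and appends them to the text sequence as extra conditioning tokens. This is an assumption made for illustration only: the abstract does not state how the VCM combines the embeddings, and the function names, dimensions, and random projections here are hypothetical stand-ins for learned components.

```python
import random

def make_projection(in_dim, out_dim, seed):
    # Hypothetical stand-in for a learned linear layer: a fixed random matrix.
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(matrix, vector):
    # Plain matrix-vector product: maps `vector` to the row count of `matrix`.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def combine_embeddings(face, text_tokens, image, proj_face, proj_image):
    # Assumed combination rule: keep the text tokens and append one projected
    # face token and one projected image token to the conditioning sequence.
    face_token = project(proj_face, face)
    image_token = project(proj_image, image)
    return list(text_tokens) + [face_token, image_token]

# Toy dimensions chosen only to keep the example small.
TEXT_DIM = 8
face = [0.1] * 4                                   # face embedding, width 4
image = [0.2] * 6                                  # image embedding, width 6
text_tokens = [[0.0] * TEXT_DIM for _ in range(3)] # three text tokens

proj_face = make_projection(4, TEXT_DIM, seed=0)
proj_image = make_projection(6, TEXT_DIM, seed=1)

cond = combine_embeddings(face, text_tokens, image, proj_face, proj_image)
print(len(cond), len(cond[0]))  # prints "5 8": 3 text tokens + 2 added tokens, each of width 8
```

In a real diffusion pipeline, a sequence like `cond` would typically be fed to the denoiser's cross-attention layers; whether OnePoseTrans does this, or uses a different fusion scheme, is not specified in the abstract.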
