One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Dongqi Fan, Tao Chen, Mingjie Wang, Rui Ma, Qiang Tang, Zili Yi, Qian Wang, Liang Chang

TL;DR

Pose-Guided Person Image Synthesis (PGPIS) methods often overfit to studio-like data and struggle with in-the-wild samples because of the distribution gap between training and test data. This paper introduces OnePoseTrans, a one-shot, test-time tuning framework that adapts a pre-trained text-to-image model to a single source image. It uses a Visual Consistency Module to align face, text, and image embeddings, along with style and weight injections in the diffusion process. Per-image customization takes approximately 48 seconds on a V100. The approach generalizes better across in-the-wild domains, outperforming baselines on WPose and matching state-of-the-art results on challenging DeepFashion subsets. These findings point to a practical pathway for robust, per-person pose transfer in the wild that reduces reliance on large, diverse training datasets while maintaining high visual fidelity.

Abstract

Current Pose-Guided Person Image Synthesis (PGPIS) methods depend heavily on large amounts of labeled triplet data to train the generator in a supervised manner. However, they often falter when applied to in-the-wild samples, primarily due to the distribution gap between the training datasets and real-world test samples. While some researchers aim to enhance model generalizability through sophisticated training procedures, advanced architectures, or more diverse datasets, we adopt the test-time fine-tuning paradigm to customize a pre-trained Text2Image (T2I) model. However, naively applying test-time tuning results in inconsistencies in facial identities and appearance attributes. To address this, we introduce a Visual Consistency Module (VCM), which enhances appearance consistency by combining the face, text, and image embeddings. Our approach, named OnePoseTrans, requires only a single source image to generate high-quality pose transfer results, offering greater stability than state-of-the-art data-driven methods. For each test case, OnePoseTrans customizes a model in around 48 seconds on an NVIDIA V100 GPU.

Paper Structure (17 sections, 4 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Visual comparisons across three different image domains show that OnePoseTrans delivers comparable results to CFLD and PCDMs on the DeepFashion dataset, while achieving the best visual quality on the WPose dataset. Additionally, OnePoseTrans demonstrates strong generalization ability in in-the-wild scenarios.
  • Figure 2: Existing methods trained on the DeepFashion dataset often exhibit varying degrees of overfitting, even though Unihuman uses additional training data $\sim$10 times larger than DeepFashion. To align with the target image, (row 1) the generated image selectively omits the coat, sunglasses, and other items, (row 2) the generated image alters the background of the source image, (row 3) given only a set of clothing, the model can accurately infer the appearance of a person with their back turned, and (row 4) the model can accurately deduce the clothing.
  • Figure 3: The presence of duplicate persons and clothing items and the limited diversity of the studio background within DeepFashion pose a significant challenge for data-driven supervised training.
  • Figure 4: (Left) The OnePoseTrans pipeline. During the one-shot tuning stage, only the T2I model and a portion of the Visual Consistency Module (VCM) are trainable. (Right) The details of the VCM. Note that the $<$face$>$ token replacement is applied exclusively to the value token, without modifying the key and query tokens.
  • Figure 5: Visual comparisons with state-of-the-art models on the DeepFashion dataset. Our OnePoseTrans achieves comparable visual results to these models.
  • ...and 3 more figures
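
The Figure 4 caption notes that the $<$face$>$ token replacement touches only the value token and leaves the key and query tokens unchanged. The following is a minimal NumPy sketch of that idea in a single-head cross-attention layer, not the authors' implementation. The function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the choice to swap the embedding before the value projection are all assumptions made for illustration; the summary above does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_feats, text_emb, w_q, w_k, w_v,
                    face_emb=None, face_idx=None):
    """Single-head cross-attention with an optional value-only token swap.

    Hypothetical illustration: queries come from image features, keys from
    the unmodified text embeddings. If `face_emb` is given, it replaces the
    token at `face_idx` only in the sequence that feeds the value projection.
    """
    q = image_feats @ w_q          # (n_pixels, d)
    k = text_emb @ w_k             # (n_tokens, d), never modified

    value_src = text_emb
    if face_emb is not None:
        value_src = text_emb.copy()        # keep the caller's array intact
        value_src[face_idx] = face_emb     # swap on the value path only
    v = value_src @ w_v            # (n_tokens, d)

    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_pixels, n_tokens)
    return attn @ v, attn
```

Because the keys and queries are untouched, the attention map (where each image location looks) is identical with or without the swap; only the content read from the face token's slot changes. That is one plausible reading of why the replacement is restricted to the value path.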