UniHuman: A Unified Model for Editing Human Images in the Wild

Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin

TL;DR

UniHuman unifies three human image editing tasks in the wild (reposing, virtual try-on, and text manipulation) within a single diffusion-based framework built from a Part Encoder, a Pose-Warping Module, and a Conditioning Encoder. The pose-warping component maps source textures to target poses using dense UV and sparse keypoint correspondences, which supports texture transfer across all three tasks. The LH-400K dataset (400K image-text pairs) provides diverse training data, and two out-of-domain test sets (WPose, WVTON) measure generalization to real-world variations. The objective functions L_SD, L_B, and L_E guide cross-attention to maintain texture fidelity. Empirical results show that UniHuman outperforms task-specific baselines on both in-domain and out-of-domain data, and in user studies it is preferred in an average of 77% of cases. Future work includes extending the framework to video and refining pose estimation and 3D cues to mitigate failure modes.

Abstract

Human image editing includes tasks like changing a person's pose, their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders and introduce a lightweight pose-warping module that can exploit different pose representations, accommodating unseen textures and patterns. Furthermore, to bridge the disparity between existing human editing benchmarks and real-world data, we curated 400K high-quality human image-text pairs for training and collected 2K human images for out-of-domain testing, both encompassing diverse clothing styles, backgrounds, and age groups. Experiments on both in-domain and out-of-domain test sets demonstrate that UniHuman outperforms task-specific models by a significant margin. In user studies, UniHuman is preferred by users in an average of 77% of cases. Our project is available at https://github.com/NannanLi999/UniHuman.

Paper Structure (42 sections, 9 equations, 14 figures, 9 tables)

Figures (14)

  • Figure 1: The results of UniHuman on diverse real-world images. UniHuman learns informative representations by leveraging multiple data sources and connections between related tasks, achieving high-quality results across various human image editing objectives.
  • Figure 2: An overview of our model. (a) Our inference pipeline. Starting from a noise latent code, our model edits the source person given the source image, the target pose, the visual prompt (optional), and the text prompt (optional). The blue arrow is the reposing flow, which is also the base flow for all tasks. The pink dashed arrow indicates the optional virtual try-on flow, which takes a clothing image as its input. In the try-on task, the clothing image replaces the source image as the input to the pose-warping module. The brown dashed arrow is the optional text manipulation flow, which accepts a text description as its prompt. (b) The introduced pose-warping module. It maps the original RGB pixels of the source texture to the target pose based on pose correspondences. Best viewed in color.
  • Figure 3: Representative examples from different datasets. Our LH-400K includes people of diverse ages and backgrounds.
  • Figure 4: Visualized results of reposing (256 x 256). Our model transfers the texture patterns better, particularly in out-of-domain samples. More results can be found in Supp.
  • Figure 5: Virtual try-on results (512 x 512). Our UniHuman better recovers the intricate details in the target garment, particularly in out-of-domain samples. More results can be found in Supp.
  • ...and 9 more figures
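
The pose-warping idea described in the Figure 2(b) caption, mapping source RGB pixels to the target pose through pose correspondences, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It covers only the dense UV case and assumes a DensePose-style IUV layout (channel 0 holds a body-part index with 0 as background, channels 1 and 2 hold U and V in [0, 1]). The function name `warp_by_uv` and the parameters `n_parts` and `res` are hypothetical and chosen for illustration.

```python
import numpy as np


def warp_by_uv(src_rgb, src_iuv, tgt_iuv, n_parts=24, res=32):
    """Warp source pixels to a target pose via a per-part UV texture atlas.

    src_rgb:  (H, W, 3) source image.
    src_iuv:  (H, W, 3) source pose map; channel 0 is the part index
              (0 = background), channels 1 and 2 are U and V in [0, 1].
              This layout is an assumption, not taken from the paper.
    tgt_iuv:  (Ht, Wt, 3) target pose map in the same layout.
    Returns the warped image (Ht, Wt, 3) and a boolean mask (Ht, Wt) marking
    target pixels whose texture was visible in the source.
    """
    atlas = np.zeros((n_parts + 1, res, res, 3), dtype=src_rgb.dtype)
    filled = np.zeros((n_parts + 1, res, res), dtype=bool)

    def bins(iuv):
        # Quantize continuous UV coordinates into atlas cells.
        part = iuv[..., 0].astype(int)
        u = np.clip(np.round(iuv[..., 1] * (res - 1)).astype(int), 0, res - 1)
        v = np.clip(np.round(iuv[..., 2] * (res - 1)).astype(int), 0, res - 1)
        return part, u, v

    # Scatter: write every foreground source pixel into its atlas cell.
    sp, su, sv = bins(src_iuv)
    fg = sp > 0
    atlas[sp[fg], su[fg], sv[fg]] = src_rgb[fg]
    filled[sp[fg], su[fg], sv[fg]] = True

    # Gather: read the atlas at each foreground target pixel's cell.
    tp, tu, tv = bins(tgt_iuv)
    visible = (tp > 0) & filled[tp, tu, tv]
    warped = np.zeros(tgt_iuv.shape[:2] + (3,), dtype=src_rgb.dtype)
    warped[visible] = atlas[tp[visible], tu[visible], tv[visible]]
    return warped, visible
```

Target pixels whose body part or UV cell never appears in the source stay black and are flagged `False` in the returned mask. In a generative pipeline, those are the regions the model would need to synthesize rather than copy.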