UniHuman: A Unified Model for Editing Human Images in the Wild
Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin
TL;DR
UniHuman tackles human image editing in the wild by unifying reposing, virtual try-on, and text manipulation within a single diffusion-based framework built from a Part Encoder, a Pose-Warping Module, and a Conditioning Encoder. The pose-warping component maps source textures to target poses using dense UV and sparse keypoint correspondences, enabling robust texture transfer across all three tasks. The large-scale LH-400K dataset (400K image-text pairs) supports generalization to real-world variations, and two out-of-domain test sets (WPose, WVTON) evaluate it. The objective functions L_SD, L_B, and L_E guide cross-attention to maintain texture fidelity. Empirical results show that UniHuman outperforms task-specific baselines on both in-domain and out-of-domain data, and in user studies it is preferred in an average of 77% of cases. The approach targets practical applications that require versatile, texture-consistent human editing. Future work includes extending the framework to video while further refining pose estimation and 3D cues to mitigate failure modes.
Abstract
Human image editing includes tasks like changing a person's pose, changing their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders and introduce a lightweight pose-warping module that can exploit different pose representations, accommodating unseen textures and patterns. Furthermore, to bridge the gap between existing human editing benchmarks and real-world data, we curated 400K high-quality human image-text pairs for training and collected 2K human images for out-of-domain testing, both encompassing diverse clothing styles, backgrounds, and age groups. Experiments on both in-domain and out-of-domain test sets demonstrate that UniHuman outperforms task-specific models by a significant margin. In user studies, users prefer UniHuman in 77% of cases on average. Our project is available at https://github.com/NannanLi999/UniHuman.
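To make the idea of warping textures through a pose representation concrete, the following is a minimal sketch of texture transfer via dense UV correspondence. It is an illustration under stated assumptions, not the pose-warping module from the paper: the function name `warp_by_uv`, the `bins` parameter, and the list-of-lists image format are all hypothetical, and the sketch ignores the sparse keypoint representation and any learned components. It assumes each foreground pixel carries a body-surface coordinate (u, v) in [0, 1], with `None` marking background.

```python
from typing import Dict, List, Optional, Tuple

Pixel = Tuple[int, int, int]
UV = Optional[Tuple[float, float]]  # None marks background (no body surface)


def warp_by_uv(
    src_img: List[List[Pixel]],
    src_uv: List[List[UV]],
    tgt_uv: List[List[UV]],
    bins: int = 8,
) -> List[List[Optional[Pixel]]]:
    """Scatter source pixels into a coarse UV atlas, then gather at target UVs.

    Hypothetical helper for illustration only. Target pixels whose surface
    location was not visible in the source come back as None.
    """

    def cell(uv: Tuple[float, float]) -> Tuple[int, int]:
        u, v = uv
        # Clamp so that a coordinate of exactly 1.0 falls into the last bin.
        return (min(int(u * bins), bins - 1), min(int(v * bins), bins - 1))

    # Scatter: index source colors by their quantized surface coordinate.
    atlas: Dict[Tuple[int, int], Pixel] = {}
    for row_px, row_uv in zip(src_img, src_uv):
        for px, uv in zip(row_px, row_uv):
            if uv is not None:
                atlas[cell(uv)] = px  # last write wins; no blending here

    # Gather: look up each target pixel's surface coordinate in the atlas.
    return [
        [atlas.get(cell(uv)) if uv is not None else None for uv in row_uv]
        for row_uv in tgt_uv
    ]
```

In this sketch the `None` entries in the output mark regions the source image cannot explain, either background or body surface that was occluded in the source pose. In a generative pipeline such regions would be left for the model to synthesize rather than copied.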
