Visual Persona: Foundation Model for Full-Body Human Customization
Jisu Nam, Soowon Son, Zhan Xu, Jing Shi, Difan Liu, Feng Liu, Aashish Misraa, Seungryong Kim, Yang Zhou
TL;DR
Visual Persona tackles full-body text-to-image customization from a single real image with a region-aware transformer architecture that produces dense identity embeddings to condition a frozen diffusion model. It overcomes data scarcity with Visual Persona-500K, a large paired full-body dataset curated via vision-language models and Phi-3 captions to separate identity from intra-individual variation. The method partitions the input into five body regions, encodes local appearance with DINOv2, and uses a body-part transformer to generate dense identity embeddings. These embeddings guide the diffusion model through decoupled identity cross-attention alongside text prompts, yielding high-fidelity, identity-consistent outputs across diverse poses and scenes. Evaluations on GPT-based Dreambench++ and human studies show that Visual Persona outperforms prior methods in identity preservation and text alignment, while enabling applications such as part-guided generation, multi-person customization, and text-guided virtual try-on. Limitations include body-proportion inaccuracies and leakage of identity-unrelated background cues, which point to improved foreground masking and anatomy fidelity as directions for future work.
Abstract
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model. It partitions the input image into distinct body regions, encodes these regions as local appearance features, and independently projects them into dense identity embeddings that condition the diffusion model to synthesize customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.
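As a rough illustration of how dense identity embeddings could condition a diffusion model next to the text prompt, the sketch below implements one decoupled cross-attention step in NumPy. It is not the authors' implementation. The additive combination with an `identity_scale` weight, all tensor sizes, and the four tokens per body region are assumptions made for the example; only the five-region split and the use of a separate identity attention branch come from the summary above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(latent_q, text_kv, identity_kv, identity_scale=1.0):
    # Text and identity tokens are attended to in separate branches,
    # then summed. The weighted sum is an assumed formulation.
    text_out = attention(latent_q, *text_kv)
    identity_out = attention(latent_q, *identity_kv)
    return text_out + identity_scale * identity_out

rng = np.random.default_rng(0)
d = 16                                    # hypothetical feature width
latent_q = rng.normal(size=(64, d))       # queries from 64 latent positions
text_k = rng.normal(size=(77, d))         # hypothetical text-token keys
text_v = rng.normal(size=(77, d))         # hypothetical text-token values

num_regions, tokens_per_region = 5, 4     # 5 regions per the summary; 4 tokens is assumed
id_k = rng.normal(size=(num_regions * tokens_per_region, d))
id_v = rng.normal(size=(num_regions * tokens_per_region, d))

out = decoupled_cross_attention(latent_q, (text_k, text_v), (id_k, id_v))
print(out.shape)
```

Setting `identity_scale=0.0` in this sketch reduces the layer to plain text cross-attention, which is one way to see why keeping the identity branch separate leaves the frozen model's text pathway untouched.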
