Synthesizing Environment-Specific People in Photographs
Mirela Ostrek, Carol O'Sullivan, Michael J. Black, Justus Thies
TL;DR
The paper tackles generating full-body humans who wear scene-appropriate clothing in photographs without altering the background. It introduces ESP, a multi-stage pipeline in which clothing is represented by human parsing masks (HPMs). HPM generation is conditioned on a 2D pose and on scene context via a VAE-derived contextual embedding, and an HPM translation module based on Stable Diffusion and ControlNet then guides high-quality inpainting. Key contributions include (i) contextual style vectors for stochastic, context-aware HPM generation with StyleGAN2, (ii) an end-to-end pose-conditioned image-to-image (I2I) extension, and (iii) a diffusion-based HPM translator enabling high-resolution inpainting and super-resolution using a pre-trained latent diffusion model. Empirically, ESP outperforms state-of-the-art methods on contextual full-body generation, producing environment-consistent clothing while preserving scene integrity; ablations and perceptual evaluations support these results.
Abstract
We present ESP, a novel method for context-aware full-body generation that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and on contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, such that no changes are made to the original background. Our models are trained on a dataset of in-the-wild photographs of people covering a wide range of environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation.
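
The role of the HPM as a tight guiding mask can be illustrated with a minimal compositing sketch. This is not the authors' implementation: the function name composite_with_mask, the choice of label 0 for background, and the toy arrays are all assumptions made for illustration. It only shows why pixels outside the mask remain identical to the input photograph when generated content is pasted back through a per-pixel person mask.

```python
import numpy as np

def composite_with_mask(original, generated, hpm_labels, background_label=0):
    """Keep generated pixels only where the parsing mask marks a person.

    original, generated: H x W x 3 images of the same dtype.
    hpm_labels: H x W integer label map (assumed: background_label = background).
    """
    # Boolean person mask, expanded to H x W x 1 so it broadcasts over channels.
    person = (hpm_labels != background_label)[..., None]
    # Person pixels come from the generated image, all others from the original.
    return np.where(person, generated, original)

# Toy example: black "photograph", white "generated" image, 2x2 person region.
original = np.zeros((4, 4, 3), dtype=np.uint8)
generated = np.full((4, 4, 3), 255, dtype=np.uint8)
hpm = np.zeros((4, 4), dtype=np.int64)
hpm[1:3, 1:3] = 5  # hypothetical clothing label id

result = composite_with_mask(original, generated, hpm)
```

In this toy case only the four masked pixels differ from the original, which mirrors the property stated in the abstract that the background is left unchanged.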
