Synthesizing Environment-Specific People in Photographs

Mirela Ostrek; Carol O'Sullivan; Michael J. Black; Justus Thies

TL;DR

The paper tackles generating full-body humans who wear scene-appropriate clothing, without altering the background of the photograph. It introduces ESP, a multi-stage pipeline in which clothing is modeled with human parsing masks (HPMs). HPM generation is conditioned on a 2D pose and on scene context, the latter via a VAE-derived contextual embedding. An HPM translation module based on Stable Diffusion and ControlNet then uses the generated HPMs to guide high-quality inpainting. Key contributions include (i) contextual style vectors for stochastic, context-aware HPM generation with StyleGAN2, (ii) an end-to-end pose-conditioned image-to-image (I2I) extension, and (iii) a diffusion-based HPM translator that enables high-resolution inpainting and super-resolution using a pre-trained latent diffusion model. Empirically, ESP outperforms state-of-the-art methods on contextual full-body generation, producing environment-consistent clothing while preserving scene integrity; ablation studies and perceptual evaluations support these results.
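The idea of a contextual style vector can be illustrated with a toy sketch. It assumes the context embedding and the random vector are combined by concatenation, which the summary does not state; the function name, dimensions, and embedding values are hypothetical.

```python
import random
from typing import List


def contextual_style_vector(
    context: List[float], noise_dim: int, rng: random.Random
) -> List[float]:
    """Concatenate a fixed context embedding with fresh Gaussian noise.

    Concatenation is an assumption made for illustration only; the source
    says the two vectors are combined but not how.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return list(context) + noise


rng = random.Random(0)
context = [0.2, -1.3, 0.7]  # hypothetical VAE embedding of one scene

# Same scene, three different random draws: the context part stays fixed
# while the noise part varies, which is what makes sampling stochastic.
samples = [contextual_style_vector(context, noise_dim=4, rng=rng) for _ in range(3)]
```

Holding `context` fixed while resampling the noise corresponds loosely to the "fixed context" rows in Figure 2, where outputs vary but stay tied to one scene.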

Abstract

We present ESP, a novel method for context-aware full-body generation that enables photo-realistic synthesis and inpainting of people wearing clothing that is semantically appropriate for the scene depicted in an input photograph. ESP is conditioned on a 2D pose and contextual cues that are extracted from the photograph of the scene and integrated into the generation process, where the clothing is modeled explicitly with human parsing masks (HPM). Generated HPMs are used as tight guiding masks for inpainting, such that no changes are made to the original background. Our models are trained on a dataset containing a set of in-the-wild photographs of people covering a wide range of different environments. The method is analyzed quantitatively and qualitatively, and we show that ESP outperforms the state-of-the-art on the task of contextual full-body generation.
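A minimal sketch of why a tight mask leaves the background untouched: pixels are taken from the generated image only where the HPM marks a person, and from the original photograph everywhere else. This is an illustration, not the authors' implementation; the label values, the convention that label 0 is background, and the helper names are assumptions.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values
Mask = List[List[int]]     # 1 = person region, 0 = background


def hpm_to_person_mask(hpm: List[List[int]], background_label: int = 0) -> Mask:
    """Collapse a parsing label map into a binary person mask.

    Assumes `background_label` marks background and every other label
    (clothing, skin, hair, ...) belongs to the person.
    """
    return [[0 if label == background_label else 1 for label in row] for row in hpm]


def composite(original: Image, generated: Image, mask: Mask) -> Image:
    """Use generated pixels inside the mask and original pixels elsewhere."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


# Hypothetical 3x4 label map: 0 = background, 2 and 5 = two clothing classes.
hpm = [
    [0, 0, 0, 0],
    [0, 2, 2, 0],
    [0, 5, 5, 0],
]
original = [[0.1] * 4 for _ in range(3)]
generated = [[0.9] * 4 for _ in range(3)]

mask = hpm_to_person_mask(hpm)
result = composite(original, generated, mask)
```

Every pixel outside the mask in `result` is identical to `original`, so the scene outside the generated person is preserved by construction.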

Paper Structure (15 sections, 4 equations, 7 figures, 3 tables)

Introduction
Related work
Full-body generation:
Inpainting:
Humans in scenes:
Method
Context embeddings
Contextualized generation of HPMs
HPM translation module
Inpainting and super-resolution
Experiments
Data
Comparison to state of the art
Evaluation and ablation studies
Conclusion

Figures (7)

Figure 1: System overview: (I) an input context image is encoded into the latent space of a VAE, giving context embeddings; (II) the context embeddings are combined with a random vector to form a contextual style vector, which is fed into a pose-conditioned StyleGAN2 HPM generator; (III) the generated HPMs are used as input to pretrained Stable Diffusion/ControlNet modules to achieve fine-grained control over the generated clothing during inpainting.
Figure 2: Analysis of the StyleGAN generator: A. Contextual HPM Generation: Fixing the context (rows F) generates more uniform clothing than varying the context embeddings (rows R). This indicates that there is a link between the context and the environment that has been learned by our context-aware HPM generator. B. Background Reconstruction: GT/VAE reconstructions are shown in rows 1/2; fixing the context vector in the StyleGAN generator gives matching results (row F) with GT/VAE, while varying the context embeddings leads to random predictions (row R).
Figure 3: Comparison against state of the art: we compare (A) StyleGAN-Human, (B) TextHuman, and (C) ControlNet (OpenPose) & SD-2.1-Inpainting with (D) our full method. (A) is unconditional, while (B) is conditioned on a dense pose. Neither respects the context in terms of clothing semantics and lighting, and parts of the body may be missing (see (B), col 2). While (C) is conditioned on a pose, the pose is not always respected. It also changes the original scene (e.g. by inserting new objects; see col 3) and does not always generate a single human (see cols 1, 5). In contrast, ours (D) does not change the scene, generates single humans that are posed correctly, and achieves higher overall contextual alignment, in terms of clothing semantics and lighting, than (A), (B), and (C). Note: (B), (C), and (D) have the same input pose.
Figure 4: A. Super-Resolution: higher image quality may be achieved by lowering the ControlNet strength via $\beta$. However, when $\beta$ is too low, humans are no longer generated ($\beta=0$ shows the vanilla SD inpainting result without additional HPM guidance). B. Fixed pose: generated clothing changes depending on the context.
Figure 5: ESP outputs full-body humans wearing environment-inspired clothing that is semantically suitable for diverse contexts. We show a wide range of different scenes including indoors and outdoors, with varying image resolutions (from $200 \times 200$ to $512 \times 512$), target pose, and lighting conditions. The produced results depend on the quality of the context image and on the capabilities of the generic Stable Diffusion 2.1 inpainting model that is used in the final step of our system. Please zoom in for details.
...and 2 more figures
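The role of $\beta$ in Figure 4 can be sketched with a toy model, assuming the ControlNet signal enters as a residual added to the base model's features and scaled by $\beta$. This mirrors how a conditioning scale is commonly applied, but it is an assumption here, not a description of the authors' code; the function name and feature values are hypothetical.

```python
from typing import List


def apply_control(
    base: List[float], control_residual: List[float], beta: float
) -> List[float]:
    """Add a control residual to base features, scaled by beta."""
    if len(base) != len(control_residual):
        raise ValueError("feature vectors must have the same length")
    return [b + beta * c for b, c in zip(base, control_residual)]


base = [1.0, 2.0, 3.0]       # hypothetical features from the inpainting model
residual = [0.5, -1.0, 2.0]  # hypothetical HPM-driven control residual

no_guidance = apply_control(base, residual, beta=0.0)    # control switched off
half_guidance = apply_control(base, residual, beta=0.5)  # weakened control
full_guidance = apply_control(base, residual, beta=1.0)  # full control
```

With `beta=0.0` the output equals the base features, matching the caption's note that $\beta=0$ gives the plain inpainting result without HPM guidance.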