Semantically Consistent Person Image Generation

Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

TL;DR

This work tackles scene-aware person image generation: inserting a new person into a complex scene while preserving the global context. It introduces a three-stage pipeline: a Pix2PixHD-based coarse semantic map estimator, a data-driven refinement step that selects a near match from a clustered knowledge base of semantic maps, and an exemplar-driven, multi-scale attention renderer for appearance transfer. Key contributions include a clustering-based refinement that improves realism and diversity, a pose-conditioned rendering framework trained with perceptual and adversarial losses, and ablations that examine the contribution of each stage. The approach enables realistic, controllable person insertion in cluttered scenes, reports improvements over several baselines alongside qualitative results, and is relevant to augmented reality and video synthesis applications.

Abstract

We propose a data-driven approach for context-aware person image generation. Specifically, we attempt to generate a person image such that the synthesized instance can blend into a complex scene. In our method, the position, scale, and appearance of the generated person are semantically conditioned on the existing persons in the scene. The proposed technique is divided into three sequential steps. First, we employ a Pix2PixHD model to infer a coarse semantic mask that represents the new person's spatial location, scale, and potential pose. Next, we use a data-centric approach to select the closest representation from a precomputed cluster of fine semantic masks. Finally, we adopt a multi-scale, attention-guided architecture to transfer the appearance attributes from an exemplar image. The proposed strategy enables us to synthesize semantically coherent, realistic persons that can blend into an existing scene without altering the global context. We conclude our findings with relevant qualitative and quantitative evaluations.
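
The second step, selecting the closest fine semantic mask for a coarse query, can be illustrated with a small nearest-neighbour sketch. This is not the authors' implementation. It assumes each mask has already been encoded as a fixed-length feature vector, ranks candidates by cosine similarity (the score reported with the retrieval results in Figure 5), and omits the clustering step in favour of a brute-force search. The names `cosine_similarity`, `retrieve_top_k`, and `knowledge_base`, and the toy vectors, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, knowledge_base, k=5):
    """Return the k entries most similar to the query as (name, score) pairs, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in knowledge_base.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Hypothetical toy data: three "fine" masks encoded as 3-D feature vectors.
knowledge_base = {
    "standing": [1.0, 0.0, 0.2],
    "sitting": [0.1, 1.0, 0.0],
    "walking": [0.9, 0.1, 0.4],
}
coarse_query = [1.0, 0.0, 0.3]  # assumed encoding of a coarse stage-1 mask

top = retrieve_top_k(coarse_query, knowledge_base, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

In a clustered knowledge base, the same ranking would typically be applied only to the members of the cluster nearest to the query, which reduces the number of comparisons.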

Paper Structure (14 sections, 9 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Overview of the proposed method. (a) Original scene. (b) Semantic maps of existing persons in the scene. (c) Coarse estimation of the target person's location, scale, and potential pose. (d) Data-driven refinement of the coarse semantic map. (e) An exemplar of the target person. (f) Generated scene with the rendered target person.
  • Figure 2: The architecture of the proposed method consists of three main stages. (a) Coarse semantic map estimation from the global scene context in stage 1. (b) Data-driven refinement of the initially estimated coarse semantic map in stage 2. (c) Rendering the refined semantic map by transferring appearance attributes from an exemplar in stage 3.
  • Figure 3: Qualitative results of the coarse generation in stage 1. Semantic maps of existing persons are marked in gray, and the coarse estimation of the target semantic map is marked in purple.
  • Figure 4: Qualitative results generated by the proposed method. Each set of examples shows the original scene (left), an exemplar of the target person (middle), and the final generated scene (right).
  • Figure 5: Qualitative results of refinement in stage 2. The first column shows a coarse semantic map as the query, and the following columns show the top-5 refined semantic maps retrieved for both genders. The cosine similarity score for each retrieval is shown below the respective sample. (Best viewed at 400% zoom.)
  • ...and 8 more figures