
Scene Aware Person Image Generation through Global Contextual Conditioning

Prasun Roy, Subhankar Ghosh, Saumik Bhattacharya, Umapada Pal, Michael Blumenstein

TL;DR

This work tackles scene-aware person image generation by introducing a three-stage pipeline that first encodes global scene context as an 18-channel heatmap and uses a Wasserstein GAN to predict a context-consistent target pose/location. A pose refinement stage improves facial keypoints, followed by a multi-scale attention-guided pose transfer that generates the final image conditioned on the target person’s image, achieving high-resolution, context-preserving results. Quantitative metrics show strong perceptual quality (LPIPS) and pose accuracy (PCKh) compared to baselines, while qualitative results demonstrate realistic blending with existing scene people. The approach enables flexible insertion of individuals into complex scenes and has practical implications for virtual try-on, pose transfer, and scene composition, though it faces challenges in crowded scenes and depends on robust pose approximation.
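As a rough illustration of the 18-channel scene encoding mentioned above, the sketch below renders every keypoint of every existing person as a Gaussian blob in the channel for its keypoint type. The function name `encode_scene_heatmap`, the `sigma` value, the max-aggregation across persons, and the negative-coordinate convention for missing keypoints are all assumptions made for this example; the paper's exact encoding may differ.

```python
import numpy as np

NUM_KEYPOINTS = 18  # one channel per keypoint type, matching the 18-channel encoding


def encode_scene_heatmap(skeletons, height, width, sigma=4.0):
    """Encode all skeletons in a scene as a (NUM_KEYPOINTS, height, width) heatmap.

    skeletons: array-like of shape (num_persons, NUM_KEYPOINTS, 2) holding (x, y)
    pixel coordinates. A negative coordinate marks a missing keypoint
    (a convention assumed for this sketch).
    """
    skeletons = np.asarray(skeletons, dtype=np.float32).reshape(-1, NUM_KEYPOINTS, 2)
    # Pixel coordinate grids: ys[i, j] = i and xs[i, j] = j.
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    heatmap = np.zeros((NUM_KEYPOINTS, height, width), dtype=np.float32)
    for person in skeletons:
        for k, (x, y) in enumerate(person):
            if x < 0 or y < 0:
                continue  # skip undetected keypoints
            blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            # Max-aggregation keeps each person's peak at 1 (an assumed choice).
            heatmap[k] = np.maximum(heatmap[k], blob)
    return heatmap
```

Because all persons share the same 18 channels, the encoding has a fixed size regardless of how many people the scene contains, which is convenient for conditioning a generator on it.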

Abstract

Person image generation is an intriguing yet challenging problem, and the task becomes even more difficult under constrained situations. In this work, we propose a novel pipeline to generate and insert contextually relevant person images into an existing scene while preserving the global semantics. More specifically, we aim to insert a person such that the location, pose, and scale of the inserted person blend in with the existing persons in the scene. Our method uses three individual networks in a sequential pipeline. First, we predict the potential location and the skeletal structure of the new person by conditioning a Wasserstein Generative Adversarial Network (WGAN) on the existing human skeletons present in the scene. Next, the predicted skeleton is refined through a shallow linear network to achieve higher structural accuracy in the generated image. Finally, the target image is generated from the refined skeleton using another generative network conditioned on a given image of the target person. In our experiments, we achieve high-resolution photo-realistic generation results while preserving the general context of the scene. We conclude our paper with multiple qualitative and quantitative benchmarks on the results.
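To make the first stage more concrete, the sketch below shows the standard Wasserstein objectives for a critic that scores a pose jointly with its scene context. The linear critic `score`, the flat vector inputs, and all function names are hypothetical simplifications for illustration. How the authors enforce the critic's Lipschitz constraint (weight clipping, gradient penalty, or otherwise) is not stated in this section and is not modeled here.

```python
import numpy as np


def score(critic_weights, pose, context):
    """Toy linear critic (hypothetical): scores a pose together with its scene context."""
    return np.concatenate([pose, context], axis=-1) @ critic_weights


def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss: minimizing it widens the real-vs-fake score gap."""
    return float(np.mean(fake_scores) - np.mean(real_scores))


def generator_loss(fake_scores):
    """Wasserstein generator loss: minimizing it raises the critic's score on fakes."""
    return float(-np.mean(fake_scores))
```

Feeding the context to the critic alongside the pose is what makes the setup conditional: a pose is judged plausible or not relative to the people already in the scene, not in isolation.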

Paper Structure

This paper contains 10 sections, 10 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Architecture of the proposed pipeline. The workflow is disentangled among three sequential stages. In stage 1, an approximation of the target pose is estimated by sampling from a Gaussian distribution conditioned over the global geometric context. Next, the crude representation is refined by regression in stage 2. Finally, the pose transfer is carried out by conditioning over the source image in stage 3.
  • Figure 2: Qualitative results of pose approximation followed by pose refinement. Due to spatial location and pose uncertainty, the target pose may look different from the ground truth. However, it does not affect the generation performance as long as the global geometric context is preserved and the target person blends in with the existing persons in the scene.
  • Figure 3: Qualitative results generated by the proposed pipeline.
  • Figure 4: Effects of intermediate pose refinement on generated images.
  • Figure 5: Examples of failure cases.