Teleportraits: Training-Free People Insertion into Any Scene

Jialu Gao, K J Joseph, Fernando De La Torre

TL;DR

Teleportraits presents a training-free pipeline that inserts a person into any scene using a single reference image by leveraging pre-trained diffusion models. It combines inversion for scene alignment, high-guidance affordance-aware generation, latent blending for background fidelity, and a mask-guided self-attention mechanism to transfer identity features from the reference. The approach achieves state-of-the-art performance on Text2Place-like data, with comprehensive qualitative, automated, and human evaluations, and runs faster than subject-specific training methods. This work highlights the potential of harnessing the intrinsic semantic knowledge of diffusion models for joint placement and personalization without additional training.
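
The latent blending step mentioned above can be illustrated with a minimal sketch. It assumes a binary mask marking the inserted person and uses plain nested lists in place of latent tensors; the names `blend_latents`, `blend_if_late`, and `blend_start` are hypothetical and do not come from the paper, which only states that blending is applied at later denoising steps to preserve the background.

```python
def blend_latents(edited, scene, person_mask):
    """Keep edited latents inside the mask and scene latents outside it.

    All three arguments are 2-D lists of equal shape. person_mask holds 1.0
    where the inserted person is assumed to be and 0.0 elsewhere.
    """
    return [
        [m * e + (1.0 - m) * s for e, s, m in zip(e_row, s_row, m_row)]
        for e_row, s_row, m_row in zip(edited, scene, person_mask)
    ]


def blend_if_late(step, blend_start, edited, scene, person_mask):
    """Blend only once the step index reaches blend_start.

    Assumption: steps count upward from 0 at the start of denoising, and
    blend_start is a hypothetical threshold, not a value from the paper.
    """
    if step < blend_start:
        return edited
    return blend_latents(edited, scene, person_mask)


# Toy 2x2 "latents": the mask keeps the edit on the diagonal only.
edited = [[0.9, 0.8], [0.7, 0.6]]
scene = [[0.1, 0.2], [0.3, 0.4]]
mask = [[1.0, 0.0], [0.0, 1.0]]

blended = blend_latents(edited, scene, mask)
early = blend_if_late(2, 5, edited, scene, mask)
late = blend_if_late(7, 5, edited, scene, mask)
```

In this toy case the off-diagonal entries of `blended` come straight from `scene`, which is the sense in which blending protects background fidelity.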

Abstract

The task of realistically inserting a human from a reference image into a background scene is highly challenging, requiring the model to (1) determine the correct location and poses of the person and (2) perform high-quality personalization conditioned on the background. Previous approaches often treat them as separate problems, overlooking their interconnections, and typically rely on training to achieve high performance. In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. We show that diffusion models inherently possess the knowledge to place people in complex scenes without requiring task-specific training. By combining inversion techniques with classifier-free guidance, our method achieves affordance-aware global editing, seamlessly inserting people into scenes. Furthermore, our proposed mask-guided self-attention mechanism ensures high-quality personalization, preserving the subject's identity, clothing, and body features from just a single reference image. To the best of our knowledge, we are the first to perform realistic human insertions into scenes in a training-free manner and achieve state-of-the-art results in diverse composite scene images with excellent identity preservation in backgrounds and subjects.
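
For readers unfamiliar with classifier-free guidance, the standard combination rule is sketched below. This is the generic formulation, not code from the paper; the function name `cfg_noise` and the example scale of 7.5 are illustrative assumptions, and flat lists stand in for noise-prediction tensors.

```python
def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance combination.

    Moves the unconditional noise prediction toward (and, for scales above 1,
    past) the text-conditioned prediction:
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a two-element "latent".
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

no_guidance = cfg_noise(eps_uncond, eps_cond, 0.0)    # unconditional only
plain_cond = cfg_noise(eps_uncond, eps_cond, 1.0)     # conditional only
high_guidance = cfg_noise(eps_uncond, eps_cond, 7.5)  # hypothetical high scale
```

A larger scale amplifies the difference between the conditional and unconditional predictions, which is what lets a text prompt push a person into a latent that was inverted from the scene image.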

Paper Structure

This paper contains 24 sections, 2 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Illustration of Teleportraits. Teleportraits can insert humans into scenes while maintaining a high degree of affordance.
  • Figure 2: Method overview. Teleportraits consists of three steps: (a) Inversion, where we invert the input scene image and reference image into initial latent noise $z_T^S$ and $z_T^R$. This allows Teleportraits to utilize the inherent semantic knowledge of diffusion models to place humans and to use the hidden representations of diffusion models to perform personalization. (b) Affordance-Aware Human Generation. Starting from the inverted latent $z_T^S$ of the scene image, Teleportraits uses classifier-free guidance to gradually guide the model to generate a human at reasonable locations with realistic poses following the text prompt. Latent blending is applied at later denoising steps to ensure background fidelity. (c) Mask-Guided Self-Attention. Teleportraits achieves personalization through an extended self-attention mechanism that additionally attends to the keys and values extracted from the recovered diffusion process that reconstructs the reference image.
  • Figure 3: SDXL architecture illustration. SDXL consists of 70 attention layers, where each attention layer includes a cross-attention layer and a self-attention layer. In Teleportraits, we apply self-attention-based personalization on the up-2, up-3, and up-4 layers, as they determine the color, style, and texture details.
  • Figure 4: Qualitative results. Teleportraits can perform realistic human insertion into various indoor and outdoor scenes given just a single reference image. Results show that Teleportraits is able to reason about the location and pose of the inserted human while preserving the subject's identity, including hairstyle, clothing, and body shape.
  • Figure 5: Qualitative comparison with baselines. Using the bounding box from the generated human in Teleportraits, Anydoor is able to insert a human but fails to generate realistic human poses that interact with the scene. Compared to Text2Place, Teleportraits not only generates better locations for human insertion, leading to better inpainting results, but also preserves the human identity much better.
  • ...and 14 more figures
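
The extended self-attention described in the Figure 2 caption can be sketched roughly as follows. The concatenation of reference keys and values onto the target's own follows the caption; how the mask is applied is an assumption here (reference tokens outside a subject mask are simply dropped, which is equivalent to giving them zero attention weight). The names `extended_self_attention` and `ref_mask` are hypothetical, and plain lists stand in for attention tensors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def extended_self_attention(q, k_tgt, v_tgt, k_ref, v_ref, ref_mask):
    """Single-head attention over target tokens plus masked reference tokens.

    q, k_tgt, k_ref: lists of vectors with a shared dimension.
    v_tgt, v_ref: lists of value vectors.
    ref_mask: one 0/1 flag per reference token. Assumption: 1 marks tokens
    on the reference subject, and tokens flagged 0 are excluded.
    """
    dim = len(q[0])
    keys = k_tgt + [k for k, keep in zip(k_ref, ref_mask) if keep]
    values = v_tgt + [v for v, keep in zip(v_ref, ref_mask) if keep]
    out = []
    for query in q:
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(logits)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# One target query that matches the reference keys far better than its own key.
q = [[10.0, 0.0]]
k_tgt, v_tgt = [[0.0, 1.0]], [[0.0]]
k_ref, v_ref = [[1.0, 0.0], [1.0, 0.0]], [[1.0], [5.0]]

with_subject = extended_self_attention(q, k_tgt, v_tgt, k_ref, v_ref, [1, 0])
target_only = extended_self_attention(q, k_tgt, v_tgt, k_ref, v_ref, [0, 0])
```

With the first reference token kept, the output is pulled almost entirely toward that token's value, while the masked-out second token contributes nothing. With every reference token masked, the function reduces to ordinary self-attention over the target.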