IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Soo Ye Kim, He Zhang, Wei Xiong, Daniel Aliaga

TL;DR

IMPRINT addresses the challenge of identity-preserving generative object compositing by decoupling identity learning from background harmonization through a two-stage diffusion framework. The first stage learns a context-agnostic, view-invariant object representation using a DINOv2-based encoder, while the second stage uses this embedding to harmonize the object with the background and allows shape-guided pose control. It achieves state-of-the-art performance in identity preservation and background harmonization across diverse datasets, validated by quantitative metrics and user studies, and enables flexible shape-guided generation via mask-based controls. The work advances practical, editable compositing with diffusion models and points to future 3D-aware extensions and higher-resolution latent encoders for even better fidelity.

Abstract

Generative object compositing emerges as a promising new avenue for compositional image editing. However, the requirement of object identity preservation poses a significant challenge, limiting practical usage of most existing methods. In response, this paper introduces IMPRINT, a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic, identity-preserving pretraining of the object encoder, enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition, IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality.

Paper Structure

This paper contains 30 sections, 2 equations, 17 figures, and 8 tables.

Figures (17)

  • Figure 1: Top: Comparison with three prior works, i.e., Paint-by-Example [yang2023paint], ObjectStitch [song2023objectstitch], and TF-ICON [lu2023tf]. Our method IMPRINT outperforms others in terms of identity preservation and color/geometry harmonization. Bottom: Given a coarse mask, IMPRINT can change the pose of the object to follow the shape of the mask.
  • Figure 2: The two-stage training pipeline of the proposed IMPRINT.
  • Figure 3: Illustration of the background-blending process. At each denoising step, the background area of the denoised latent is masked and blended with unmasked area from the clean background (intuitively, the model is only denoising the foreground).
  • Figure 4: Illustration of the data augmentation pipeline.
  • Figure 5: Qualitative comparison on the DreamBooth test set. Paint-by-Example and ObjectStitch lose most object details and only maintain categorical information. TF-ICON tends to copy the pose of the input subject. The comparison highlights the advantage of IMPRINT in keeping identity and making geometric changes.
  • ...and 12 more figures
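
The background-blending process described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_denoise_step` is a hypothetical stand-in for the diffusion model, latents are flat lists of floats rather than tensors, and the function names are invented for illustration. Following the caption, the clean background is pasted back at every step; whether the background is re-noised to match the current timestep is not stated in this section and is not modelled here.

```python
import random


def blend_background(denoised, background, mask):
    """Keep the model output where mask == 1.0 (foreground) and
    restore the clean background where mask == 0.0."""
    return [m * d + (1.0 - m) * b for d, b, m in zip(denoised, background, mask)]


def toy_denoise_step(latent, rng):
    # Hypothetical stand-in for one denoising step of a diffusion model.
    return [0.9 * x + 0.1 * rng.gauss(0.0, 1.0) for x in latent]


def composite(background, mask, num_steps=10, seed=0):
    """Run a toy denoising loop, blending the background back in after each step."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in background]  # start from pure noise
    for _ in range(num_steps):
        latent = toy_denoise_step(latent, rng)
        # Only the foreground region is effectively denoised by the model;
        # the background region is overwritten with the clean background.
        latent = blend_background(latent, background, mask)
    return latent
```

With a mask that is 1.0 inside the object region and 0.0 elsewhere, the values returned for background positions equal the clean background exactly, while foreground positions carry whatever the denoiser produced.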