GroupDiff: Diffusion-based Group Portrait Editing

Yuming Jiang, Nanxuan Zhao, Qing Liu, Krishna Kumar Singh, Shuai Yang, Chen Change Loy, Ziwei Liu

TL;DR

GroupDiff is a diffusion-based approach to group photo editing built on three contributions: a training data engine, appearance preservation, and flexible control. It offers controllable editing while maintaining the fidelity of the original photos.

Abstract

Group portrait editing is highly desirable, since users often want to add a person, delete a person, or manipulate existing persons. It is also challenging because of the intricate dynamics of human interactions and the diversity of gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labeled data for group photo editing, we create a data engine that generates paired data for training and covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep appearance consistent after editing, we inject the images of persons from the group photo into the attention modules and employ skeletons to provide intra-person guidance. 3) Control Flexibility: Bounding boxes indicating the location of each person are used to reweight the attention matrix so that the features of each person are injected into the correct places. This inter-person guidance allows flexible manipulation. Extensive experiments demonstrate that GroupDiff achieves state-of-the-art performance compared to existing methods, offering controllable editing while maintaining the fidelity of the original photos.
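
The bounding-box reweighting in contribution 3 can be illustrated with a small sketch. The abstract does not give the actual formula, so everything below is an assumption made for illustration: the function names `box_mask` and `reweighted_attention`, the `penalty` parameter, and the choice of an additive penalty on attention scores before the softmax. Each key token is taken to carry features of one person, and each query is one cell of a latent grid.

```python
import math


def box_mask(box, height, width):
    """Return a height x width grid of 0/1 flags for cells inside box.

    box is (x0, y0, x1, y1) in grid cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]


def reweighted_attention(scores, token_person, boxes, height, width, penalty=5.0):
    """Softmax over keys after penalising keys whose person box misses the query cell.

    scores: one row per query cell (row-major order), one raw score per key token.
    token_person: index of the person each key token belongs to.
    boxes: one bounding box per person.
    penalty: hypothetical value subtracted from scores outside the box.
    """
    masks = [box_mask(b, height, width) for b in boxes]
    weights = []
    for q, row in enumerate(scores):
        y, x = divmod(q, width)
        biased = [s if masks[token_person[k]][y][x] else s - penalty
                  for k, s in enumerate(row)]
        top = max(biased)  # subtract the maximum for numerical stability
        exps = [math.exp(v - top) for v in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# A 1 x 4 grid with two persons: person 0 on the left half, person 1 on the right.
boxes = [(0, 0, 2, 1), (2, 0, 4, 1)]
scores = [[0.0, 0.0] for _ in range(4)]  # uniform raw scores, one token per person
weights = reweighted_attention(scores, [0, 1], boxes, height=1, width=4)
```

With uniform raw scores, cells in the left half attend almost entirely to person 0's token and cells in the right half to person 1's, which is the "inject each person's features into the correct places" behaviour the abstract describes. A softer penalty would let neighbouring persons influence each other more.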

Paper Structure

This paper contains 15 sections, 7 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Applications enabled by GroupDiff. Given a group photo, we can (a) manipulate an existing person, (b) remove a person, and (c) insert a person.
  • Figure 2: Illustration of Common Editing Requests. (a) When we insert a person for whom only a half-body picture is available, we need to adjust the interactions and complete the lower part of her body. (b) When we remove a person from a group photo, we need to change the interactions and inpaint the removed region.
  • Figure 3: Overview of GroupDiff. Starting from a group photo from the dataset, we first use the training data generation pipeline to generate paired data. Then the synthesized pair is fed into the Appearance Preservation Diffusion Model, where inter-person and intra-person guidance are employed to preserve the identities.
  • Figure 4: Coarse Level Training Data Generation for Person Interaction. At the coarse level, we generate masks according to the bounding boxes of persons.
  • Figure 5: Fine Level Training Data Generation for Person Interaction. At the fine level, we generate masks according to the skeleton and augmented skeleton.
  • ...and 8 more figures
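
The coarse-level mask generation described in the Figure 4 caption can be sketched as follows. This is only an illustration of turning person bounding boxes into a binary mask; the function name `coarse_mask` and the `pad` parameter are hypothetical, and the paper's actual data engine may build its masks differently.

```python
def coarse_mask(boxes, height, width, pad=0):
    """Return a height x width 0/1 mask covering every box, each grown by pad pixels.

    boxes: iterable of (x0, y0, x1, y1) with x1 and y1 exclusive.
    pad: hypothetical margin added around each box before rasterising.
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clamp the padded box to the image bounds.
        left, top = max(0, x0 - pad), max(0, y0 - pad)
        right, bottom = min(width, x1 + pad), min(height, y1 + pad)
        for y in range(top, bottom):
            for x in range(left, right):
                mask[y][x] = 1
    return mask


# One person box in a 4 x 5 image, grown by one pixel on every side.
mask = coarse_mask([(1, 1, 3, 3)], height=4, width=5, pad=1)
for row in mask:
    print("".join("#" if v else "." for v in row))
```

The masked region marks where the model would be asked to regenerate content during training, while the unmasked pixels stay as context.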