Learning Feature-Preserving Portrait Editing from Generated Pairs

Bowei Chen, Tiancheng Zhi, Peihao Zhu, Shen Sang, Jing Liu, Linjie Luo

TL;DR

The paper tackles portrait editing, where the main challenge is preserving subject features such as identity while applying the desired edit. It proposes a low-cost data-generation pipeline that produces aligned input-target pairs, and trains a Multi-Conditioned Diffusion Model (MCDM) on them. The model fuses several conditioning signals to learn the editing direction and to guard against unwanted feature changes; a mask-guided inference step further protects subject details. Key contributions are the conditional data generation strategy, the MCDM architecture with spatial, text, and image conditioning, and the automatic editing mask that guides inference. Experiments on costume editing and cartoon expression editing report quantitative and user-study evidence of state-of-the-art quality and feature preservation, with ablations showing the contribution of each component. The approach is presented as a practical, scalable way to obtain high-quality, feature-preserving portrait edits.

Abstract

Portrait editing is challenging for existing techniques due to difficulties in preserving subject features like identity. In this paper, we propose a training-based method leveraging auto-generated paired data to learn the desired editing while ensuring the preservation of unchanged subject features. Specifically, we design a data generation process to create reasonably good training pairs for the desired editing at low cost. Based on these pairs, we introduce a Multi-Conditioned Diffusion Model to effectively learn the editing direction and preserve subject features. During inference, our model produces an accurate editing mask that can guide the inference process to further preserve detailed subject features. Experiments on costume editing and cartoon expression editing show that our method achieves state-of-the-art quality, quantitatively and qualitatively.
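
The abstract does not spell out how the editing mask guides inference. One common way to use such a mask, shown here purely as an illustrative assumption and not as the paper's procedure, is to blend the edited result with the input so that pixels outside the mask stay untouched. The function name `blend_with_mask` and the grayscale list-of-rows image representation are hypothetical, chosen only to keep the sketch dependency-free.

```python
from typing import List

# Hypothetical representation: a grayscale image as rows of floats in [0, 1].
Image = List[List[float]]


def blend_with_mask(original: Image, edited: Image, mask: Image) -> Image:
    """Per-pixel blend: mask=1 takes the edited pixel, mask=0 keeps the original.

    Assumption: all three inputs share the same height and width.
    """
    out: Image = []
    for o_row, e_row, m_row in zip(original, edited, mask):
        row = []
        for o, e, m in zip(o_row, e_row, m_row):
            m = min(max(m, 0.0), 1.0)  # clamp soft mask values to [0, 1]
            row.append(m * e + (1.0 - m) * o)
        out.append(row)
    return out
```

With a binary mask this copies the input outside the edited region; a soft mask gives a smooth transition at the boundary. In a diffusion setting the same blend could in principle be applied to latents at each denoising step, but whether the paper does so is not stated here.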

Paper Structure (9 sections, 3 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Our method takes a portrait image as input, and applies advanced editing effects with our proposed framework. We can handle both real human portraits (1st row) as well as cartoon characters (2nd row). Our approach obtains superior aesthetic quality while at the same time preserving key features from the input subject. Compared with baseline approaches (left), we achieve better subject feature preservation (e.g., identity), structural alignment, and fewer artifacts.
  • Figure 2: Overview of our pipeline. Paired Data Generation (blue dashed box) first constructs training pairs using Composable Diffusion [liu2022compositional], conditioned on pose and identity information. The Multi-Conditioned Diffusion Model (green dashed box) encodes multiple condition signals to learn the editing direction and preserve subject features based on the generated pairs. The multi-condition design improves robustness to imperfections in the training pairs.
  • Figure 3: Examples of pairs generated by different strategies. Prompt-to-Prompt (a) fails to produce pairs with consistent identity. Without the pose condition, (b) produces pairs with significant spatial misalignment. Without the identity condition, (c) results in pairs with obvious differences in face shape. Our strategy (d) significantly mitigates these issues.
  • Figure 4: Training on a dataset with less diverse identities (b) yields results whose identity is inconsistent with the input (a). Conversely, training on a dataset with diverse identities yields the desired editing outcome (c), demonstrating better generalization.
  • Figure 5: Illustration of the Multi-Conditioned Diffusion Model, where image and text embeddings are injected into the model in different ways to effectively learn the editing direction and preserve subject features.
  • ...and 4 more figures
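
The Figure 2 caption refers to Composable Diffusion conditioned on pose and identity. As a rough illustration of that idea, the conjunction rule commonly associated with composable diffusion combines one unconditional noise prediction with several conditional ones, each with its own weight. The sketch below assumes flat lists of floats in place of real noise tensors; the function name `compose_noise` and the weights are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def compose_noise(
    eps_uncond: Sequence[float],
    eps_conds: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond).

    Each entry of `eps_conds` stands for the noise predicted under one
    condition (for example pose or identity); this mapping is an assumption.
    """
    if len(eps_conds) != len(weights):
        raise ValueError("need exactly one weight per conditional prediction")
    out = list(eps_uncond)
    for eps_c, w in zip(eps_conds, weights):
        for k, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            out[k] += w * (c - u)  # push the estimate toward this condition
    return out
```

With a single condition this reduces to the familiar classifier-free guidance update; adding more conditions lets each one steer the sample independently.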