CA-Edit: Causality-Aware Condition Adapter for High-Fidelity Local Facial Attribute Editing

Xiaole Xian, Xilin He, Zenghao Niu, Junliang Zhang, Weicheng Xie, Siyang Song, Zitong Yu, Linlin Shen

TL;DR

CA-Edit tackles high-fidelity local facial attribute editing driven by textual descriptions. It introduces LAMask-Caption for local attribute captions, a Causality-Aware Condition Adapter (CA$^{2}$) to fuse original-skin details with text cues, and Skin Transition Frequency Guidance (STFG) to ensure natural boundary transitions via low-frequency guidance in the diffusion process. The approach achieves superior fidelity and editability on local edits without attribute-specific fine-tuning, validated through quantitative metrics, user studies, and qualitative comparisons, with code available at the project repository. This work advances practical, text-guided facial editing by explicitly modeling local context and boundary coherence within diffusion-based inpainting.

Abstract

For efficient and high-fidelity local facial attribute editing, most existing editing methods either require additional fine-tuning for different editing effects or tend to affect areas beyond the editing region. Alternatively, inpainting methods can edit the target image region while preserving external areas. However, current inpainting methods still suffer from misalignment between the generated content and the facial attribute description, and from the loss of facial skin details. To address these challenges, (i) a novel data utilization strategy is introduced to construct datasets consisting of attribute-text-image triples from a data-driven perspective, and (ii) a Causality-Aware Condition Adapter is proposed to enhance the contextual causality modeling of specific details; it encodes the skin details from the original image while preventing conflicts between these cues and textual conditions. In addition, a Skin Transition Frequency Guidance technique is introduced for the local modeling of contextual causality via sampling guidance driven by low-frequency alignment. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in boosting both fidelity and editability for localized attribute editing. The code is available at https://github.com/connorxian/CA-Edit.
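
To make the phrase "low-frequency alignment" concrete, the following is a minimal sketch, not the paper's STFG implementation. It assumes single-channel images stored as 2-D NumPy arrays and uses a hard circular low-pass mask in the Fourier domain. The function names `low_pass` and `low_freq_alignment` and the `cutoff` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def low_pass(image: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Keep only spatial frequencies with radius <= cutoff (cycles per pixel).

    Hypothetical helper: a hard circular mask in the 2-D Fourier domain.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, shape (h, 1)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, shape (1, w)
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))


def low_freq_alignment(edited: np.ndarray, original: np.ndarray,
                       cutoff: float = 0.1) -> float:
    """Mean squared difference between the low-pass versions of two images.

    Assumed scoring rule for illustration: a small value means the two
    images agree in coarse tone and shading, regardless of fine texture.
    """
    diff = low_pass(edited, cutoff) - low_pass(original, cutoff)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    original = np.full((8, 8), 0.5)
    # Fine checkerboard texture: purely high-frequency, so it is ignored.
    rows, cols = np.indices((8, 8))
    textured = original + 0.2 * (-1.0) ** (rows + cols)
    # Uniform brightness shift: purely low-frequency, so it is penalised.
    shifted = original + 0.1
    print(low_freq_alignment(textured, original))
    print(low_freq_alignment(shifted, original))
```

In this toy setting the score ignores a change in fine texture but penalises a change in overall tone, which is the kind of behaviour one would want at the boundary between edited and preserved skin. A guided sampler would typically use the gradient of such a term with respect to the current sample; that step is omitted here.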

Paper Structure

This paper contains 25 sections, 17 equations, 16 figures, 3 tables, 2 algorithms.

Figures (16)

  • Figure 1: (Top) The existing text-guided inpainting pipeline for our local attribute editing task. (Bottom) Our method takes into account the causality of the specific details from the original image, improving editability and fidelity.
  • Figure 2: The pipeline of LAMask-Caption construction.
  • Figure 3: The training process of our method. The CA$^{2}$ in the Reference Net injects specific skin details from the original image as image embeddings via an additional attention mechanism. Furthermore, the CA$^{2}$ employs an adaptive score map that dynamically modulates the intensity of the visual condition, preventing conflicts in the causality modeling.
  • Figure 4: Qualitative comparison on local facial attribute editing. Compared with zero-shot methods (i.e., SD inpainting wang2022high, InstructPix2Pix brooks2023instructpix2pix, BrushNet ju2024brushnet) and facial editing methods (StyleCLIP patashnik2021styleclip, DiffusionCLIP kim2022diffusionclip, Asyrp KwonJU23), our approach not only aligns the edited parts with the text prompts but also better preserves the information from the original image.
  • Figure 5: Visualization of the score maps in CA$^{2}$ during inference. Lighter regions indicate higher values in the maps. The DDIM scheduler with $t=50$ timesteps is used.
  • ...and 11 more figures