Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing

Hao Li, Mengqi Huang, Lei Zhang, Bo Hu, Yi Liu, Zhendong Mao

TL;DR

This work tackles the fundamental challenge of reconciling high-fidelity reconstruction with robust image attribute editing in GAN inversion. It introduces GradStyle, a dual-stream framework with a Reconstruction Stream that uses Gradual Residual Modules and a Gate&Fusion mechanism, and an Editing Stream that employs a Global Alignment Module with Selective Attention to align and inject details progressively across multiple generation stages. A self-supervised training strategy enables joint optimization without edited labels, aligning residuals with edited layouts via misalignment augmentations and an aligner loss. Across faces and other domains, GradStyle demonstrates superior reconstruction fidelity and editing quality, with strong generalization and reduced artifacts compared to existing methods. The approach offers a practical pathway to faithful real-image inversion and controllable attribute editing in StyleGAN-based pipelines.

Abstract

GAN-based image attribute editing first uses GAN inversion to project real images into the latent space of a GAN and then manipulates the corresponding latent codes. Recent inversion methods mainly use additional high-bit features to improve detail preservation, because low-bit codes cannot faithfully reconstruct source images and therefore lose details. During editing, however, existing works fail to accurately complement the lost details and suffer from poor editability. The main reason is that they inject all the lost details indiscriminately at one time, which causes the position and quantity of details to overfit the source image and results in inconsistent content and artifacts in edited images. This work argues that details should be injected gradually into both the reconstruction and editing processes in a multi-stage, coarse-to-fine manner for better detail preservation and high editability. A novel dual-stream framework is therefore proposed to accurately complement details at each stage. The Reconstruction Stream embeds coarse-to-fine lost details into residual features and then adaptively adds them to the GAN generator. In the Editing Stream, residual features are accurately aligned by the Selective Attention mechanism and then injected into the editing process in a multi-stage manner. Extensive experiments show that the framework outperforms existing methods in both reconstruction accuracy and editing quality.
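
The contrast between one-shot and gradual injection can be illustrated with a toy sketch. It is not the paper's Gate&Fusion implementation: the sigmoid gate, the flat list "features", and the names `gated_fuse` and `generate_with_residuals` are all assumptions made for illustration. It only shows the general idea of adding a gated residual after every generator stage, coarse to fine, rather than injecting everything at a single point.

```python
import math


def sigmoid(x):
    """Logistic function, used here as a hypothetical gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(feature, residual, gate_logits):
    """Add a residual to a feature, scaled element-wise by a sigmoid gate.

    A strongly negative gate logit suppresses the residual; a strongly
    positive one passes it through almost unchanged.
    """
    return [f + sigmoid(g) * r for f, r, g in zip(feature, residual, gate_logits)]


def generate_with_residuals(blocks, x, residuals, gate_logits):
    """Run toy generator blocks in order, fusing one residual per stage.

    `blocks` are plain callables standing in for generator stages
    (an assumption; a real StyleGAN block also takes a latent code).
    """
    for block, residual, gates in zip(blocks, residuals, gate_logits):
        x = block(x)                         # coarse-to-fine generation step
        x = gated_fuse(x, residual, gates)   # inject this stage's lost details
    return x


if __name__ == "__main__":
    double = lambda v: [2.0 * a for a in v]  # stand-in for a generator stage
    out = generate_with_residuals(
        blocks=[double, double],
        x=[1.0],
        residuals=[[1.0], [1.0]],
        gate_logits=[[0.0], [0.0]],          # sigmoid(0) = 0.5 at both stages
    )
    print(out)
```

Because each stage receives its own residual and gate, a detail that is inappropriate at one scale can be suppressed there without discarding the residuals of the other stages.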

Paper Structure (19 sections, 15 equations, 18 figures, 4 tables)

Figures (18)

  • Figure 1: Illustration of our motivation. Given a source image, we edit one of its attributes (e.g., pose). (a) Existing high-bit inversion injects as many of the lost reconstruction details as possible into the edited image at one time, which leads to inconsistent content and artifacts. (b) Our method gradually aligns and complements lost details at different stages of editing, achieving both high-quality detail preservation and high editability while mitigating artifacts.
  • Figure 2: An overview of our dual-stream framework GradStyle. It consists of three parts: an Encoding Phase for embedding images, a Reconstruction Stream for faithful reconstruction and residual feature calculation, and an Editing Stream that generates the edited image by gradually aligning and adding detail information. The proposed Gradual Residual Module and Global Alignment Module are also illustrated; the details of the Aligner are shown in Fig. 3.
  • Figure 3: Detailed structure of the Aligner block and an image-level visualization example of Selective Attention (in practice it is applied at the feature level). In the first row of the example, a coarsely edited image stands for the block feature $f_m^{edi}$ (query), and the unaligned residual feature $F_m$ (key and value) is on its right. In the second row, the attention map for one query region shows many irrelevant regions; Selective Attention suppresses the irrelevant regions and enhances the relevant ones. The last row shows that a region of $F_m'$ is composed of similar regions of the unaligned $F_m$.
  • Figure 4: Qualitative results of reconstruction and editing. The left shows reconstructed results from several recent methods and our method, and the right shows edited results based on InterfaceGAN (Shen et al., 2020). The source is in the middle.
  • Figure 5: Generalizability of our self-supervised training strategy to deal with various attributes.
  • ...and 13 more figures
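
The Selective Attention behaviour described in the Figure 3 caption (query regions from the edited feature, key and value regions from the unaligned residual, irrelevant regions suppressed) can be sketched roughly as follows. The caption does not say how regions are selected, so the top-k rule used here is purely an assumption, as are the function name `selective_attention` and the list-of-vectors representation of regions.

```python
import math


def selective_attention(queries, keys, values, top_k=2):
    """Cross-attention that keeps only the top_k most similar keys per query.

    queries: regions of the edited feature (stand-in for f_m^edi)
    keys, values: regions of the unaligned residual (stand-in for F_m)
    Returns one aligned vector per query (stand-in for F_m').
    """
    dim = len(queries[0])
    aligned = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Hypothetical selection rule: drop everything outside the top_k scores.
        keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
        # Softmax over the kept keys only, so suppressed regions get zero weight.
        peak = max(scores[i] for i in keep)
        weights = {i: math.exp(scores[i] - peak) for i in keep}
        total = sum(weights.values())
        aligned.append([
            sum(weights[i] / total * values[i][d] for i in keep)
            for d in range(len(values[0]))
        ])
    return aligned


if __name__ == "__main__":
    queries = [[1.0, 0.0]]
    keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
    values = [[1.0], [2.0], [3.0]]
    print(selective_attention(queries, keys, values, top_k=1))
```

With `top_k=1` the aligned region is copied from the single most similar residual region; larger values blend several similar regions, which matches the caption's statement that a region of $F_m'$ is combined from similar regions of $F_m$.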