Auto DragGAN: Editing the Generative Image Manifold in an Autoregressive Manner

Pengxiang Cai, Zhiwei Liu, Guibo Zhu, Yunfang Niu, Jinqiao Wang

TL;DR

Auto DragGAN targets pixel-level fine-grained image editing with fast inference by learning latent-code motion in the StyleGAN latent space. It introduces two components: a Latent Predictor, a transformer encoder-decoder that autoregressively predicts the latent trajectory $w_0 \rightarrow w_n$, and a Latent Regularizer. Two-stage training keeps the latent motion within the natural $\mathcal{W}^{+}$ distribution while enabling pixel-precise edits at high speed, with results competitive with or better than DragGAN. This approach enables interactive, high-fidelity editing of GAN-generated images with substantially reduced inference time.

Abstract

Pixel-level fine-grained image editing remains an open challenge. Previous works fail to achieve an ideal trade-off between control granularity and inference speed: they either lack pixel-level fine-grained control, or their inference speed still needs improvement. To address this, this paper is the first to employ a regression-based network to learn how StyleGAN latent codes vary during the image dragging process. The method enables pixel-precise drag editing at little time cost. Users can specify handle points and their corresponding target points on any GAN-generated image, and our method moves each handle point to its corresponding target point. Through experimental analysis, we find that a short movement distance from handle points to target points yields a high-fidelity edited image, as the model only needs to predict the movement of a small portion of pixels. To exploit this, we decompose the entire movement process into multiple sub-processes. Specifically, we develop a transformer encoder-decoder network named 'Latent Predictor' that predicts the latent code motion trajectories from handle points to target points in an autoregressive manner. Moreover, to enhance prediction stability, we introduce a component named 'Latent Regularizer', which constrains the latent code motion to the distribution of natural images. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) inference speed and image editing performance at pixel-level granularity.
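The decomposition into sub-processes described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `split_drag`, `autoregressive_drag`, `toy_predictor`, and the `max_step` parameter are hypothetical names, the `(18, 512)` latent shape is an assumed StyleGAN-style $\mathcal{W}^{+}$ layout, and the toy predictor merely stands in for the trained transformer-based Latent Predictor. The sketch only shows the control flow: a long drag is cut into short steps, and each step's predicted latent code is fed back as the input of the next step.

```python
import numpy as np


def split_drag(handle, target, max_step):
    """Cut the straight handle -> target path into waypoints at most
    `max_step` pixels apart (hypothetical helper, not from the paper)."""
    handle = np.asarray(handle, dtype=float)
    target = np.asarray(target, dtype=float)
    dist = np.linalg.norm(target - handle)
    n_steps = max(1, int(np.ceil(dist / max_step)))
    ts = np.linspace(0.0, 1.0, n_steps + 1)[:, None]
    return handle + ts * (target - handle)  # shape: (n_steps + 1, 2)


def toy_predictor(w, p_from, p_to):
    """Stand-in for the Latent Predictor (assumption): nudges the first two
    latent channels by a scaled copy of the requested pixel displacement."""
    delta = np.zeros_like(w)
    delta[..., :2] = p_to - p_from
    return w + 0.01 * delta


def autoregressive_drag(w0, handle, target, predictor, max_step=4.0):
    """Roll out w_0 -> w_1 -> ... -> w_n, one short sub-drag per step."""
    waypoints = split_drag(handle, target, max_step)
    w = w0
    trajectory = [w0]
    for p_from, p_to in zip(waypoints[:-1], waypoints[1:]):
        w = predictor(w, p_from, p_to)  # previous output is the next input
        trajectory.append(w)
    return trajectory, waypoints


w0 = np.zeros((18, 512))  # assumed W+ shape, for illustration only
trajectory, waypoints = autoregressive_drag(w0, (10, 10), (10, 20), toy_predictor)
print(len(trajectory) - 1, "sub-steps")
```

In this sketch a smaller `max_step` produces more, shorter sub-drags, which mirrors the abstract's observation that short movement distances are easier to predict with high fidelity; the actual step schedule used by the paper is not specified in this section.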

Paper Structure (16 sections, 15 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Comparison of UserControllableLT [endo2022user], DragGAN [pan2023drag], FreeDrag [ling2023freedrag], and our proposed Auto DragGAN on key performance indicators. Inference time (seconds) $\downarrow$ and image fidelity (FID) $\downarrow$ were both measured in the face landmark manipulation experiment, under the settings described in the face landmark manipulation subsection and using the 'one point' setting.
  • Figure 2: Users can specify handle points (marked in red) and target points (marked in blue) on any GAN-generated image, and our method precisely moves the handle points to their corresponding target points, achieving the desired drag effect on the image. We compare DragGAN [pan2023drag] and DragDiffusion [shi2023dragdiffusion] with our proposed Auto DragGAN; Auto DragGAN demonstrates superior drag performance.
  • Figure 3: The overview of our proposed Auto DragGAN. (a) corresponds to the first stage of training, namely the pre-training of the Latent Regularizer. (b) represents the second stage of training, which is the joint training of the Latent Predictor and the Latent Regularizer.
  • Figure 4: The outlier latent codes. The shortest motion path in the $\mathcal{W}^{+}$ space between the latent code $w_0$ and its edited result $w_n$ is depicted as the blue dashed line in the figure, while the green dashed line represents the motion trajectory learned by our model. $w_n'$ and $w_n''$ are the outlier latent codes, predicted by the model without the use of the Latent Regularizer.
  • Figure 5: Reconstruction of the outlier latent codes. For each set of images, the first, second, and third columns correspond to the initial randomly sampled latent code $w$, the outlier latent code $w'$, and the reconstructed latent code $\hat{w}$, respectively.
  • ...and 5 more figures