MambaStyle: Efficient StyleGAN Inversion for Real Image Editing with State-Space Models

Jhon Lopez, Carlos Hinojosa, Henry Arguello, Bernard Ghanem

TL;DR

MambaStyle tackles the challenge of inverting real images into StyleGAN latent spaces with a method that balances reconstruction fidelity, editability, and computational efficiency. It introduces Vision State-Space Models (VSSMs) into a single-stage encoder to produce latent codes in $\mathcal{W}^{+}$ and spatial features, augmented by a Fuser that injects editing directions into feature maps for precise, localized edits. The architecture combines a Multi-scale Mamba-based Encoder with a Fuser and uses StyleGAN2 for synthesis, trained with a composite loss that enforces high-fidelity reconstruction and structured, editable transformations via $\mathcal{L}_{\text{rec}}, \mathcal{L}_{\text{perc}}, \mathcal{L}_{\text{id}}, \mathcal{L}_{\text{struct}}, \mathcal{L}_{\text{e}}$. Empirical results on CelebA-HQ and Stanford Cars show MambaStyle achieves superior inversion quality and editing performance while significantly reducing model complexity and inference time, enabling real-time applications. Overall, the work provides a scalable, efficient pathway for high-quality real-image editing with StyleGAN by leveraging VSSMs and targeted feature-level fusion.
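One plausible reading of the composite loss is a weighted sum of the five listed terms. The sketch below uses only the loss symbols named above. The additive form and the weights $\lambda_{\ast}$ are assumptions for illustration, not values or a formulation taken from the paper.

```latex
% Hypothetical weighted-sum form; the \lambda weights are illustrative placeholders.
\mathcal{L}_{\text{total}}
  = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}}
  + \lambda_{\text{perc}}\,\mathcal{L}_{\text{perc}}
  + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}}
  + \lambda_{\text{struct}}\,\mathcal{L}_{\text{struct}}
  + \lambda_{\text{e}}\,\mathcal{L}_{\text{e}}
```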

Abstract

The task of inverting real images into StyleGAN's latent space to manipulate their attributes has been extensively studied. However, existing GAN inversion methods struggle to balance high reconstruction quality, effective editability, and computational efficiency. In this paper, we introduce MambaStyle, an efficient single-stage encoder-based approach for GAN inversion and editing that leverages vision state-space models (VSSMs) to address these challenges. Specifically, our approach integrates VSSMs within the proposed architecture, enabling high-quality image inversion and flexible editing with significantly fewer parameters and reduced computational complexity compared to state-of-the-art methods. Extensive experiments show that MambaStyle achieves a superior balance among inversion accuracy, editing quality, and computational efficiency. Notably, our method achieves superior inversion and editing results with reduced model complexity and faster inference, making it suitable for real-time applications.

Paper Structure

This paper contains 11 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Our method encodes real images into the StyleGAN latent space, applies edits and synthesizes the edited images. We compare our MambaStyle with prior methods in the right plot in terms of MS-SSIM$\downarrow$ for inversion quality and FID$\downarrow$ for editing capability. Larger markers indicate a higher model parameter count.
  • Figure 2: During training, we leverage the pretrained StyleGAN2 Generator and Mapping network to generate an image $X$ and its edited version $X_e$ from a random noise vector $z$ and an editing direction $d$. Then, we learn the latent vectors $\hat{w} \in \mathcal{W}^{+}$ and features $\hat{F}_k \in \mathcal{F}_{k}$ using our proposed MambaStyle architecture. This architecture comprises (i) a Multi-scale Mamba-based Encoder that encodes the input image $X$ and produces $w'$ and features $\hat{H}_3$; and (ii) a Fuser module, which integrates the editing direction $d$ with features $\hat{H}_3$ to generate $\hat{F}_k$. Using $\hat{w}$ and $\hat{F}_k$, the pretrained StyleGAN2 model synthesizes an image with $G(\hat{F}_k, \hat{w})$. Our proposed framework is flexible, allowing both image inversion and editing conditioned on the $d$ direction. Specifically, setting $d = \mathbf{0}$ enables image inversion, reconstructing an approximation of the original image $\hat{X}$; when $d \neq \mathbf{0}$, our framework performs image editing, generating the edited image $\hat{X}_e$ based on the specified direction $d$. During inference, we follow the same pipeline, with the encoder taking the real image $X$ from the target dataset.
  • Figure 3: (Right) Vision State-Space Module (VSSM), and (Left) the 2D Selective Scan Submodule (SS2D).
  • Figure 4: Visual comparison of our proposed method with prior encoder-based approaches in the face domain, based on inversion reconstruction and face editability. Row 1: inversion; Row 2: increasing age; Row 3: Afro hairstyle; Row 4: glasses addition.
  • Figure 5: Visual comparison of our proposed method with previous encoder-based approaches in the car domain. Row 1: inversion; Row 2: color change; Row 3: grass addition.
  • ...and 2 more figures
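
The control flow described in the Figure 2 caption (encode, fuse with the direction $d$, synthesize) can be sketched with toy stand-ins. Everything below is hypothetical: `toy_encoder`, `toy_fuser`, and `toy_generator` are placeholder functions operating on plain lists, not the actual Mamba encoder, Fuser, or StyleGAN2 generator. The step $\hat{w} = w' + d$ is an assumption that the caption does not state. The sketch only illustrates how a single pipeline covers both inversion ($d = \mathbf{0}$) and editing ($d \neq \mathbf{0}$).

```python
from typing import List, Tuple

Vector = List[float]


def toy_encoder(x: Vector) -> Tuple[Vector, Vector]:
    """Placeholder for the Multi-scale Mamba-based Encoder: returns (w', H3)."""
    w_prime = [0.5 * v for v in x]  # stand-in latent code
    h3 = [v + 1.0 for v in x]       # stand-in spatial features
    return w_prime, h3


def toy_fuser(h3: Vector, d: Vector) -> Vector:
    """Placeholder for the Fuser: combines features H3 with direction d into F_k."""
    return [h + di for h, di in zip(h3, d)]


def toy_generator(f_k: Vector, w_hat: Vector) -> Vector:
    """Placeholder for the pretrained generator call G(F_k, w_hat)."""
    return [f * w for f, w in zip(f_k, w_hat)]


def forward(x: Vector, d: Vector) -> Vector:
    """Same pipeline for inversion (d = 0) and editing (d != 0)."""
    w_prime, h3 = toy_encoder(x)
    # Assumption: the edited latent is w' shifted by d (not stated in the caption).
    w_hat = [w + di for w, di in zip(w_prime, d)]
    f_k = toy_fuser(h3, d)
    return toy_generator(f_k, w_hat)


x = [1.0, 2.0, 3.0]
zero_direction = [0.0] * len(x)
edit_direction = [0.5, 0.0, 0.0]

x_hat = forward(x, zero_direction)    # inversion: d = 0
x_hat_e = forward(x, edit_direction)  # editing: d != 0
print(x_hat, x_hat_e)
```

With a zero direction the Fuser and the latent shift become no-ops, so the output depends only on the encoder, which mirrors the caption's statement that $d = \mathbf{0}$ yields reconstruction.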