Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields

Zhuo He, Paul Henderson, Nicolas Pugeault

TL;DR

This work introduces generative fields to explain how StyleGAN2 synthesizes features across scales through inverted receptive fields, enabling interpretable, hierarchical feature control. It leverages the channel-wise style space $\mathcal{S}$ to design an editing pipeline that disentangles content generation from pose and expression editing at synthesis time, without retraining the generator. The proposed five-network architecture ($G$, $E_{id}$, $E_{attr}$, $M_{ref}$, $E_{lnd}$) with losses $\mathcal{L}_{id}$, $\mathcal{L}_{attr}$, and $\mathcal{L}_{rec}$, augmented by style-space regularization, yields improved identity preservation and pose editing versus prior methods, and reveals a sparse set of style channels that drive edits. The findings highlight a trade-off between generative-field size and editing fidelity and offer a theoretically grounded, efficient approach for fine-grained face editing with potential limitations due to limited 3D supervision.

Abstract

StyleGAN has demonstrated the ability of GANs to synthesize highly realistic faces of imaginary people from random noise. One limitation of GAN-based image generation is the difficulty of controlling the features of the generated image, due to the strong entanglement of the low-dimensional latent space. Previous work that aimed to control StyleGAN with image or text prompts modulated sampling in the $\mathcal{W}$ latent space, which is more expressive than the $\mathcal{Z}$ latent space. However, $\mathcal{W}$ space still has restricted expressivity, since it does not control the feature synthesis directly; moreover, the feature embedding in $\mathcal{W}$ space requires a pre-training process to reconstruct the style signal, which limits its application. This paper introduces the concept of "generative fields" to explain the hierarchical feature synthesis in StyleGAN, inspired by the receptive fields of convolutional neural networks (CNNs). Additionally, we propose a new image editing pipeline for StyleGAN using generative field theory and the channel-wise style latent space $\mathcal{S}$, exploiting the intrinsic structure of CNNs to achieve disentangled control of feature synthesis at synthesis time.

Paper Structure

This paper contains 22 sections, 12 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Image generation process in StyleGAN2. The bottom shows the whole generation pipeline, including the content process and the style sampling process; the top shows the detailed style modulation process.
  • Figure 2: Generative fields produced by convolution units at different StyleGAN2 generator blocks. The leftmost unit feature map size is $8\times8$ and controls the largest generative field, of size $251\times251$; conversely, the rightmost unit feature map size is $128\times128$ and controls the smallest generative field, of size $11\times11$.
  • Figure 3: Facial landmarks (left) and head pose Euler angles (right).
  • Figure 4: Image editing pipeline for StyleGAN2 using style space $\mathcal{S}$. The identity input comprises a latent vector $Z$ and the corresponding generated image $I_{id}$, used to generate a face with the same identity features. The attribute input is a reference image $I_{attr}$ from which we extract facial features (expression, head pose) for controlling the image generation. All control signals act within each generator block, modulating the style signals sampled in the layer-wise style space $\mathcal{S}$.
  • Figure 5: Image editing results. Identity images are generated randomly by StyleGAN2; attribute images are real images sampled from the FFHQ256 dataset. The edited identity images should capture the pose and expression of the attribute images.
  • ...and 4 more figures
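
The Figure 2 caption reports only two generative-field sizes: $11\times11$ for a unit on a $128\times128$ feature map and $251\times251$ for a unit on an $8\times8$ feature map. As a rough illustration of how the field could grow as the feature map gets coarser, the sketch below assumes a simple recurrence, $g(r/2) = 2\,g(r) + 5$, chosen only because it reproduces both reported numbers. The recurrence, the offset of 5, and the function name `generative_field_sizes` are assumptions made for illustration, not the paper's stated formula, and the intermediate sizes it prints are not reported in this summary.

```python
def generative_field_sizes(finest_res=128, finest_size=11, coarsest_res=8, offset=5):
    """Map feature-map resolution -> generative-field side length (output pixels).

    Assumed recurrence (fit to the two sizes quoted in the Figure 2 caption,
    NOT taken from the paper): halving the feature-map resolution maps a
    field of side g to a field of side 2 * g + offset.
    """
    sizes = {}
    res, g = finest_res, finest_size
    while res >= coarsest_res:
        sizes[res] = g
        res //= 2             # move to the next coarser generator block
        g = 2 * g + offset    # assumed growth of the field per block
    return sizes


sizes = generative_field_sizes()
for res in sorted(sizes, reverse=True):
    print(f"{res}x{res} feature map -> {sizes[res]}x{sizes[res]} generative field")
```

Under this assumption, each coarser block roughly doubles the side of the generative field, which matches the caption's observation that the coarsest units control the largest image regions and the finest units the smallest.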