StyleAlign: Analysis and Applications of Aligned StyleGAN Models

Zongze Wu, Yotam Nitzan, Eli Shechtman, Dani Lischinski

TL;DR

The paper analyzes aligned StyleGAN2 models obtained by fine-tuning a parent network to a new domain. It finds that the child's latent spaces $\mathcal{W}$ and $\mathcal{S}$ retain the parent's rich semantics and that most fine-tuning changes occur in the feature convolution layers. Semantic alignment holds across both related and distant domains; some directions appear forgotten in the child but can be recovered when retraining toward the parent. Building on these insights, the authors showcase simple yet effective applications: cross-domain image translation, automatic morphing, and zero-shot transfer tasks, often achieving state-of-the-art results with minimal task-specific engineering. The work provides a thorough empirical study of alignment, introduces practical inversion and interpolation techniques, and releases resources to facilitate reproduction and further exploration of aligned generative models.

Abstract

In this paper, we perform an in-depth study of the properties and applications of aligned generative models. We refer to two models as aligned if they share the same architecture, and one of them (the child) is obtained from the other (the parent) via fine-tuning to another domain, a common practice in transfer learning. Several works already utilize some basic properties of aligned StyleGAN models to perform image-to-image translation. Here, we perform the first detailed exploration of model alignment, also focusing on StyleGAN. First, we empirically analyze aligned models and provide answers to important questions regarding their nature. In particular, we find that the child model's latent spaces are semantically aligned with those of the parent, inheriting incredibly rich semantics, even for distant data domains such as human faces and churches. Second, equipped with this better understanding, we leverage aligned models to solve a diverse set of tasks. In addition to image translation, we demonstrate fully automatic cross-domain image morphing. We further show that zero-shot vision tasks may be performed in the child domain, while relying exclusively on supervision in the parent domain. We demonstrate qualitatively and quantitatively that our approach yields state-of-the-art results, while requiring only simple fine-tuning and inversion.
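
The role of "simple fine-tuning and inversion" can be illustrated with a toy sketch: an image is inverted to a latent code in one model, and the same code is then decoded by the aligned model. Everything below is an assumption made for illustration, not the authors' implementation. The linear "generators", the function names `generate`, `invert`, and `translate`, and the least-squares inversion are hypothetical stand-ins; a real StyleGAN2 pipeline would use the actual networks and an encoder- or optimization-based inversion.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, image_dim = 8, 32

# Hypothetical stand-ins for the two generators: linear maps from latent to "image".
parent_weights = rng.normal(size=(image_dim, latent_dim))
# The child starts from the parent's weights and drifts during fine-tuning (simulated here by noise).
child_weights = parent_weights + 0.3 * rng.normal(size=(image_dim, latent_dim))


def generate(weights, w):
    """Decode a latent code w into a flat toy image."""
    return weights @ w


def invert(weights, image):
    """Recover the latent code that best reconstructs `image` (least squares)."""
    w, *_ = np.linalg.lstsq(weights, image, rcond=None)
    return w


def translate(image, source_weights, target_weights):
    """Invert in the source model, then decode the same code in the target model."""
    return generate(target_weights, invert(source_weights, image))


# A parent-domain image and its counterpart in the child domain.
w_true = rng.normal(size=latent_dim)
parent_image = generate(parent_weights, w_true)
child_image = translate(parent_image, parent_weights, child_weights)
```

In this toy setting the two outputs share one latent code by construction; the paper's empirical claim is that, for real aligned StyleGAN models, sharing the code also preserves semantics such as pose.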

Paper Structure

This paper contains 21 sections, 33 figures, 7 tables.

Figures (33)

  • Figure 1: Comparison of I2I translation (cat2dog and dog2wild in the AFHQ dataset) with two state-of-the-art methods. Our method generates realistic target domain images that capture the pose from the source image. In contrast, both CUT and F-LSeSim fail to generate realistic images since they follow the shape of the source domain image too closely. A quantitative comparison in the table below indicates our method is superior by a wide margin, in both FID and KID.
  • Figure 2: Comparison of reference-based image translation with StarGAN2 and OverLORD. Our method generates realistic target domain images that combine pose and structure from the source image with texture and color from the reference. StarGAN2 follows the source shape too closely, resulting in non-realistic animals (1st example in dog2cat, all examples in wild2dog). OverLORD's results preserve the appearance of the reference well, but sometimes fail to capture the pose and structure (e.g., ear shape) from the source image (2nd and 3rd examples in wild2dog). A quantitative comparison in the table below indicates superior performance of our method in both FID and KID.
  • Figure 3: Zero-shot dog attribute classification using aligned models (FFHQ and AFHQ dogs). In the top row a human "black hair" classifier becomes a "black fur" classifier, a "curly hair" classifier is able to classify "curly fur", and a "long hair" classifier becomes a "down-pointing ears" classifier. The neutral columns correspond to images whose prediction scores are close to the cutoff value.
  • Figure 4: We reset the weights of different components in child models (Mega, dog, church) to their initial values, which come from the parent model (FFHQ). When resetting the weights in feature convolution layers, the output images change more drastically (content, structure), while resetting the weights of other components causes milder effects. This implies that the feature convolution layers contain most of the newly learned knowledge.
  • Figure 5: Semantic alignment: semantic controls discovered for the parent model (FFHQ) retain their function in the child models (Mega and Metface). This holds for individual channels in $\mathcal{S}$ (bangs, smile, gaze), where the layer and channel numbers are indicated under each column. Semantic alignment is also observed for manipulation directions in $\mathcal{W}$ (pose, age, gender).
  • ...and 28 more figures
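
The weight-reset experiment described in the Figure 4 caption can be sketched as simple bookkeeping over named parameter groups. This is a minimal illustration under stated assumptions: the component names in `COMPONENTS`, the toy array shapes, and the helpers `reset_component` and `weight_change` are hypothetical, and the sketch compares weights only. Measuring the effect on generated images, as the figure does, would require running the actual generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical component names; the shapes are arbitrary toy values.
COMPONENTS = ("mapping", "affine", "feature_conv", "to_rgb")

# Parent parameters, and a child whose every component has changed during "fine-tuning".
parent = {name: rng.normal(size=(4, 4)) for name in COMPONENTS}
child = {name: weights + rng.normal(size=weights.shape) for name, weights in parent.items()}


def reset_component(child_params, parent_params, component):
    """Return a copy of the child with one component restored to the parent's weights."""
    restored = {name: weights.copy() for name, weights in child_params.items()}
    restored[component] = parent_params[component].copy()
    return restored


def weight_change(params_a, params_b):
    """Per-component L2 distance between two parameter sets."""
    return {name: float(np.linalg.norm(params_a[name] - params_b[name])) for name in params_a}


# Reset only the feature convolution weights and check what still differs from the parent.
reset = reset_component(child, parent, "feature_conv")
drift = weight_change(reset, parent)
```

Repeating the reset for each component in turn, and comparing images generated before and after, would mirror the comparison the caption describes.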