
Style Quantization for Data-Efficient GAN Training

Jian Wang, Xin Lan, Jizhe Zhou, Yuxin Tian, Jiancheng Lv

TL;DR

SQ-GAN introduces a style-space quantization framework that converts a sparse input latent space into a compact, discrete proxy space $\mathcal{W}^q$ by partitioning each style vector in $\mathcal{W}$ into $s$ sub-vectors and quantizing each with a learnable codebook. A knowledge-enhanced codebook initialization (CBI), based on optimal transport and CLIP-based semantic alignment, embeds external knowledge into the codebook to produce a semantically rich vocabulary for limited-data GAN training. The approach couples adversarial losses with quantization and consistency regularization (CR) losses, including a novel uniformity regularization that prevents codebook collapse and a quantization-based CR that stabilizes discriminator evaluations under perturbations. Experiments on four datasets show substantial gains in FID, IS, and KID, with ablations confirming the benefits of code dimension, uniformity regularization, and CBI. Overall, SQ-GAN offers a data-efficient way to exploit the latent space for high-quality image synthesis under data scarcity.
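
The partition-and-quantize step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `quantize_style`, the nearest-neighbor rule under squared Euclidean distance, and the single codebook shared by all $s$ sub-vectors are assumptions made for illustration, and the gradient handling needed to train through the quantizer is omitted.

```python
import numpy as np

def quantize_style(w, codebook, s):
    """Split each style vector into s sub-vectors and snap each to its nearest code.

    w:        (batch, dim) array of style vectors (hypothetical stand-in for W).
    codebook: (K, dim // s) array of codes (assumed shared across sub-vectors).
    Returns the quantized vectors (batch, dim) and the chosen indices (batch, s).
    """
    batch, dim = w.shape
    assert dim % s == 0, "dim must be divisible by the number of sub-vectors"
    sub = w.reshape(batch, s, dim // s)
    # Squared Euclidean distance between every sub-vector and every code:
    # result has shape (batch, s, K).
    dists = ((sub[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=-1)
    # Look up the selected codes and concatenate them back into full vectors.
    w_q = codebook[idx].reshape(batch, dim)
    return w_q, idx

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))          # 4 style vectors of dimension 8
codebook = rng.normal(size=(16, 2))  # 16 codes of dimension 8 / 4 = 2
w_q, idx = quantize_style(w, codebook, s=4)
```

With $s$ sub-vectors and $K$ codes, the quantized space holds at most $K^s$ distinct vectors, which is what makes the proxy space compact and discrete.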

Abstract

In limited-data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose SQ-GAN, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific real data point, thereby improving CR performance. Instead of quantizing the input latent variables directly, we first map them into a less entangled "style" space and then quantize them using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we optimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate that our method significantly improves both discriminator robustness and generation quality.
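
To make the optimal transport alignment more concrete, the sketch below evaluates an entropy-regularized transport cost between a set of codes and a set of features using Sinkhorn iterations. Everything specific here is an assumption rather than the paper's method: the abstract does not name a solver, so Sinkhorn is swapped in as a standard choice, and the cost ($1 - $ cosine similarity), the uniform marginals, the values of `eps` and `n_iters`, and the function name `sinkhorn_ot` are illustrative. Random arrays stand in for the codebook and for foundation-model features.

```python
import numpy as np

def sinkhorn_ot(codes, feats, eps=0.1, n_iters=200):
    """Entropy-regularized OT between codes (K, d) and features (N, d).

    Assumed setup: cost = 1 - cosine similarity, uniform mass on both sides.
    Returns the transport cost <plan, cost> and the transport plan (K, N).
    """
    c = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cost = 1.0 - c @ f.T                      # pairwise cost in [0, 2]
    a = np.full(len(c), 1.0 / len(c))         # uniform mass over codes
    b = np.full(len(f), 1.0 / len(f))         # uniform mass over features
    kernel = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)                # rescale to match column marginals
        u = a / (kernel @ v)                  # rescale to match row marginals
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan

rng = np.random.default_rng(0)
codes = rng.normal(size=(8, 16))    # hypothetical codebook: 8 codes
feats = rng.normal(size=(32, 16))   # hypothetical features of 32 training images
ot_cost, plan = sinkhorn_ot(codes, feats)
```

In a training setup one would minimize this cost with respect to the codes using an automatic differentiation framework; the sketch only evaluates it for fixed inputs.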

Paper Structure

This paper contains 51 sections, 20 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Illustration of the proposed style quantization GAN (SQ-GAN) framework. (I) The input latent variables $\bm{z}$ are mapped to an intermediate latent space $\mathcal{W}$, which is quantized into a compact and structured proxy space $\mathcal{W}^q$ by a learnable codebook $\mathcal{C}$. The quantized codes $\bm{w}^q\in\mathcal{W}^q$ are then fed into the synthesis network to generate images. (II) The codebook initialization aligns the codebook codes with features extracted from the training data, embedding external knowledge into the codebook.
  • Figure 2: The overall framework of SQ-GAN. (a) Style quantization. For a batch of intermediate variables $\bm{w}$, we apply an intermediate latent space quantization technique, segmenting and quantizing them using a hyperspherical codebook. The concatenated discrete codes are directly fed into the generator for image synthesis. (b) Knowledge-enhanced codebook initialization. We perform an alignment strategy grounded in optimal transport distance to embed semantic knowledge from foundation models into the codebook. The transformation and transmission of latent variables and features occur within the corresponding latent space, as illustrated at the top of the figure. Our framework constructs a vocabulary-rich codebook, which ensures that the entries within the codebook adequately represent a diverse and compact set of image features, suitable for image generation in limited data scenarios.
  • Figure 3: Evolution of FID ($\Downarrow$) scores during training on the Oxford-Dog and FFHQ-2.5K datasets.
  • Figure 4: Distribution of semantic similarity. We compute the cosine similarity between the features extracted from the image and semantic information from category text using the CLIP model, denoted as $\Phi(\cdot)$.
  • Figure 5: Evolution of FID ($\Downarrow$) scores during training on the Oxford-Dog and FFHQ-2.5K datasets with different code dimensions.
  • ...and 8 more figures
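
The semantic similarity measured in Figure 4 is a cosine similarity between image features and category-text features. A minimal sketch of that computation follows, assuming the features have already been extracted. The random arrays are placeholders for CLIP embeddings, and the function name `cosine_similarities`, the feature dimension, and the histogram binning are illustrative choices, not details taken from the paper.

```python
import numpy as np

def cosine_similarities(image_feats, text_feat):
    """Cosine similarity between each image feature (N, d) and one text feature (d,)."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feat / np.linalg.norm(text_feat)
    return img @ txt

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(100, 512))  # placeholder for image embeddings
text_feat = rng.normal(size=512)           # placeholder for a category-text embedding
sims = cosine_similarities(image_feats, text_feat)

# Bin the similarities to obtain an empirical distribution over [-1, 1].
hist, edges = np.histogram(sims, bins=20, range=(-1.0, 1.0))
```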