DSS-GAN: Directional State Space GAN with Mamba backbone for Class-Conditional Image Synthesis

Aleksander Ogonowski, Konrad Klimaszewski, Przemysław Rokita

Abstract

We present DSS-GAN, the first generative adversarial network to employ Mamba as a hierarchical generator backbone for noise-to-image synthesis. The central contribution is Directional Latent Routing (DLR), a novel conditioning mechanism that decomposes the latent vector into direction-specific subvectors, each jointly projected with a class embedding to produce a feature-wise affine modulation of the corresponding Mamba scan. Unlike conventional class conditioning that injects a global signal, DLR couples class identity and latent structure along distinct spatial axes of the feature map, applied consistently across all generative scales. DSS-GAN achieves improved FID, KID, and precision-recall scores compared to StyleGAN2-ADA across multiple tested datasets. Analysis of the latent space reveals that directional subvectors exhibit measurable specialization: perturbations along individual components produce structured, direction-correlated changes in the synthesized image.

Paper Structure (46 sections, 3 equations, 19 figures, 13 tables)

Figures (19)

  • Figure 1: DSS-GAN generator architecture (only two scan directions are shown for clarity). $Z_{base}$ is the base global latent vector and $z^{k_K}_{dir}$ are the directional latent vectors. At $256 \times 256$ resolution, $N = 5$ and $C_0$, the number of feature-map channels in the DLR blocks, is $148$.
  • Figure 2: DLR block. For each scan direction $k$, the directional latent $\mathbf{z}_\text{dir}^k$ and class embedding $\mathbf{e}_k$ are projected to affine parameters $(\boldsymbol{\gamma}_k, \boldsymbol{\beta}_k)$ that modulate the token sequence before the Mamba SSM. Both the class and the directional latent determine the contribution of each scan direction to the generated feature map, with directional weights $w_\text{dir}^k = f(y, \mathbf{z}_\text{dir}^k)$ satisfying $\sum_{k=1}^{K} w_\text{dir}^k = 1$. A random $180^\circ$ rotation is applied before the block and inverted after unscan.
  • Figure 3: Per-direction feature maps in the Mamba blocks (mean absolute activation, averaged over 10 channels) at each resolution stage, for a representative sample of the tower class from LSUN (bridge, church, and tower classes). Each row corresponds to one scan direction; each column to one resolution stage. At low resolutions the directions capture complementary aspects of global structure; at higher resolutions the directional geometry becomes explicit, with row, column, and diagonal activations forming spatially oriented patterns consistent with their respective scan axes.
  • Figure 4: Evolution of per-direction routing weights across resolution stages during training (AFHQ $256{\times}256$, 3-direction model). Each line corresponds to a training snapshot; colour indicates epoch (light = early, dark = late). The dashed red line marks the uniform weight $1/K$. Arrows indicate the direction of change between the first and last snapshot. Weights diverge from uniform and specialise by resolution, with the column scan dominating at $16{\times}16$ and the diagonal scan peaking at $32{\times}32$.
  • Figure 5: Images generated from the same latent vector $\mathbf{z}$ with three different class labels. The spatial layout is preserved across classes, confirming that $\mathbf{z}_\text{base}$ controls composition independently of class identity. Left: AFHQ $256 \times 256$; right: LSUN $256 \times 256$.
  • ...and 14 more figures
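
The DLR block described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the sizes, the weight matrices `W_affine` and `W_route`, the `(1 + gamma)` form of the scale, the softmax used to make the directional weights sum to one, and the three scan orders are all assumptions chosen for illustration. The Mamba SSM is replaced by a causal running mean as a stand-in, and the random $180^\circ$ rotation is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K scan directions, latent/embedding widths, feature map shape.
K, D_DIR, D_EMB, C, H, W = 3, 16, 8, 4, 4, 4
NUM_CLASSES = 3

# Hypothetical learned parameters, randomly initialised for the sketch.
class_emb = rng.normal(size=(NUM_CLASSES, D_EMB))
W_affine = rng.normal(size=(K, D_DIR + D_EMB, 2 * C)) * 0.1
W_route = rng.normal(size=(K, D_DIR + D_EMB)) * 0.1


def scan_order(k, h, w):
    """Flat pixel indices of an h x w grid: 0 = row scan, 1 = column scan, 2 = diagonal scan."""
    idx = np.arange(h * w).reshape(h, w)
    if k == 0:
        return idx.ravel()
    if k == 1:
        return idx.T.ravel()
    ii, jj = np.indices((h, w))
    # Visit anti-diagonals i + j = 0, 1, 2, ... in order (stable within a diagonal).
    return idx.ravel()[np.argsort((ii + jj).ravel(), kind="stable")]


def dlr_block(x, z_dir, y):
    """x: (H, W, C) feature map, z_dir: (K, D_DIR) directional latents, y: class label."""
    h, w, c = x.shape
    tokens = x.reshape(h * w, c)
    outs, logits = [], []
    for k in range(K):
        # Joint projection of directional latent and class embedding.
        cond = np.concatenate([z_dir[k], class_emb[y]])
        gamma, beta = np.split(cond @ W_affine[k], 2)
        order = scan_order(k, h, w)
        # Feature-wise affine modulation of the scanned token sequence.
        seq = (1.0 + gamma) * tokens[order] + beta
        # Stand-in for the Mamba SSM (assumption): causal running mean along the scan.
        seq = np.cumsum(seq, axis=0) / np.arange(1, h * w + 1)[:, None]
        out = np.empty_like(seq)
        out[order] = seq  # "unscan": put tokens back at their grid positions
        outs.append(out)
        logits.append(cond @ W_route[k])
    logits = np.array(logits)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()  # directional weights sum to 1
    mixed = sum(wk * o for wk, o in zip(weights, outs))
    return mixed.reshape(h, w, c), weights


# Usage: split one latent into K direction-specific subvectors and run the block.
z_dir = rng.normal(size=K * D_DIR).reshape(K, D_DIR)
x = rng.normal(size=(H, W, C))
y_out, weights = dlr_block(x, z_dir, y=1)
```

In this sketch the routing weights depend on both the class label and the directional latent, so changing either one shifts how much each scan direction contributes to the mixed feature map.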