Towards Kinetic Manipulation of the Latent Space

Diego Porres

TL;DR

This work shows that a simple feature extraction of pre-trained Convolutional Neural Networks from a live RGB camera feed does a very good job at manipulating the latent space with simple changes in the scene, with vast room for improvement.

Abstract

The latent space of many generative models is rich in unexplored valleys and mountains. The majority of tools used for exploring it are so far limited to Graphical User Interfaces (GUIs). While specialized hardware can be used for this task, we show that a simple feature extraction of pre-trained Convolutional Neural Networks (CNNs) from a live RGB camera feed does a very good job at manipulating the latent space with simple changes in the scene, with vast room for improvement. We name this new paradigm Visual-reactive Interpolation, and the full code can be found at https://github.com/PDillis/stylegan3-fun.

Paper Structure (12 sections, 2 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Test 1: Visual-reactive live demo using a StyleGAN2 model ($512\times512$ resolution) trained on urban scenes from the A2D2 dataset (Geyer et al., 2020). We use style mixing with a static latent, and the encoded camera image (resolution $426\times320$) provides the coarse and middle noise scales. Note the video control above does not work in browsers, but works fine with Adobe Acrobat. Click https://drive.google.com/file/d/1BrC3PulFpdtBdM6p97MaGN43jU4EOBoo/view?usp=sharing for an online version of the video.
  • Figure 2: Test 1 pipeline. A frame $x$ of a scene is captured with a camera $C$ and fed to the feature extractor $F$. We select $F$ to contain $L$ convolutional layers. At layer $l$, the representation of $x$ is $F^{l}$. We pass this representation through our selected function $g$, turning it into a latent vector $z_{\text{fake}}$, which is then passed to the (frozen) mapping network $f$ of the Generator. To perform style mixing, we can use a static latent $z_{\text{static}}$ to produce a final disentangled latent vector $w$. Together with the truncation parameter $\psi$, $w$ is used by $G$ to generate the final synthesized image $s$. Note that $\psi$ could also be influenced or controlled via the encoded scene.
  • Figure 3: Test 2: Live manipulation of the learned constant in StyleGAN2. Note the video control above does not work in browsers, but works fine with Adobe Acrobat. Click https://drive.google.com/file/d/1mKrq7Q0CSoQRqBcONl1DttJwgyKIs-Ls/view?usp=sharing for an online version of the video.
  • Figure 4: Test 2: Manipulating the learned affine transformation matrix in StyleGAN3. Note the video control above does not work in browsers, but works fine with Adobe Acrobat. Click https://drive.google.com/file/d/1msCAXIIYHU4u_uAjpsJCTzYdcEygPnQ-/view?usp=sharing for an online version of the video.
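
The pipeline described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation in the linked repository: the pooling-based `g`, the single-layer stand-in for the mapping network `f`, the latent size of 512, the 16 style layers, and the cutoff of 8 layers for "coarse and middle" are all hypothetical choices, and the feature map is random data standing in for $F^{l}$. The generator $G$ itself is omitted, since producing $s$ requires a trained network.

```python
import numpy as np

Z_DIM = 512      # latent size (assumed; a common StyleGAN default)
NUM_WS = 16      # number of per-layer styles (assumed for a 512x512 model)
MIX_CUTOFF = 8   # layers [0, 8) come from the camera latent (assumed split)


def g(feature_map: np.ndarray, z_dim: int = Z_DIM) -> np.ndarray:
    """Hypothetical g: global-average-pool a (C, H, W) feature map,
    repeat or crop it to z_dim entries, then standardize it."""
    pooled = feature_map.mean(axis=(1, 2))   # shape (C,)
    z = np.resize(pooled, z_dim)             # repeats or truncates to z_dim
    return (z - z.mean()) / (z.std() + 1e-8)


def make_mapping(rng, z_dim: int = Z_DIM, num_ws: int = NUM_WS):
    """Stand-in for the frozen mapping network f: one fixed random linear
    layer with a leaky ReLU, broadcast to num_ws style layers."""
    weight = rng.standard_normal((z_dim, z_dim)) / np.sqrt(z_dim)

    def f(z: np.ndarray) -> np.ndarray:
        h = z @ weight
        w = np.where(h > 0, h, 0.2 * h)
        return np.tile(w, (num_ws, 1))       # shape (num_ws, z_dim)

    return f


def style_mix(w_fake, w_static, cutoff: int = MIX_CUTOFF):
    """Take the first `cutoff` layers from w_fake, the rest from w_static."""
    w = w_static.copy()
    w[:cutoff] = w_fake[:cutoff]
    return w


def truncate(w, w_avg, psi: float):
    """Truncation trick: interpolate between the average latent and w."""
    return w_avg + psi * (w - w_avg)


rng = np.random.default_rng(0)
f = make_mapping(rng)

feature_map = rng.standard_normal((256, 20, 27))  # random stand-in for F^l(x)
z_fake = g(feature_map)
z_static = rng.standard_normal(Z_DIM)

w_fake, w_static = f(z_fake), f(z_static)
mixed = style_mix(w_fake, w_static)
w_avg = np.zeros(Z_DIM)  # placeholder; a trained model tracks a mean of w
w = truncate(mixed, w_avg, psi=0.7)               # would be passed to G
```

In a live setting, only `feature_map` changes from frame to frame, so the scene drives the coarse and middle layers while `z_static` fixes the remaining ones. Making `psi` a function of the encoded frame would be one way to realize the caption's remark that $\psi$ could also be controlled by the scene.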