Towards Kinetic Manipulation of the Latent Space

Diego Porres

TL;DR

This work shows that features extracted by a pre-trained Convolutional Neural Network from a live RGB camera feed can manipulate the latent space effectively through simple changes in the scene, and that the approach still has vast room for improvement.

Abstract

The latent spaces of many generative models are rich in unexplored valleys and mountains. Most tools for exploring them are so far limited to Graphical User Interfaces (GUIs). While specialized hardware can be used for this task, we show that features extracted by pre-trained Convolutional Neural Networks (CNNs) from a live RGB camera feed can manipulate the latent space effectively through simple changes in the scene, with vast room for improvement. We name this new paradigm Visual-reactive Interpolation; the full code can be found at https://github.com/PDillis/stylegan3-fun.

Paper Structure (12 sections, 2 equations, 4 figures, 1 table)

Motivation and Background
Latent Space Interaction
Visual-reactive Interpolation
Test 1: Visual Encoding and Style Mixing
User Feedback
Test 2: Manipulation of Learned Constants
User Feedback
Future Work
Feature extraction
Transformation into $\mathcal{Z}$
Layer selection
Image Synthesis

Figures (4)

Figure 1: Test 1: Visual-reactive live demo using a StyleGAN2 model ($512\times512$ resolution) trained on urban scenes from the A2D2 dataset (Geyer et al., 2020). We use style mixing with a static latent: the encoded camera image (resolution $426\times320$) supplies the coarse and middle noise scales. An online version of the video is available at https://drive.google.com/file/d/1BrC3PulFpdtBdM6p97MaGN43jU4EOBoo/view?usp=sharing.
Figure 2: Test 1 pipeline. A frame $x$ of a scene is captured with a camera $C$ and fed to the feature extractor $F$, which we select to contain $L$ convolutional layers. At layer $l$, the representation of $x$ is $F^{l}$. We pass this representation through our selected function $g$, turning it into a latent vector $z_{\text{fake}}$, which is then passed to the (frozen) mapping network $f$ of the Generator. To perform style mixing, we can use a static latent $z_{\text{static}}$ to produce a final disentangled latent vector $w$. Together with the truncation parameter $\psi$, $w$ is used by $G$ to generate the final synthesized image $s$. Note that $\psi$ could also be influenced or controlled via the encoded scene.
Figure 3: Test 2: Live manipulation of the learned constant in StyleGAN2. An online version of the video is available at https://drive.google.com/file/d/1mKrq7Q0CSoQRqBcONl1DttJwgyKIs-Ls/view?usp=sharing.
Figure 4: Test 2: Manipulating the learned affine transformation matrix in StyleGAN3. An online version of the video is available at https://drive.google.com/file/d/1msCAXIIYHU4u_uAjpsJCTzYdcEygPnQ-/view?usp=sharing.
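
The data flow in the Figure 2 caption can be illustrated with a minimal, standard-library-only sketch. It is not the code from the stylegan3-fun repository: every function name here is hypothetical, the choice of $g$ (global average pooling followed by standardization and tiling) is an assumption, the mapping network $f$ is replaced by a stub that simply copies its input to each layer, and treating the first 8 of 16 layers as the "coarse and middle" scales is likewise an assumed split. The truncation step uses the usual StyleGAN form $w' = \bar{w} + \psi\,(w - \bar{w})$.

```python
import math
import random


def g_pool_to_latent(feature_map, z_dim):
    """Hypothetical g: average each channel, standardize, tile to length z_dim."""
    # feature_map is a list of channels; each channel is a 2D list (rows of floats).
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    mean = sum(pooled) / len(pooled)
    var = sum((p - mean) ** 2 for p in pooled) / len(pooled)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero on a constant feature map
    normed = [(p - mean) / std for p in pooled]
    return [normed[i % len(normed)] for i in range(z_dim)]


def mapping_stub(z, num_ws):
    """Stand-in for the frozen mapping network f: one copy of z per synthesis layer."""
    return [list(z) for _ in range(num_ws)]


def style_mix(ws_camera, ws_static, cutoff):
    """Take layers [0, cutoff) from the camera latent and the rest from the static one."""
    return [list(w) for w in ws_camera[:cutoff]] + [list(w) for w in ws_static[cutoff:]]


def truncate(ws, w_avg, psi):
    """Truncation trick: pull every layer's w toward w_avg by a factor psi."""
    return [[a + psi * (x - a) for x, a in zip(w, w_avg)] for w in ws]


# Toy run with made-up sizes (assumptions, not values from the paper).
random.seed(0)
Z_DIM, NUM_WS, CUTOFF = 8, 16, 8
features = [[[random.random() for _ in range(4)] for _ in range(4)] for _ in range(6)]
z_fake = g_pool_to_latent(features, Z_DIM)            # stands in for g(F^l)
z_static = [random.gauss(0.0, 1.0) for _ in range(Z_DIM)]
ws = style_mix(mapping_stub(z_fake, NUM_WS), mapping_stub(z_static, NUM_WS), CUTOFF)
ws_truncated = truncate(ws, [0.0] * Z_DIM, psi=0.7)   # would be passed on to G
```

In a real setup, `features` would come from a pre-trained CNN applied to the camera frame and `mapping_stub` would be the generator's own mapping network; the sketch only shows how the pieces connect.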
