Unsupervised Learning of Disentangled Representations from Video

Remi Denton, Vighnesh Birodkar

TL;DR

The paper tackles unsupervised learning of video representations by disentangling each frame's content (time-invariant) from its pose (time-varying) using an adversarial loss. The DrNET framework combines two encoders (content and pose), a decoder, and a scene discriminator to separate these latent factors, which enables long-range frame prediction with a standard LSTM operating in the latent space. The paper shows that content features capture semantic information while pose features encode dynamics, with applications to both future-frame generation and classification on diverse datasets (MNIST, NORB, SUNCG, KTH). Despite its simplicity, DrNET achieves results competitive with or superior to state-of-the-art baselines on several tasks, and the authors release code to facilitate adoption and further research.

Abstract

We present a new model, DrNET, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. We evaluate our approach on a range of synthetic and real videos, demonstrating the ability to coherently generate hundreds of steps into the future.
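
The prediction scheme described in the abstract can be sketched as a simple rollout loop: only the time-varying (pose) vector is predicted at each step, while the stationary (content) vector is held fixed. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `rollout`, `predict_next_pose`, `decode`, `constant_velocity`, and `concat_decoder` are hypothetical, and a constant-velocity rule stands in for the LSTM so the example runs with the standard library alone.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def rollout(
    observed_poses: Sequence[Vector],
    content: Vector,
    predict_next_pose: Callable[[Sequence[Vector]], Vector],
    decode: Callable[[Vector, Vector], Vector],
    steps: int,
) -> List[Vector]:
    """Generate `steps` future frames by predicting pose only; content stays fixed."""
    history = [list(p) for p in observed_poses]
    frames = []
    for _ in range(steps):
        next_pose = predict_next_pose(history)  # stand-in for one LSTM step
        history.append(next_pose)               # feed the prediction back in
        frames.append(decode(content, next_pose))
    return frames


# Toy stand-ins (assumptions, not part of the paper):
def constant_velocity(history: Sequence[Vector]) -> Vector:
    """Extrapolate the last pose change; replaces the learned LSTM predictor."""
    last, prev = history[-1], history[-2]
    return [2 * a - b for a, b in zip(last, prev)]


def concat_decoder(content: Vector, pose: Vector) -> Vector:
    """Concatenate content and pose; replaces the learned frame decoder."""
    return list(content) + list(pose)


frames = rollout(
    observed_poses=[[0.0], [1.0]],
    content=[7.0],
    predict_next_pose=constant_velocity,
    decode=concat_decoder,
    steps=3,
)
```

In this toy run the content component of every generated frame stays the same while the pose component advances, which is the property that lets predictions remain coherent over long horizons.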

Paper Structure

This paper contains 14 sections, 6 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Left: The discriminator $C$ is trained with binary cross entropy (BCE) loss to predict if a pair of pose vectors comes from the same (top portion) or different (lower portion) scenes. $x_i$ and $x_j$ denote frames from different sequences $i$ and $j$. The frame offset $k$ is sampled uniformly in the range $[0,K]$. Note that when $C$ is trained, the pose encoder $E_p$ is fixed. Right: The overall model, showing all terms in the loss function. Note that when the pose encoder $E_p$ is updated, the scene discriminator is held fixed.
  • Figure 2: Generating future frames by recurrently predicting $h_p$, the latent pose vector.
  • Figure 3: Left: Demonstration of content/pose factorization on held out MNIST examples. Each image in the grid is generated using the pose and content vectors $h_p$ and $h_c$ taken from the corresponding images in the top row and first column respectively. The model has clearly learned to disentangle content and pose. Right: Each row shows forward modeling up to 500 time steps into the future, given 5 initial frames. For each generation, note that only the pose part of the representation is being predicted from the previous time step (using an LSTM), with the content vector being fixed from the 5th frame. The generations remain crisp despite the long-range nature of the predictions.
  • Figure 4: Left: Factorization examples using our DrNET model on held out NORB images. Each image in the grid is generated using the pose and content vectors $h_p$ and $h_c$ taken from the corresponding images in the top row and first column respectively. Further examples can be found in the supplemental material. Center: Examples where DrNET was trained without the adversarial loss term. Note how content and pose are no longer factorized cleanly: the pose vector now contains content information which ends up dominating the generation. Right: Factorization examples from Mathieu et al. (2016).
  • Figure 5: Left: Examples of linear interpolation in pose space between the examples $x_1$ and $x_2$. Right: Factorization examples on held out images from the SUNCG dataset. Each image in the grid is generated using the pose and content vectors $h_p$ and $h_c$ taken from the corresponding images in the top row and first column respectively. Note how, even for complex objects, the model is able to rotate them accurately.
  • ...and 7 more figures
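
The discriminator objective described in the Figure 1 caption can be sketched as follows: $C$ receives a pair of pose vectors and is trained with BCE to output 1 for a same-scene pair and 0 for a different-scene pair, with the frame offset $k$ sampled uniformly from $[0,K]$. This is a hedged sketch, not the paper's code. The function `toy_discriminator` is a hypothetical distance-based stand-in for the learned network $C$, and the pose vectors are assumed to come from a fixed pose encoder $E_p$, as the caption specifies for this training step.

```python
import math
import random


def bce(prob, target, eps=1e-12):
    """Binary cross entropy for one predicted probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def toy_discriminator(pose_a, pose_b):
    """Hypothetical stand-in for C: maps a pair of pose vectors to P(same scene)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(pose_a, pose_b))
    return math.exp(-sq_dist)  # in (0, 1]; closer pose vectors give higher probability


def sample_offset(max_offset, rng):
    """Frame offset k drawn uniformly from the integers in [0, K]."""
    return rng.randint(0, max_offset)


def discriminator_loss(same_pair, diff_pair, disc=toy_discriminator):
    """BCE over one same-scene pair (target 1) and one different-scene pair (target 0)."""
    return bce(disc(*same_pair), 1.0) + bce(disc(*diff_pair), 0.0)


rng = random.Random(0)
k = sample_offset(5, rng)  # offset between the two frames of a pair

# Pose vectors would come from the (fixed) pose encoder E_p; here they are made up.
loss = discriminator_loss(
    same_pair=([0.0, 0.0], [0.1, 0.0]),  # two frames of sequence i
    diff_pair=([0.0, 0.0], [3.0, 3.0]),  # one frame each from sequences i and j
)
```

A low loss here means the pose vectors still reveal which scene they came from. Per the caption, the pose encoder is updated in a separate step with the discriminator held fixed.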