Sequential Representation Learning via Static-Dynamic Conditional Disentanglement

Mathieu Cyrille Simon, Pascal Frossard, Christophe De Vleeschouwer

TL;DR

This work tackles unsupervised sequential disentanglement by formalizing video factors as a time-invariant static part $\mathbf{s}$ and time-varying dynamics $\mathbf{d}_{1:T}$, while allowing causal dependencies between them. It introduces a conditional normalizing flow (cNF) architecture that models $p(\mathbf{x}_{1:T}|\mathbf{f})$, with a static code $\mathbf{f}$ and dynamic codes $\boldsymbol{\lambda}_{1:T}=\mathbf{h}^{-1}(\mathbf{x}_{1:T},\mathbf{f})$, and enforces disentanglement via a simple shuffle constraint on $\mathbf{f}_{1:T}$ in the ELBO. The paper gives sufficient identifiability conditions (Prop. 1 and Prop. 2) under which the learned codes reparametrize the ground-truth factors, and shows (Prop. 3) that disentanglement is guaranteed without extra losses. Empirically, the method matches state-of-the-art performance on datasets with independent static and dynamic factors and outperforms baselines on datasets where the dynamics depend on the static content. The formulation operates on sequences of frame features rather than on anything specific to video, so it could in principle extend to other sequential modalities.
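
For reference, a conditional flow of this kind is typically trained through the change-of-variables identity. The expression below is the standard identity written with the symbols used above, not a formula quoted from the paper. The prior $p(\boldsymbol{\lambda}_{1:T})$ is left generic, and taking it to be independent of $\mathbf{f}$ is an assumption, since this summary does not specify it.

$$
\log p(\mathbf{x}_{1:T}\mid\mathbf{f}) = \log p\big(\boldsymbol{\lambda}_{1:T}\big) + \log\left|\det\frac{\partial\,\mathbf{h}^{-1}(\mathbf{x}_{1:T},\mathbf{f})}{\partial\,\mathbf{x}_{1:T}}\right|, \qquad \boldsymbol{\lambda}_{1:T}=\mathbf{h}^{-1}(\mathbf{x}_{1:T},\mathbf{f}).
$$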

Abstract

This paper explores self-supervised disentangled representation learning within sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between those factors by explicitly accounting for the causal relationship between the static and dynamic variables, and that improves model expressivity through additional Normalizing Flows. A formal definition of the factors is proposed. This formalism leads to the derivation of sufficient conditions for the ground-truth factors to be identifiable, and to the introduction of a novel, theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into our new framework. The experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.

Paper Structure (10 sections, 10 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Dynamic generation results for the c-dSprites (A), MPI3D (B) and LPCSprites (C) datasets. The sequences are generated by fixing the static code to the value given by the left image and sampling the dynamic variables from the prior. For each dataset, the first row corresponds to samples from the competing CDSVAE model (Bai et al., 2021) and the second row to samples from our proposed model.
  • Figure 2: Static/dynamic swap results for the MUG dataset. Odd rows are test input sequences while even rows are sequences generated by swapping the static and dynamic codes of the test sequences using our model.
  • Figure 3: Schematic view of the proposed model. A pretrained convolutional encoder embeds each frame separately into the frame latent space, giving $\mathbf{x}_{1:T}$. These vectors are fed to the static encoder, which estimates a static code from each individual frame. The per-frame estimates are then aggregated into the static code $\mathbf{f}$. The static code conditions a Conditional Normalizing Flow that models the likelihood of the frame feature vectors $\mathbf{x}_{1:T}$. The transformed vectors $\boldsymbol{\lambda}_{1:T}$ correspond to the dynamic codes. The model is trained using the loss $\mathcal{L}$ from the paper's final objective equation.
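
The data flow described in the Figure 3 caption, and the swap shown in Figure 2, can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the weights are random stand-ins for learned networks, the per-frame static estimates are aggregated by a mean, the flow is a single conditional affine layer, and the prior on the dynamic codes is a standard normal. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_X, D_F = 8, 4, 3  # frames, frame-feature size, static-code size (illustrative)

# Random stand-ins for learned parameters (assumption: real models learn these).
W_static = rng.normal(size=(D_X, D_F))
W_scale = 0.1 * rng.normal(size=(D_F, D_X))
W_shift = rng.normal(size=(D_F, D_X))


def static_code(x):
    """Estimate a static code from each frame, then aggregate over time.

    Mean aggregation is an assumption; the caption only says "aggregated".
    """
    per_frame = np.tanh(x @ W_static)  # shape (T, D_F)
    return per_frame.mean(axis=0)      # shape (D_F,)


def to_dynamic(x, f):
    """Map frame features to dynamic codes with an affine flow conditioned on f."""
    log_scale, shift = f @ W_scale, f @ W_shift  # each of shape (D_X,)
    lam = (x - shift) * np.exp(-log_scale)
    # Log-determinant of the Jacobian of x -> lam, summed over all frames.
    log_det = -x.shape[0] * log_scale.sum()
    return lam, log_det


def to_frames(lam, f):
    """Generative direction: dynamic codes plus static code -> frame features."""
    log_scale, shift = f @ W_scale, f @ W_shift
    return lam * np.exp(log_scale) + shift


def log_likelihood(x, f):
    """log p(x | f) by change of variables, with a standard normal prior on lam."""
    lam, log_det = to_dynamic(x, f)
    log_prior = -0.5 * np.sum(lam**2 + np.log(2.0 * np.pi))
    return log_prior + log_det


# Two toy "sequences" of frame features.
x_a = rng.normal(size=(T, D_X))
x_b = rng.normal(size=(T, D_X))
f_a, f_b = static_code(x_a), static_code(x_b)
lam_a, _ = to_dynamic(x_a, f_a)

# The flow is invertible: decoding with the same static code recovers the input.
assert np.allclose(to_frames(lam_a, f_a), x_a)

# Static/dynamic swap: dynamics of sequence A rendered with the static code of B.
x_swap = to_frames(lam_a, f_b)
print(x_swap.shape, log_likelihood(x_a, f_a))
```

Because the dynamic codes are obtained by an invertible map, no information about the frame features is lost for a given static code. Swapping then amounts to decoding one sequence's dynamic codes under another sequence's static code.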