Sequential Representation Learning via Static-Dynamic Conditional Disentanglement
Mathieu Cyrille Simon, Pascal Frossard, Christophe De Vleeschouwer
TL;DR
This work tackles unsupervised sequential disentanglement by formalizing video factors into a time-invariant static part $\mathbf{s}$ and time-varying dynamics $\mathbf{d}_{1:T}$, while allowing causal dependencies between them. It introduces a conditional normalizing flow (cNF) architecture that models $p(\mathbf{x}_{1:T}|\mathbf{f})$, with a static code $\mathbf{f}$ and dynamic codes $\boldsymbol{\lambda}_{1:T}=\mathbf{h}^{-1}(\mathbf{x}_{1:T},\mathbf{f})$, and enforces disentanglement via a simple shuffle constraint on $\mathbf{f}_{1:T}$ in the ELBO. The method provides sufficient identifiability conditions (Prop.1 and Prop.2) ensuring that learned codes reparametrize the ground-truth factors, and further demonstrates a provable disentanglement (Prop.3) without extra losses. Empirically, it matches state-of-the-art performance on datasets with independent static/dynamic factors and significantly outperforms baselines on datasets with dependent dynamics, highlighting the method's robustness to complex causal relationships. The approach is modality-free, scalable, and readily extends to other domains beyond video, offering a principled path toward reliable, interpretable sequential representations.
Abstract
This paper explores self-supervised disentangled representation learning within sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between those factors by explicitly accounting for the causal relationship between the static/dynamic variables and that improves the model expressivity through additional Normalizing Flows. A formal definition of the factors is proposed. This formalism leads to the derivation of sufficient conditions for the ground truth factors to be identifiable, and to the introduction of a novel theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into our new framework. The experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.
