Sequential Disentanglement by Extracting Static Information From A Single Sequence Element

Nimrod Berman, Ilan Naiman, Idan Arbiv, Gal Fadlon, Omri Azencot

TL;DR

The paper tackles unsupervised sequential disentanglement by mitigating information leakage between static and dynamic factors. It introduces a posterior that conditions the static factor on a single sequence element, and it uses a subtraction-based architectural bias to remove static content from the dynamic path. The resulting objective is simpler and needs no mutual-information (MI) terms. The approach achieves state-of-the-art or competitive results on video, audio, and time-series benchmarks for both generation and prediction tasks, and targeted evaluations show reduced information leakage. The method is data- and modality-agnostic, with a lightweight objective and architecture, and it improves downstream performance without MI penalties. This has practical implications for controllable generation, robust representation learning, and cross-domain applications.

Abstract

One of the fundamental representation learning tasks is unsupervised sequential disentanglement, where latent codes of inputs are decomposed into a single static factor and a sequence of dynamic factors. To extract this latent information, existing methods condition the static and dynamic codes on the entire input sequence. Unfortunately, these models often suffer from information leakage, i.e., the dynamic vectors encode both static and dynamic information, or vice versa, leading to a non-disentangled representation. Attempts to alleviate this problem by reducing the dynamic dimension and adding auxiliary loss terms have had only partial success. Instead, we propose a novel and simple architecture that mitigates information leakage through a simple and effective subtraction inductive bias while conditioning the static factor on a single sample. Remarkably, the resulting variational framework is simpler in terms of required loss terms, hyperparameters, and data augmentation. We evaluate our method on multiple data-modality benchmarks, including general time series, video, and audio, and we show results that surpass the state of the art on generation and prediction tasks in comparison to several strong baselines.
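
The two ideas named in the abstract, reading the static factor from a single sequence element and subtracting it from the dynamic path, can be illustrated with a toy sketch. This is not the paper's model, which is variational and learned. The linear encoder `W`, the function `split_static_dynamic`, and the choice of the first element (`t_static=0`) are assumptions made here for illustration only.

```python
import numpy as np

def split_static_dynamic(x, W, t_static=0):
    """Toy split: static feature from one element, dynamics by subtraction.

    x: (T, n) sequence; W: (n, d) hypothetical linear encoder weights.
    """
    h = x @ W          # per-element features, shape (T, d)
    s = h[t_static]    # static feature read from a single element, shape (d,)
    dyn = h - s        # broadcast subtraction removes s from every time step
    return s, dyn

rng = np.random.default_rng(0)
T, n, d = 8, 5, 4
W = rng.normal(size=(n, d))
motion = rng.normal(size=(T, n))      # time-varying part, shared by both sequences
x_a = rng.normal(size=n) + motion     # content A plus the shared motion
x_b = rng.normal(size=n) + motion     # content B plus the shared motion

s_a, dyn_a = split_static_dynamic(x_a, W)
s_b, dyn_b = split_static_dynamic(x_b, W)

# With a linear encoder, the time-invariant content cancels in the subtraction,
# so both sequences yield the same dynamic features.
same_dynamics = np.allclose(dyn_a, dyn_b)
```

In this linear toy setting the cancellation is exact. With a learned nonlinear encoder the subtraction acts only as an inductive bias, not as a guarantee.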

Paper Structure (62 sections, 10 equations, 16 figures, 11 tables)


Figures (16)

  • Figure 1: Our network is composed of an encoder (left), a decoder (right), and two paths in between for computing the static factor (top) and the dynamic components (bottom). For full architecture details, see App. \ref{app:architecture}.
  • Figure 2: t-SNE plots on MUG dataset of the latent static and dynamic factors. Latent static codes, colored by subject identity (left), and latent dynamic codes, colored by dynamic attribute (right).
  • Figure 3: t-SNE visualization of static features from the Air Quality dataset, depicting the model's ability to distinguish days based on precipitation, irrespective of season. Each point represents a day, colored by season and scaled by the amount of rain, illustrating the model's nuanced clustering of dry and wet days within the static seasonal context.
  • Figure 4: Two qualitative examples of swapping between source and target sequences. A is the source, B is the target, C is the result when the static factor is swapped from source to target, and D is the result when the dynamics are swapped. See details in Sec. \ref{subsec:qualitative_eval}.
  • Figure 5: Qualitative example of swapping between source and target sequences. A is the source, B is the target, C (E) is the result when the static factor is swapped from source to target, and D (F) is the result when the dynamics are swapped. C and D are the swaps produced by SPYL, and E and F are the swaps produced by our method. We observe in D that the identity changes when the dynamics are transferred. See more in Sec. \ref{subsec:qualitative_eval}, Fig. \ref{fig:swap_comp_1} and Fig. \ref{fig:swap_comp_2}.
  • ...and 11 more figures
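
The swap experiments described in the captions of Figures 4 and 5 recombine the static code of one sequence with the dynamic codes of another. A minimal sketch of that recombination follows, under strong assumptions: `encode`, `decode`, and `swap` are hypothetical toy functions (the static code is simply the first element, and the dynamics are the residual), not the paper's learned networks. The mapping of the two outputs to the panel labels C and D is not asserted here.

```python
import numpy as np

def encode(x):
    """Hypothetical toy encoder: static code = first element, dynamics = residual."""
    s = x[0]
    return s, x - s

def decode(s, dyn):
    """Hypothetical toy decoder: add the static code back to every time step."""
    return s + dyn

def swap(source, target):
    """Recombine static and dynamic codes across two sequences."""
    s_src, d_src = encode(source)
    s_tgt, d_tgt = encode(target)
    static_from_source = decode(s_src, d_tgt)    # source static, target dynamics
    dynamics_from_source = decode(s_tgt, d_src)  # target static, source dynamics
    return static_from_source, dynamics_from_source

rng = np.random.default_rng(1)
source = rng.normal(size=(6, 3))   # (T, n) toy sequences
target = rng.normal(size=(6, 3))
static_from_source, dynamics_from_source = swap(source, target)
```

In a well-disentangled model, the first output should keep the source's time-invariant content (for example, identity) while following the target's motion, and the second output should do the reverse.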