DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities
Hedi Zisling, Ilan Naiman, Nimrod Berman, Supasorn Suwajanakorn, Omri Azencot
TL;DR
DiffSDA tackles unsupervised sequential disentanglement across video, audio, and time series with a modal-agnostic diffusion framework that models interdependent static $s_0$ and dynamic $d_0^{1:V}$ latent factors. It combines a new probabilistic formulation with latent diffusion and a single score-matching loss, and uses a VQ-VAE backbone to handle high-resolution data. Experiments across multiple modalities show state-of-the-art disentanglement performance, supported by a new evaluation protocol for video disentanglement and by zero-shot transfer results. The framework supports cross-modal representation learning for downstream tasks and positions diffusion models as a unified tool for complex, real-world sequential data.
Abstract
Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modal-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA combines a new probabilistic model, latent diffusion, and efficient samplers, and is tested under a challenging new evaluation protocol. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.
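To make the idea of a single denoising objective conditioned on one static factor and per-frame dynamic factors concrete, the following is a minimal toy sketch in plain Python. It is not the DiffSDA implementation. The function name `denoising_loss`, the weight `w`, the fixed noise scale `sigma`, and the assumption that each clean frame equals the sum of the static and dynamic factors are all hypothetical, chosen only so that the example is self-contained.

```python
import random


def denoising_loss(frames, s0, d0, w, sigma=0.5, seed=0):
    """Toy single-term denoising loss over a sequence of frames.

    frames: list of V clean frames, each a list of floats.
    s0:     one static factor shared by every frame (hypothetical).
    d0:     V dynamic factors, one per frame (hypothetical).
    w:      scalar weight of a stand-in noise predictor (hypothetical).
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x0, d_tau in zip(frames, d0):
        for x, s, d in zip(x0, s0, d_tau):
            eps = rng.gauss(0.0, 1.0)      # noise added to the clean value
            x_noisy = x + sigma * eps      # forward (noising) step
            # Stand-in predictor: reconstruct the clean value from the
            # static and dynamic factors, then solve for the noise.
            eps_hat = (x_noisy - w * (s + d)) / sigma
            total += (eps - eps_hat) ** 2  # the only loss term
            count += 1
    return total / count


# Toy data: every clean frame is exactly static + dynamic.
s0 = [1.0, -2.0]
d0 = [[0.1 * t, -0.2 * t] for t in range(4)]
frames = [[s + d for s, d in zip(s0, d_tau)] for d_tau in d0]

perfect = denoising_loss(frames, s0, d0, w=1.0)  # factors explain the frames
biased = denoising_loss(frames, s0, d0, w=0.5)   # factors are under-weighted
```

In this toy setting the loss vanishes (up to floating-point error) when the static and dynamic factors fully explain each frame, and it grows when they do not. This is the sense in which one denoising term can, in principle, drive both factors without auxiliary losses.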
