Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations
Cian Eastwood, Julius von Kügelgen, Linus Ericsson, Diane Bouchacourt, Pascal Vincent, Bernhard Schölkopf, Mark Ibrahim
TL;DR
This work addresses representation learning when downstream tasks are unknown by disentangling style attributes from content, rather than discarding them, using structured data augmentations. It introduces an SSL framework with $M{+}1$ embedding spaces, where $\mathcal{Z}_0$ captures content invariant to all $M$ augmentations and each $\mathcal{Z}_m$ ($m = 1, \dots, M$) captures a specific style, trained via a loss that combines alignment and entropy terms across spaces. A causal latent-variable analysis proves identifiability of content and individual style latents under structured augmentation, and experiments on synthetic data and ImageNet demonstrate reliable content/style separation and downstream gains when more style is retained. The results highlight the projector's role and suggest that intentionally preserving style information, rather than discarding it, may yield more universal representations.
Abstract
Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To deal with this, current approaches try to retain some style information by tuning the degree of invariance for a particular task, such as ImageNet object classification. However, prior work has shown that such task-specific tuning can lead to significant performance degradation on other tasks that rely on the discarded style. We instead introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and individual style variables. We empirically demonstrate the benefits of our approach on both synthetic and real-world data.
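
The two conditions in the key idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: the function names (`alignment`, `uniformity`, `structured_loss`) are hypothetical, a squared-distance term stands in for the alignment loss, and a Gaussian-potential uniformity term is used as a common surrogate for entropy over the concatenated spaces. It also assumes the view pair fed to space $m$ has already been constructed so that only the $m$-th augmentation is shared between the two views (and none is shared for space 0), which is one way to make each space invariant to all-but-one augmentation.

```python
import numpy as np


def l2_normalize(z, eps=1e-12):
    """Project each row onto the unit sphere."""
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)


def alignment(za, zb):
    """Mean squared distance between paired embeddings (0 means fully invariant)."""
    return float(np.mean(np.sum((za - zb) ** 2, axis=1)))


def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs.

    Used here as an entropy surrogate (an assumption of this sketch):
    lower values mean the embeddings are more spread out.
    """
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(z.shape[0], k=1)  # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))


def structured_loss(views_a, views_b, weight=1.0):
    """Sum of per-space alignment terms plus one joint entropy surrogate.

    views_a[m], views_b[m]: (N, d) embeddings in space m of the view pair
    built for space m (hypothetical interface; m = 0 is the content space).
    """
    total = 0.0
    per_space = []
    for za, zb in zip(views_a, views_b):
        za, zb = l2_normalize(za), l2_normalize(zb)
        total += alignment(za, zb)  # condition (i): invariance within each space
        per_space.append(za)
    # Condition (ii): the entropy surrogate is computed on the concatenation
    # of all M+1 spaces, so it acts on the joint embedding.
    joint = l2_normalize(np.concatenate(per_space, axis=1))
    return total + weight * uniformity(joint)
```

In this sketch a collapsed representation (all embeddings identical) incurs zero alignment cost but no entropy benefit, while well-spread embeddings lower the joint term. The `weight` parameter, an assumption here, trades off the two.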
