Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations

Cian Eastwood, Julius von Kügelgen, Linus Ericsson, Diane Bouchacourt, Pascal Vincent, Bernhard Schölkopf, Mark Ibrahim

TL;DR

This work tackles the challenge of learning representations with unknown downstream tasks by disentangling style attributes from content through structured data augmentations. It introduces an SSL framework with $M{+}1$ embedding spaces, where $\mathcal{Z}_0$ captures content invariant to all augmentations and each $\mathcal{Z}_m$ captures a specific style, trained via a loss that combines alignment and entropy terms across spaces. A causal latent-variable analysis proves identifiability of content and individual style latents under structured augmentation, and experiments on synthetic data and ImageNet demonstrate reliable content/style separation and downstream gains when more style is retained. The results highlight the projector's role and point to potential for more universal representations by intentionally preserving rather than discarding style information.

Abstract

Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To deal with this, current approaches try to retain some style information by tuning the degree of invariance to some particular task, such as ImageNet object classification. However, prior work has shown that such task-specific tuning can lead to significant performance degradation on other tasks that rely on the discarded style. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and individual style variables. We empirically demonstrate the benefits of our approach on both synthetic and real-world data.
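
The two ingredients named above (invariance within each embedding space and entropy maximization over the joint spaces) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the squared-distance alignment term, the Gaussian-potential uniformity term (a common entropy surrogate from the alignment/uniformity literature), the weight `lam`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Project each row onto the unit sphere."""
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def alignment(z_a, z_b):
    """Mean squared distance between paired embeddings (lower = more invariant)."""
    return float(np.mean(np.sum((z_a - z_b) ** 2, axis=1)))

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs.

    Used here as an assumed surrogate for (negative) entropy:
    it equals 0 for fully collapsed embeddings and decreases as points spread out.
    """
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)  # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq[iu]))))

def structured_loss(pairs, z_joint, lam=1.0):
    """Hypothetical combined objective.

    pairs:   list of (z_a, z_b); entry m holds the two views of pair m
             embedded in space Z_m, each of shape (batch, dim_m).
    z_joint: (batch, sum of dims) concatenation of all spaces for one view.
    lam:     assumed weight on the invariance (alignment) terms.
    """
    align = sum(alignment(l2_normalize(a), l2_normalize(b)) for a, b in pairs)
    return lam * align + uniformity(l2_normalize(z_joint))
```

Minimizing the first term pulls the two views of each pair together within their own space, while minimizing the second spreads the joint embedding out, which is one way to discourage different spaces from encoding the same information.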

Paper Structure (53 sections, 1 theorem, 15 equations, 5 figures, 6 tables)

Key Result

Theorem 4.2

For the data-generating process in eq:data_generating_process, eq:different_style_conditionals, and eq:multiple_pairs, assume that […]. Then $\phi_0$ block-identifies the content $\bm{c}$ in the sense of von Kügelgen et al. (2021), and $\{\phi_m\}_{m=1}^M$ identify and disentangle the style latents $s_m$ in the sense that, for all $m=1, \dots, M$, $\hat{s}_m=\phi_m(\bm{x})=\psi_m(s_m)$ for some invertible $\psi_m$.

Figures (5)

  • Figure 1: Framework overview. Given $M$ atomic transformations like color distortion or rotation (here, $M = 2$), we learn a "content" embedding space ($\mathcal{Z}_0$) that is invariant to all transformations and $M$ "style" embedding spaces ($\mathcal{Z}_1$, $\mathcal{Z}_2$) that are each invariant to all-but-the-$m^{\text{th}}$ atomic transformation. To do so, we construct $M + 1$ transformation pairs $(\bm{t}^m, \bm{t}'^m)$ sharing different transformation parameters and use these to create $M + 1$ transformed image pairs $(\tilde{\bm{x}}^m, \tilde{\bm{x}}'^m)$ sharing different features. After routing each pair to a different space, we: (i) enforce invariance within each space; and (ii) maximize entropy across the joint spaces. The result is $M + 1$ disentangled embedding spaces.
  • Figure 2: Numerical dataset: Recovering only content despite varying embedding sizes. $r^2$ in predicting the ground-truth content $\bm{c}$ and style $\bm{s}$ from the learned embedding $\bm{z}$. For a fixed value of $\lambda$, excess dimensions of $\bm{z}$ are used to capture style. We can prevent this by adapting/increasing $\lambda$. Note: $\dim(\bm{c}) = 5$.
  • Figure 3: ImageNet: Improving downstream performance by keeping more style. We report linear-probe performance on ImageNet and grouped downstream tasks when using both $\bm{z}$ (left, post projector) and $\bm{h}$ (right, pre projector). The 13 downstream tasks are grouped by the information they most depend on: spatial, appearance, or other (neither spatial- nor appearance-dominant). We also report the average over tasks rather than over groups (see the per-task results table, tab:imagenet-full, in the appendix, app:further-results:imagenet). All bars show top-1 accuracy (%) except for those in the Spatial group, which show $r^2$.
  • Figure 4: ColorDSprites: Varying augmentation strengths. Columns show augmentation pairs of the same strength. Note that images are more similar across (a) & (b) than across (c) & (d), in terms of the following style attributes: color, orientation, scale, translation and X-Y position.
  • Figure 5: Comparison with Xiao et al. (2021). Note the differences in data augmentation modules, as well as the embedding spaces in which positives and negatives are compared. In particular, note the number of different images in a given batch, with our framework containing more true negatives. See Xiao et al. (2021) for details on their query-key notation.
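
The pair construction described in the Figure 1 caption can be sketched as follows. This is one reading of the caption, not code from the paper: pair 0 (routed to the content space) shares no transformation parameters, while pair $m$ (routed to style space $\mathcal{Z}_m$) shares only the parameters of the $m^{\text{th}}$ atomic transformation, so that $\mathcal{Z}_m$ can be invariant to everything else. The names `sample_params` and `make_pairs`, and the use of one scalar parameter per transformation, are hypothetical.

```python
import random

def sample_params(names, rng):
    """Hypothetical sampler: one scalar parameter in [0, 1) per atomic transformation."""
    return {name: rng.random() for name in names}

def make_pairs(names, rng):
    """Build M+1 parameter pairs (t^m, t'^m) for M atomic transformations.

    Pair 0 shares no parameters (content space).
    Pair m >= 1 shares only the parameter of names[m - 1] (style space m).
    """
    pairs = []
    for m in range(len(names) + 1):
        t = sample_params(names, rng)
        t_prime = sample_params(names, rng)
        if m > 0:
            shared = names[m - 1]
            t_prime[shared] = t[shared]  # the m-th transformation is held fixed
        pairs.append((t, t_prime))
    return pairs

# Example with M = 2, as in Figure 1.
names = ["color", "rotation"]
pairs = make_pairs(names, random.Random(0))
```

Applying each parameter pair to the same image would then yield the $M + 1$ transformed image pairs, one per embedding space.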

Theorems & Definitions (4)

  • Example 1.1: Color and Rotation
  • Remark 4.1
  • Theorem 4.2: Identifiability
  • proof