S-JEA: Stacked Joint Embedding Architectures for Self-Supervised Visual Representation Learning

Alžběta Manová; Aiden Durrant; Georgios Leontidis

TL;DR

This work empirically shows that representations from a stacked JEA perform on par with those of a traditional JEA of comparable parameter count, and visualises the representation spaces to validate the semantic hierarchies.

Abstract

Self-Supervised Learning (SSL) has recently emerged as a fundamental paradigm for learning image representations and continues to demonstrate high empirical success across a variety of tasks. However, most SSL approaches fail to learn embeddings that capture hierarchical semantic concepts that are separable and interpretable. In this work, we aim to learn highly separable, semantically hierarchical representations by stacking Joint Embedding Architectures (JEAs), where each higher-level JEA takes the representations of the lower-level JEA as input. This results in a representation space that exhibits distinct sub-categories of semantic concepts (e.g., model and colour of vehicles) in higher-level JEAs. We empirically show that representations from stacked JEAs perform on par with those of a traditional JEA of comparable parameter count, and we visualise the representation spaces to validate the semantic hierarchies.

Paper Structure (14 sections, 1 equation, 4 figures, 3 tables, 1 algorithm)

Introduction
Motivation and Contributions
Related Work
Self-Supervised Learning
Learning Hierarchies
Stacked Joint Embedding Architecture
Base JEA: VICReg
Stacking JEAs
Empirical Results
Linear Evaluation
Comparing Deep Vs. Stacked
Projection Head
Semantic Hierarchies
Conclusion and Future Work

Figures (4)

Figure 1: S-JEA: Stacked Joint Embedding Architecture with 2 stacks. A batch of images $X$ is transformed under two sets of transformations ($\tau_1$ and $\tau_2$) to produce two batches of views $V_1$ and $V_2$. The views are input to the first-level encoder $f_\theta$, resulting in representations $Y^{[0]}_1$ and $Y^{[0]}_2$. In the first-level stack, these representations are expanded by $h_\theta$, producing embeddings $Z^{[0]}_1$ and $Z^{[0]}_2$. The second-level stack takes the first-level representations $Y^{[0]}_1$ and $Y^{[0]}_2$ and encodes them with $f_\xi$ to produce representations $Y^{[1]}_1$ and $Y^{[1]}_2$, which are expanded to form embeddings $Z^{[1]}_1$ and $Z^{[1]}_2$. At both levels, the VICReg loss is applied to the embeddings of that stack, and the final loss is a weighted sum of the per-level losses.
Figure 2: Visualisation of Learned Representations on CIFAR-10. t-SNE plots of the CIFAR-10 test set representations from frozen encoders trained by VICReg and by S-JEA at each stack level.
Figure 3: Visualisation of Semantic Sub-Clusters of S-JEA. t-SNE plot of STL-10 test set representations of the frozen pre-trained stacked encoder. The representations shown correspond to the semantic class label 'cars'. Of the three identified sub-clusters, the top one shows image representations of side-view poses of cars, the bottom-right one consists of forward-facing poses, whilst the bottom-left one contains all race cars.
Figure 4: Covariance Loss Term During Training. Covariance loss term for both first level (blue) and higher level stacked encoders (orange) during training. The covariance loss term when weighted down is also presented (grey).
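
The data flow described in the Figure 1 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the encoders and projectors are random single-layer stand-ins rather than trained networks, the name h_xi for the second-level projector and the level_weights argument are hypothetical, and the VICReg coefficients (25, 25, 1) and eps are the defaults reported for VICReg rather than values confirmed by this page.

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss between two (n, d) batches of embeddings."""
    n, d = z1.shape

    # Invariance: mean squared distance between paired embeddings.
    inv = np.mean(np.sum((z1 - z2) ** 2, axis=1))

    def var_term(z):
        # Variance: hinge that pushes each dimension's std towards >= 1.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def cov_term(z):
        # Covariance: squared off-diagonal covariance entries, scaled by d.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    return (sim_w * inv
            + var_w * (var_term(z1) + var_term(z2))
            + cov_w * (cov_term(z1) + cov_term(z2)))

def relu_layer(rng, d_in, d_out):
    # Stand-in encoder (assumption): one random linear map followed by ReLU.
    w = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_in, d_out))
    return lambda x: np.maximum(0.0, x @ w)

def linear_layer(rng, d_in, d_out):
    # Stand-in projector (assumption): one random linear map.
    w = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_in, d_out))
    return lambda x: x @ w

def s_jea_loss(v1, v2, f_theta, h_theta, f_xi, h_xi, level_weights=(1.0, 1.0)):
    """Weighted sum of per-level VICReg losses for a 2-stack S-JEA."""
    # Level 0: views -> representations Y^[0] -> embeddings Z^[0].
    y0_1, y0_2 = f_theta(v1), f_theta(v2)
    z0_1, z0_2 = h_theta(y0_1), h_theta(y0_2)
    # Level 1: the second encoder consumes the level-0 representations.
    y1_1, y1_2 = f_xi(y0_1), f_xi(y0_2)
    z1_1, z1_2 = h_xi(y1_1), h_xi(y1_2)
    l0 = vicreg_loss(z0_1, z0_2)
    l1 = vicreg_loss(z1_1, z1_2)
    return level_weights[0] * l0 + level_weights[1] * l1, (l0, l1)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 48))             # toy flattened "images"
v1 = x + 0.1 * rng.normal(size=x.shape)   # noise stands in for tau_1
v2 = x + 0.1 * rng.normal(size=x.shape)   # noise stands in for tau_2

f_theta, h_theta = relu_layer(rng, 48, 16), linear_layer(rng, 16, 64)
f_xi, h_xi = relu_layer(rng, 16, 16), linear_layer(rng, 16, 64)

total, (l0, l1) = s_jea_loss(v1, v2, f_theta, h_theta, f_xi, h_xi)
print(f"level-0 loss {l0:.3f}, level-1 loss {l1:.3f}, total {total:.3f}")
```

Passing level_weights=(1.0, 0.0) reduces the sketch to a plain single-level VICReg loss, which makes it easy to compare the two settings on the same views.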
