Modulate Your Spectrum in Self-Supervised Learning

Xi Weng; Yunhao Ni; Tengwei Song; Jie Luo; Rao Muhammad Anwer; Salman Khan; Fahad Shahbaz Khan; Lei Huang

TL;DR

Spectral Transformation (ST) is a framework that modulates the spectrum of the embedding and seeks functions beyond whitening that avoid dimensional collapse. Empirical investigations unveil other ST instances capable of preventing collapse.

Abstract

Whitening loss offers a theoretical guarantee against feature collapse in self-supervised learning (SSL) with joint embedding architectures. Typically, it involves a hard whitening approach: the embedding is transformed and the loss is applied to the whitened output. In this work, we introduce Spectral Transformation (ST), a framework that modulates the spectrum of the embedding and seeks functions beyond whitening that can avoid dimensional collapse. We show that whitening is a special instance of ST by definition, and our empirical investigations unveil other ST instances capable of preventing collapse. Additionally, we propose a novel ST instance named IterNorm with trace loss (INTL). Theoretical analysis confirms INTL's efficacy in preventing collapse and in modulating the spectrum of the embedding toward equal eigenvalues during optimization. Our experiments on ImageNet classification and COCO object detection demonstrate INTL's potential in learning superior representations. The code is available at https://github.com/winci-ai/INTL.
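To make the statement that whitening is a special instance of ST concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that ST eigen-decomposes the covariance of a centered embedding and rescales each eigen-direction by a function g of its eigenvalue. The names `spectral_transform` and `whitening` are hypothetical and chosen for illustration only.

```python
import numpy as np

def spectral_transform(Z, g):
    """Apply an (assumed) spectral transformation to a d x m embedding Z.

    Columns of Z are samples. g maps the eigenvalues of the covariance
    to per-direction scaling factors.
    """
    Zc = Z - Z.mean(axis=1, keepdims=True)      # center each dimension
    cov = Zc @ Zc.T / Z.shape[1]                # d x d covariance
    lam, U = np.linalg.eigh(cov)                # eigenvalues and eigenvectors
    Phi = U @ np.diag(g(lam)) @ U.T             # transformation matrix
    return Phi @ Zc

def whitening(lam, eps=1e-12):
    """Whitening as one choice of g: g(lambda) = lambda^(-1/2)."""
    return 1.0 / np.sqrt(lam + eps)

# Correlated toy embedding: 4 dimensions, 2048 samples.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Z = A @ rng.normal(size=(4, 2048))

Z_hat = spectral_transform(Z, whitening)
cov_hat = Z_hat @ Z_hat.T / Z_hat.shape[1]      # close to the identity
```

With this choice of g, every eigenvalue of the transformed covariance is pushed to 1. Passing a different function as `g` gives a different ST instance under the same assumed definition.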

Paper Structure (69 sections, 4 theorems, 41 equations, 9 figures, 13 tables)

Introduction
Related Work
Spectral Transformation beyond Whitening
Preliminary and Notation
Joint embedding architectures.
Feature collapse.
Whitening loss.
Spectral Transformation
Can we unveil the potential of ST?
Spectral Transformation using power functions
Empirical observation.
Intuitive explanation.
Implicit Spectral Transformation using Newton’s Iteration
Iterative Normalization with Trace Loss
Theoretical Analysis
...and 54 more sections

Key Result

Theorem 1

Define a one-variable iterative function $f_T(x)$ satisfying $f_{k+1}(x) = \frac{3}{2} f_k(x) - \frac{1}{2} x f_k^3(x)$ for $k \geq 0$, with $f_0(x) = 1$. The mapping function of IterNorm is $g(\lambda) = f_T\left(\frac{\lambda}{\mathrm{tr}(\Sigma)}\right) / \sqrt{\mathrm{tr}(\Sigma)}$. Without calculating $\lambda$ or $\mathbf{U}$, IterNorm implicitly maps every $\lambda_i \in \lambda(\mathbf{Z})$ to $\widehat{\lambda}_i = \frac{\lambda_i}{\mathrm{tr}(\Sigma)} f_T^2\left(\frac{\lambda_i}{\mathrm{tr}(\Sigma)}\right)$ ...
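The eigenvalue mapping in Theorem 1 can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code. It assumes the standard Newton iteration of IterNorm on the trace-normalized covariance, $P_k = \frac{3}{2} P_{k-1} - \frac{1}{2} P_{k-1}^3 \Sigma_N$ with $P_0 = I$, whose scalar counterpart is $f_{k+1}(x) = \frac{3}{2} f_k(x) - \frac{1}{2} x f_k^3(x)$ with $f_0(x) = 1$. The function names `f_T` and `iternorm_matrix` are hypothetical.

```python
import numpy as np

def f_T(x, T):
    """Scalar Newton iteration (assumed form): f_0 = 1, f_{k+1} = 1.5 f_k - 0.5 x f_k^3."""
    f = np.ones_like(x, dtype=float)
    for _ in range(T):
        f = 1.5 * f - 0.5 * x * f**3
    return f

def iternorm_matrix(cov, T):
    """Matrix Newton iteration on the trace-normalized covariance.

    No eigen-decomposition is used; only matrix products.
    """
    d = cov.shape[0]
    tr = np.trace(cov)
    cov_n = cov / tr                            # eigenvalues now lie in (0, 1)
    P = np.eye(d)
    for _ in range(T):
        P = 1.5 * P - 0.5 * np.linalg.matrix_power(P, 3) @ cov_n
    return P / np.sqrt(tr)

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 64))
cov = B @ B.T / 64                              # toy 5 x 5 covariance
T = 5

Phi = iternorm_matrix(cov, T)
cov_hat = Phi @ cov @ Phi.T                     # covariance after IterNorm

# Eigenvalues are computed here only to verify the implicit mapping.
lam = np.linalg.eigvalsh(cov)
x = lam / np.trace(cov)
predicted = np.sort(x * f_T(x, T) ** 2)         # lambda_hat from the theorem
observed = np.sort(np.linalg.eigvalsh(cov_hat))
```

In this toy setting the transformed eigenvalues `observed` coincide with `predicted`, and they stay in (0, 1], moving toward 1 as `T` grows. The eigen-decomposition appears only in the verification step, never in `iternorm_matrix` itself.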

Figures (9)

Figure 1: The framework using spectral transformation (ST) to modulate the spectrum of embedding in joint embedding architecture for SSL.
Figure 2: Investigate ST using power functions. We choose several $p$ from $0$ to $1.5$. We show (a) the visualization of the toy model output; (b) top-1 and 5-nearest neighbors (5-nn) accuracy on CIFAR-10; (c) condition indicator of embedding $\mathbf{Z}$ and transformed output $\widehat{\mathbf{Z}}$ on CIFAR-10. We use the inverse of the condition number (IoC) in logarithmic scale with base 10 ($\lg \mathrm{IoC} = \lg c^{-1} = \lg \frac{\lambda_d}{\lambda_1}$) as the condition indicator. The results on CIFAR-10 are obtained through training with ResNet-18 for 200 epochs and averaged over five runs, with standard deviation shown as error bars. We show the details of the experimental setup in the Appendix. Similar phenomena can be observed when using other datasets (e.g., ImageNet) and other networks (e.g., ResNet-50).
Figure 3: Investigate the effectiveness of IterNorm with and without trace loss. We train the models on CIFAR-10 with ResNet-18 for 100 epochs. We apply IterNorm with various iteration numbers $T$ and show the results with (solid lines) and without (dashed lines) trace loss. (a) The spectrum of the embedding $\mathbf{Z}$; (b) the spectrum of the transformed output $\widehat{\mathbf{Z}}$; (c) the top-1 accuracy; (d) IterNorm without trace loss suffers from numerical divergence when the iteration number is large, e.g., $T=9$. Notably, when $T \geq 11$ the loss values are all NaN, so the model cannot be trained. Similar phenomena can be observed when using other datasets (e.g., ImageNet) and other networks (e.g., ResNet-50).
Figure 4: Algorithm of INTL, PyTorch-style Pseudocode.
Figure 5: Visualization of our synthetic 2D dataset. We show (a) the distribution of our 2D dataset; (b) the initial output of the toy Siamese network.
...and 4 more figures
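The condition indicator defined in the Figure 2 caption, $\lg \mathrm{IoC} = \lg \frac{\lambda_d}{\lambda_1}$, is straightforward to compute from a spectrum. The sketch below also shows how it responds to a power-function ST. It assumes the power function has the form $g(\lambda) = \lambda^{-p}$, so that each transformed eigenvalue is $\lambda g(\lambda)^2 = \lambda^{1-2p}$. This form is not stated in the text above and is an assumption of the example, as are the helper names `lg_ioc` and `power_st_spectrum`.

```python
import numpy as np

def lg_ioc(eigvals):
    """Base-10 log of the inverse condition number: lg(lambda_min / lambda_max)."""
    eigvals = np.asarray(eigvals, dtype=float)
    return float(np.log10(eigvals.min() / eigvals.max()))

def power_st_spectrum(eigvals, p):
    """Spectrum after an assumed power-function ST with g(lambda) = lambda^(-p).

    Each eigenvalue lambda becomes lambda * g(lambda)^2 = lambda^(1 - 2p).
    """
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals * (eigvals ** (-p)) ** 2

# Hand-picked, ill-conditioned toy spectrum.
lam = np.array([4.0, 1.0, 0.25, 0.01])

indicator = {p: lg_ioc(power_st_spectrum(lam, p)) for p in (0.0, 0.5, 1.0)}
```

Under this assumption, `p = 0` leaves the spectrum unchanged, `p = 0.5` corresponds to whitening and gives an indicator of approximately 0 (a perfectly conditioned output), and `p = 1` inverts the spectrum so the indicator returns to its original value. An indicator closer to 0 means a better-conditioned spectrum.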

Theorems & Definitions (9)

Definition 1
Theorem 1
Proposition 1
Proposition 2
Theorem 2
proof
proof
proof
proof
