Theoretical Foundations of Representation Learning using Unlabeled Data: Statistics and Optimization

Pascal Esser, Maximilian Fleissner, Debarghya Ghoshdastidar

TL;DR

The work surveys the theoretical foundations of representation learning from unlabeled data, covering both reconstruction-based autoencoders (AEs) and self-supervised joint-embedding methods. It analyzes optimization through linear denoising AEs (DAEs) and tensorized AEs, and uses neural tangent kernel (NTK) and spectral-embedding perspectives to characterize learned representations, including cross-moment operators like $\Gamma$. Key results show that linear DAE minimizers converge to $W_* = P_{[k]}(X)(X + A)^{\dagger}$ and that Barlow Twins (BT) and spectral contrastive (SCL) representations align with spectral projections of $\Gamma$, with optimal augmentations derived in an RKHS. The work also provides generalization bounds for self-supervised losses and downstream predictors, and outlines open questions on optimal rates, the role of feature/attention learning, and bridging kernel analyses with deep networks in practical unsupervised representation learning.

Abstract

Representation learning from unlabeled data has been extensively studied in statistics, data science and signal processing with a rich literature on techniques for dimension reduction, compression, multi-dimensional scaling among others. However, current deep learning models use new principles for unsupervised representation learning that cannot be easily analyzed using classical theories. For example, visual foundation models have found tremendous success using self-supervision or denoising/masked autoencoders, which effectively learn representations from massive amounts of unlabeled data. However, it remains difficult to characterize the representations learned by these models and to explain why they perform well for diverse prediction tasks or show emergent behavior. To answer these questions, one needs to combine mathematical tools from statistics and optimization. This paper provides an overview of recent theoretical advances in representation learning from unlabeled data and mentions our contributions in this direction.

Paper Structure

This paper contains 15 sections, 9 theorems, 25 equations, 6 figures, 1 table.

Key Result

Theorem 2.1

Consider Equation eq:DAE_linear in the ridgeless limit $(\lambda\to0)$, and denote $\boldsymbol{W}_* = \boldsymbol{W}_2 \boldsymbol{W}_1$. The global minimizer $\boldsymbol{W}_*$ of Equation eq:DAE_linear converges to $\boldsymbol{W}_* = P_{[k]}(\boldsymbol{X})(\boldsymbol{X} + \boldsymbol{A})^{\dagger}$ …
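
To make the closed form in Theorem 2.1 concrete, the following is a minimal NumPy sketch, not the authors' implementation. It rests on assumptions that the excerpt above does not confirm: that $P_{[k]}(\boldsymbol{X})$ denotes the best rank-$k$ approximation of $\boldsymbol{X}$ (truncated SVD), that $\boldsymbol{X}$ and $\boldsymbol{A}$ are $d \times N$ matrices of clean data and noise with samples as columns, and that $\dagger$ is the Moore-Penrose pseudoinverse. The function names are hypothetical.

```python
import numpy as np


def rank_k_truncation(X, k):
    """Best rank-k approximation of X via truncated SVD.

    Assumption: this is what P_[k](X) means in Theorem 2.1.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def linear_dae_minimizer(X, A, k):
    """Evaluate W_* = P_[k](X) (X + A)^+ for clean data X and noise A (both d x N)."""
    return rank_k_truncation(X, k) @ np.linalg.pinv(X + A)


# Small synthetic example (all sizes are illustrative choices).
rng = np.random.default_rng(0)
d, N, k = 8, 5, 2
X = rng.standard_normal((d, N))        # clean samples as columns
A = 0.1 * rng.standard_normal((d, N))  # additive noise
W = linear_dae_minimizer(X, A, k)

# W maps noisy inputs back towards the rank-k part of the clean data.
reconstruction = W @ (X + A)
```

In this small example $d \ge N$ and the noisy matrix has full column rank, so `W @ (X + A)` reproduces the rank-$k$ truncation of `X` exactly; with more samples than dimensions this is no longer guaranteed.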

Figures (6)

  • Figure 1: Illustration of emergent properties of self-supervised representation learning in a simple setting. The left plot shows the data manifold of an example in $\mathbb{R}^3$, where we assume that data is generated from a union of $2$ classes (two intersecting $2$-dimensional discs). Assume that one has access to 15 labeled samples and $N$ unlabeled samples, with $N$ varying from $10$ to $1000$. The goal is to learn an "informative representation" from the unlabeled data so that a simple linear classifier, trained using only the 15 labeled examples, correctly predicts the class labels for new samples. We consider the case where one learns representations in $\mathbb{R}^2$ using the $N$ unlabeled samples. One could use traditional unsupervised methods like kernel principal component analysis (kernel PCA) scholkopf1998nonlinear, or self-supervised techniques. We specifically consider a joint-embedding approach, VICReg BardesPL22, CabannesKBLB23, where for each unlabeled sample $x\in\mathbb{R}^3$, one generates a random augmented view $x^+$ either by masking each coordinate of $x$ with probability 0.5, or by rotating the sample within the disc that it lies in (the latter is a hypothetical label-dependent augmentation based on the philosophy that rotating images preserves their semantic meaning and the augmented data remains on the same manifold). The middle plot shows the embedding of 500 labeled test samples, where the representation $f:\mathbb{R}^3\to\mathbb{R}^2$ is learned with $N= 1000$ unlabeled examples. The plot shows that $f (\cdot)$ from VICReg, with either augmentation, almost separates the two classes, whereas kernel PCA learns an uninformative representation. The right plot shows the downstream predictive performance of unsupervised representation learning with varying $N$, averaged over $100$ independent runs. The downstream classifier is a linear support vector machine (SVM) trained on the representation $f (\cdot)$ of the 15 labeled samples. We also include two baselines that do not use representation learning: a supervised kernel SVM trained on the 15 labeled samples in $\mathbb{R}^3$, and a semi-supervised self-labeling approach, where the kernel SVM predicts on the available $N$ unlabeled samples and uses the pseudo-labels to update the model. Supervised kernel SVM provides a baseline of 70% accuracy, which does not improve with semi-supervised techniques. Unsupervised representation learning with kernel PCA does not learn "more informative" representations with more unlabeled data, whereas VICReg shows emergent behavior: the downstream classification performance increases with the availability of large amounts of unlabeled, augmented data. The improvement is particularly instructive for VICReg with label-dependent augmentation, whose performance is similar to kernel PCA when there are few unlabeled samples, but improves significantly when more unlabeled data is used.
  • Figure 2: Illustration of two principles for representation learning. (Left) The objective of reconstruction. A data instance is mapped into a lower-dimensional latent space using an encoder function and then mapped back to the original feature space using a reconstruction function. The functions are learned by minimizing the distance of the reconstruction from either the given instance or its augmentation. (Right) The objective of joint embedding. This principle builds on the idea that semantically similar pairs of data instances, usually obtained through data augmentation, should be embedded close to each other in the latent space. Hence, the embedding function is learned by minimizing the distance between the embeddings of augmented pairs of data instances, incorporating additional measures to ensure that trivial embeddings are not learned.
  • Figure 3: Source EsserMSG23. (Left) Illustration of Simpson's paradox Simpson1951stat. The scatter plot shows the dataset of 2-dimensional features for three different species of penguins Penguins. The three clusters, for the different species, and their first principal components are plotted in red, blue and green, respectively. A linear AE or PCA into a $k=1$ dimensional latent space can only recover the principal component of the full dataset (shown in black), but cannot capture the characteristics of the individual species (clusters). Such examples of Simpson's paradox can only be found in non-linear models and real-world applications such as social science and medical science. (Right) Performance of different clustering algorithms on Simpson's paradox data. We consider a synthetic version of Simpson's paradox with noisy samples from two parallel lines in $\mathbb{R}^2$. One can either apply $k$-means++ on the original data (right), or on learned representations $f:\mathbb{R}^2\to\mathbb{R}$. AE recovers the principal component of the full data, which does align with the direction of the clusters. Hence, clustering the representations from AE does not recover the true clusters (middle). EsserMSG23 introduce the tensorized AE, which learns representations for individual clusters and results in better clustering performance.
  • Figure 4: Source abs-2505-24668. Impact of the bottleneck dimension of linear DAEs on generalization error. We plot the test errors of linear DAEs with and without skip connections on CIFAR-10, illustrating how the bottleneck dimension $k$ and the ratio $c=\frac{d}{N}$ jointly influence generalization. For the experiments, each sample was reshaped into a $3072$-dimensional vector, and the rank of the dataset was set to $r=100$ using SVD. Since the dataset has a fixed ambient dimension $d$, our numerical experiments focus on varying the number of training samples $N$. (Left & Middle) The left and middle plots show the denoising error on test data with varying $c = \frac{d}{N}$ for the linear DAEs without and with skip connections, respectively. Both plots demonstrate that the optimal choice of $k$ depends on the over-parameterization ratio $c$, reflecting a distinct bias-variance trade-off in different regimes. (Right) To study the impact of over-parameterization, the right plot is constructed by jointly increasing both $k$ and $c$ in the model without skip connections. In particular, this plot demonstrates that jointly increasing $c$ and the bottleneck dimension $k$ leads to a second peak in the test curve within the overparameterized regime.
  • Figure 5: Source fleissner2025infinite. Numerical evidence of constancy of the NTK under Barlow Twins loss minimization. We verify Theorem 3.1 by training a one-hidden-layer network with tanh activation on the MNIST dataset. We use gradient descent with a learning rate of 0.5, and train until the loss satisfies $\mathcal{L}_{BT} \leq 10^{-5}$. The results are averaged over 10 independent runs. For a fixed sample size $N$, we plot different quantities for varying network width $M$. We then vary $N$ and plot: (a) NTK change until convergence, where we see that as width increases, there is less change in the NTK between initialization and convergence; (b) training epochs until convergence, which shows that the time to convergence remains almost constant with the network width; (c) squared norm of the difference between the representations of the neural network and the corresponding kernel model under the Barlow Twins loss, which validates that one can use the optimal solution of the kernel model (see Section 3.2) as a good approximation for the representation learned by a neural network.
  • ...and 1 more figure
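
The synthetic setting described in the Figure 3 caption (noisy samples from two parallel lines in $\mathbb{R}^2$, embedded into one dimension) can be reproduced in a few lines. The sketch below uses PCA as a stand-in for the linear AE, which the caption treats interchangeably. The line spacing, noise level, and sample counts are illustrative assumptions and not the values used in EsserMSG23.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # samples per line (illustrative choice)

# Two parallel horizontal lines, y = 0 and y = 1, with small Gaussian noise.
t = np.tile(np.linspace(-5.0, 5.0, n), 2)
offsets = np.repeat([0.0, 1.0], n)
labels = np.repeat([0, 1], n)
points = np.column_stack([t, offsets]) + 0.05 * rng.standard_normal((2 * n, 2))

# PCA via SVD of the centered data; rows of Vt are the principal directions.
centered = points - points.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]

# One-dimensional representation along the first principal component.
codes = centered @ pc1


def cluster_gap(values):
    """Absolute difference between the two cluster means of a 1-d projection."""
    return abs(values[labels == 0].mean() - values[labels == 1].mean())


gap_along_pc1 = cluster_gap(codes)           # what a k=1 linear AE/PCA keeps
gap_along_pc2 = cluster_gap(centered @ pc2)  # the direction it discards
```

Here the first principal component runs along the lines, so the two clusters have nearly identical means in the one-dimensional code, while the discarded second direction is the one that separates them. This is consistent with the caption's observation that clustering the AE representation does not recover the true clusters.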

Theorems & Definitions (9)

  • Theorem 2.1: Global minimizer of linear DAEs abs-2505-24668
  • Theorem 2.2: Generalization error of over-parametrized linear DAEs abs-2505-24668
  • Theorem 2.3: Global minimizer and generalization error of linear DAE with Skip Connection abs-2505-24668
  • Theorem 2.4: Parameterization at optimal for TAE EsserMSG23
  • Theorem 3.1: Constancy of the NTK under Barlow Twins loss minimization fleissner2025infinite
  • Theorem 3.2: Optimal representations learned by kernel Barlow Twins model, adapted from SimonKLGFA23
  • Theorem 3.3: Optimal data augmentation for spectral contrastive and Barlow Twins models feigin2024theoretical
  • Theorem 3.4: Generalization error bound for kernel spectral contrastive model, adapted from EsserFG24
  • Theorem 4.1: Generalization error bounds for supervised cross entropy loss in terms of SimCLR loss vanElstG25
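
Theorems 3.1 to 3.3 refer to the Barlow Twins loss without restating it here. For orientation, the sketch below implements the standard form of that loss from the original Barlow Twins method: the cross-correlation matrix between the standardized embeddings of two augmented views is pushed towards the identity. It is an assumption that this matches the variant analyzed in the cited works; the trade-off weight `lam` and the function name are illustrative choices.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-12):
    """Standard Barlow Twins loss for two (N x D) batches of embeddings.

    Assumption: the cited theorems may use a differently scaled variant.
    """
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)

    # D x D cross-correlation matrix between the two views.
    n = z_a.shape[0]
    c = z_a.T @ z_b / n

    # Invariance term: diagonal entries should equal 1.
    on_diag = np.sum((np.diag(c) - 1.0) ** 2)
    # Redundancy-reduction term: off-diagonal entries should vanish.
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)
    return on_diag + lam * off_diag


# Embeddings with decorrelated, unit-variance dimensions.
z = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
loss_identical_views = barlow_twins_loss(z, z)
loss_flipped_views = barlow_twins_loss(z, -z)
```

When both views produce the same decorrelated embeddings, the cross-correlation matrix is the identity and the loss vanishes; flipping the sign of one view makes the diagonal entries $-1$ and the invariance term large.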