Theoretical Foundations of Representation Learning using Unlabeled Data: Statistics and Optimization
Pascal Esser, Maximilian Fleissner, Debarghya Ghoshdastidar
TL;DR
This work surveys the theoretical foundations of representation learning from unlabeled data, covering both reconstruction-based autoencoders and self-supervised joint-embedding methods. It analyzes optimization dynamics through linear, denoising, and tensorized autoencoders (AEs), and uses neural tangent kernel (NTK) and spectral-embedding perspectives to characterize learned representations, including cross-moment operators such as $\Gamma$. Key results show that linear denoising AE (DAE) minimizers converge to $W_* = P_{[k]}(X)(X + A)^{\dagger}$ and that BT/SCL representations align with spectral projections of $\Gamma$, with optimal augmentations derived in a reproducing kernel Hilbert space (RKHS). The work also provides generalization bounds for self-supervised losses and downstream predictors, and outlines open questions on optimal rates, the role of feature and attention learning, and bridging kernel analyses with the deep networks used in practical unsupervised representation learning.
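The closed form $W_* = P_{[k]}(X)(X + A)^{\dagger}$ quoted above can be evaluated numerically. The following is a minimal sketch, not the authors' implementation. It assumes that $P_{[k]}$ denotes the best rank-$k$ approximation obtained by truncated SVD, that $X$ is a $d \times n$ data matrix, and that $A$ is an additive noise matrix of the same shape; the function names, dimensions, and noise scale are hypothetical choices made for illustration.

```python
import numpy as np

def rank_k_projection(M, k):
    """Best rank-k approximation of M via truncated SVD.

    Assumption: this is our reading of the operator P_[k] in the text.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep only the top-k singular triplets.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def linear_dae_minimizer(X, A, k):
    """Evaluate W_* = P_[k](X) (X + A)^+ for clean data X and noise A (both d x n)."""
    return rank_k_projection(X, k) @ np.linalg.pinv(X + A)

# Hypothetical toy setting with more dimensions than samples (d > n).
rng = np.random.default_rng(0)
d, n, k = 20, 10, 3
X = rng.standard_normal((d, n))        # clean data, one sample per column
A = 0.1 * rng.standard_normal((d, n))  # additive noise (illustrative scale)

W_star = linear_dae_minimizer(X, A, k)  # d x d matrix of rank at most k
```

In this toy setting the noisy inputs $X + A$ have full column rank, so $(X + A)^{\dagger}(X + A) = I_n$ and `W_star @ (X + A)` reproduces the rank-$k$ approximation of the clean data. This is a property of the formula under the stated assumptions, not a claim about the conditions under which the cited result holds.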
Abstract
Representation learning from unlabeled data has been extensively studied in statistics, data science, and signal processing, with a rich literature on techniques for dimension reduction, compression, and multi-dimensional scaling, among others. However, current deep learning models use new principles for unsupervised representation learning that cannot be easily analyzed using classical theories. For example, visual foundation models have found tremendous success using self-supervision or denoising/masked autoencoders, which effectively learn representations from massive amounts of unlabeled data. Yet it remains difficult to characterize the representations learned by these models and to explain why they perform well on diverse prediction tasks or show emergent behavior. To answer these questions, one needs to combine mathematical tools from statistics and optimization. This paper provides an overview of recent theoretical advances in representation learning from unlabeled data and highlights our contributions in this direction.
