On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry

Mohammad Tinati, Stephen Tu

Abstract

Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We resolve this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link to the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and, where comparisons are available, obtain substantial improvements in problem-specific factors over prior art.
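
To make the notion of orbit-invariance concrete, the following minimal numerical sketch, in the spirit of the spectral pre-training case study (and of Lemma 4.2 below), is offered as an illustration rather than the paper's construction: a representation learned by truncated SVD is identified only up to an orthogonal rotation of its columns, yet the minimum-norm downstream predictor produces identical test predictions along the entire orbit. All dimensions, the synthetic data, and the helper min_norm_predict are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
d, r, n_pre, n_down, n_test = 20, 5, 500, 100, 50

# Stage 1 (pre-training stand-in): learn a rank-r representation U from unlabeled data.
# U is identified only up to right-multiplication by an orthogonal r x r matrix.
X_pre = rng.standard_normal((n_pre, d)) @ rng.standard_normal((d, d))
_, _, Vt = np.linalg.svd(X_pre, full_matrices=False)
U = Vt[:r].T                                       # d x r: top right-singular directions

Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # an arbitrary element of the symmetry group

# Stage 2 (downstream): minimum-norm least squares on the learned features.
X_down = rng.standard_normal((n_down, d))
y = rng.standard_normal(n_down)
X_test = rng.standard_normal((n_test, d))

def min_norm_predict(U_rep):
    feats = X_down @ U_rep                         # downstream design in the representation
    w = np.linalg.pinv(feats) @ y                  # minimum-norm least-squares coefficients
    return (X_test @ U_rep) @ w                    # predictions on held-out points

pred_U = min_norm_predict(U)
pred_UQ = min_norm_predict(U @ Q)
print(np.max(np.abs(pred_U - pred_UQ)))            # ~1e-13: predictions depend only on the orbit of U

The agreement follows from the pseudoinverse identity $(\Phi Q)^{+} = Q^{\top}\Phi^{+}$ for orthogonal $Q$, the kind of fact collected in Lemma A.1.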

Paper Structure

This paper contains 148 sections, 66 theorems, 515 equations, and 1 figure.

Key Result

Theorem 4.1 (Asymptotic normality on the descriptor manifold, informal)

Assume the smoothness and identifiability conditions on $\Omega_\star$ described above, in addition to the local uniform laws needed for a second-order expansion of $\hat{L}_{\mathrm{pre}}$ around $\Omega_\star$. Define $v_m\coloneqq \log_{\Omega_\star}(\hat{\Omega}_m)\in T_{\Omega_\star}\mathcal{M}$. Then, as the pre-training sample size $m$ grows, $v_m$, suitably rescaled, is asymptotically normal on the tangent space $T_{\Omega_\star}\mathcal{M}$.
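
As a purely illustrative view of the statement's shape, and not of the paper's setting, the sketch below works on the simplest possible manifold, the unit circle: it forms a direction estimate from $m$ noisy samples, maps it into the tangent space at the truth via the Riemannian logarithm, and inspects the rescaled tangent coordinates $\sqrt{m}\,v_m$ over repeated trials. The noise model, the estimator, and the $\sqrt{m}$ scaling are assumptions made for this toy example.

import numpy as np

rng = np.random.default_rng(0)

def log_map_circle(base_angle, angle):
    # Riemannian log map on the unit circle: signed arc length from the base point,
    # i.e. a coordinate for the 1-D tangent space at the base point.
    return np.angle(np.exp(1j * (angle - base_angle)))

def estimate_angle(samples):
    # Circular mean: a simple M-estimator of the underlying direction.
    return np.angle(np.exp(1j * samples).mean())

phi_star = 0.7                 # true descriptor Omega_star, parameterized by its angle
m = 2000                       # pre-training sample size
n_trials = 5000                # repetitions to view the sampling distribution

v = np.empty(n_trials)
for t in range(n_trials):
    samples = phi_star + 0.3 * rng.standard_normal(m)   # noisy observations of the direction
    phi_hat = estimate_angle(samples)                    # hat{Omega}_m
    v[t] = log_map_circle(phi_star, phi_hat)             # v_m = log_{Omega_star}(hat{Omega}_m)

scaled = np.sqrt(m) * v
print("mean of sqrt(m) * v_m:", scaled.mean())           # close to 0
print("std  of sqrt(m) * v_m:", scaled.std())            # stabilizes as m grows (tangent-space CLT)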

Figures (1)

  • Figure 1: Numerical evaluation of the limiting quantity $\mathbb{E} \|\mathcal{L}(Z)\|_{L^2(\mu_{\mathrm{down}})}^2$ in the Gaussian mixture example, for a block-structured signal where each mixture component $i \in [K]$ is associated with a parameter block $(\theta_i^\star, b_i^\star)$ proportional to $\frac{1}{i}\,\mathbf{1}_{r_\star+1}$, with the full vector $\theta_\star = (\theta_1^\star, b_1^\star, \ldots, \theta_K^\star, b_K^\star)$ normalized to unit Euclidean norm. (a) varies the number of blocks $K$ with $d=30$ and $\beta=2.0$, showing a monotonically increasing concave trend well captured by a quadratic fit $aK^2 + bK + c$ ($R^2 = 0.99$); (b) varies the ambient dimension $d$ with $K=4$ and $\beta=2.0$, exhibiting slow growth consistent with a logarithmic fit $a + b \log d$ ($R^2 = 0.89$); (c) varies $\beta$ with $K=4$ and $d=20$, displaying rapid decay consistent with a power-law fit $C \beta^{\alpha}$ ($R^2 = 0.90$). These results suggest that the scaling behavior of the interaction term is complex and may differ across parameter regimes, with distinct behaviors potentially emerging at small and large values of $K$, $d$, and $\beta$.
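
For concreteness, the snippet below constructs the block-structured signal $\theta_\star$ exactly as described in the caption and demonstrates the power-law fitting step used for panel (c). The values being fit are synthetic stand-ins generated inside the snippet, not numbers from the paper's experiments, and the use of scipy's curve_fit is an assumption about tooling.

import numpy as np
from scipy.optimize import curve_fit

def block_structured_theta(K, r_star):
    # theta_star = (theta_1, b_1, ..., theta_K, b_K): block i is proportional to
    # (1/i) * ones(r_star + 1), and the full vector is normalized to unit Euclidean norm.
    blocks = [np.ones(r_star + 1) / i for i in range(1, K + 1)]
    theta = np.concatenate(blocks)
    return theta / np.linalg.norm(theta)

theta_star = block_structured_theta(K=4, r_star=5)
print(theta_star.shape, np.linalg.norm(theta_star))        # (24,) 1.0

# Power-law fit C * beta**alpha, as in panel (c). The y-values are synthetic stand-ins
# generated from a known power law plus noise, purely to demonstrate the fitting step.
rng = np.random.default_rng(0)
betas = np.linspace(0.5, 3.0, 6)
values = 3.0 * betas ** -1.5 * np.exp(0.05 * rng.standard_normal(betas.size))

def power_law(beta, C, alpha):
    return C * beta ** alpha

(C_hat, alpha_hat), _ = curve_fit(power_law, betas, values, p0=(1.0, -1.0))
print(f"fitted C = {C_hat:.3f}, alpha = {alpha_hat:.3f}")  # recovers roughly C = 3, alpha = -1.5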

Theorems & Definitions (132)

  • Theorem 4.1: Asymptotic normality on the descriptor manifold, informal
  • Lemma 4.2: Orbit-invariance of the minimum-norm downstream predictor
  • Remark 5.1: Representation dependence vs. intrinsic objects
  • Proposition 5.1: Exact conditional risk decomposition
  • Theorem 5.2: Main result: asymptotic behavior of the conditional excess test risk
  • Corollary 6.2: Linear spectral model
  • Corollary 6.3
  • Corollary 6.4
  • Lemma 6.6
  • Lemma A.1: Basic pseudoinverse identities
  • ...and 122 more