Evaluating Self-Supervised Learning via Risk Decomposition

Yann Dubois; Tatsunori Hashimoto; Percy Liang

TL;DR

This study introduces an SSL-specific risk decomposition that generalizes the classical supervised approximation-estimation decomposition to capture errors arising from representation learning. It defines four components (approximation, representation usability, probe generalization, and encoder generalization) and provides efficient estimators for each, evaluated on ImageNet. Applying these estimators to 169 pretrained SSL models shows that probe generalization is currently the largest source of error, and that a tradeoff between usability and sample efficiency shapes full- vs few-shot performance. The findings offer actionable guidance for SSL design (e.g., ViT encoders, larger projection heads, augmentations) and provide a scalable benchmarking toolkit, though the estimators rely on the encoder and the probe sharing pretraining data. Overall, the work gives a quantitative way to diagnose SSL errors and guide design choices, with open-source tooling to reproduce the results.

Abstract

Self-supervised learning (SSL) pipelines differ in many design choices such as the architecture, augmentations, or pretraining data. Yet SSL is typically evaluated using a single metric: linear probing on ImageNet. This does not provide much insight into why or when a model is better, nor how to improve it. To address this, we propose an SSL risk decomposition, which generalizes the classical supervised approximation-estimation decomposition by considering errors arising from the representation learning step. Our decomposition consists of four error components: approximation, representation usability, probe generalization, and encoder generalization. We provide efficient estimators for each component and use them to analyze the effect of 30 design choices on 169 SSL vision models evaluated on ImageNet. Our analysis gives valuable insights for designing and using SSL models. For example, it highlights the main sources of error and shows how to improve SSL in specific settings (full- vs few-shot) by trading off error components. All results and pretrained models are at https://github.com/YannDubs/SSL-Risk-Decomposition.
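
As a minimal sketch of how such a decomposition can be read, the four components can be treated as successive differences between risks measured in settings of increasing limitation, so that they sum back to the final risk. The function and argument names below are hypothetical, the order of the path is an assumption, and the numbers are made up for illustration; none of this is taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class RiskComponents:
    approximation: float
    usability: float
    probe_generalization: float
    encoder_generalization: float

    def total(self) -> float:
        # The components telescope, so their sum is the risk in the final setting.
        return (self.approximation + self.usability
                + self.probe_generalization + self.encoder_generalization)


def decompose(r_approx: float, r_usable: float,
              r_probe: float, r_full: float) -> RiskComponents:
    """Split the final risk into four components as successive differences.

    Arguments are risks in increasingly limited settings (hypothetical names,
    assumed ordering):
      r_approx: only the encoder and probe families are constrained
      r_usable: the encoder is additionally trained with the SSL algorithm
      r_probe:  the probe is additionally trained on finite supervised data
      r_full:   the encoder is additionally trained on finite unlabeled data
    """
    return RiskComponents(
        approximation=r_approx - 0.0,  # difference from the zero-risk start
        usability=r_usable - r_approx,
        probe_generalization=r_probe - r_usable,
        encoder_generalization=r_full - r_probe,
    )


# Made-up risks for illustration only (not results from the paper).
parts = decompose(r_approx=0.01, r_usable=0.12, r_probe=0.27, r_full=0.30)
```

Because each component is a difference between adjacent settings, lowering one component (for example, usability) only lowers the final risk if the others do not grow in response, which is one way to read the tradeoffs mentioned above.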

Paper Structure (109 sections, 13 equations, 32 figures, 8 tables, 1 algorithm)


Introduction
Supervised risk decomposition
SSL risk decomposition
Estimating risk components for SSL
Experimental results
Major sources of errors
Representation usability drove improvements
Probe generalization is now the bottleneck
Encoder generalization is small and constant
Approximation error is negligible
Tradeoffs and full- vs few-shot performance
Predicting performance across settings
Probe generalization signals sample efficiency
Error components predict performance across settings
Tradeoffs
...and 94 more sections

Figures (32)

Figure 1: No model is uniformly better over risk components. "full-shot" axis shows linear probing on ImageNet. Other axes show normalized risk components. Higher is better. Top left (blue) shows average over all 169 models.
Figure 2: The risk decomposition is a path between settings of increasing expected risk for training the probe: $0 \to \mathrm{R}_{\mathcal{F}}$ (constrained family $\mathcal{F}$) $\to \mathrm{R}_{S}$ (finite supervised data).
Figure 3: Our SSL decomposition is a path between settings of increasing expected risk. Columns show the probe's limitations (constrained $\mathcal{F}$, finite supervised data $S$) as in the supervised risk decomposition. Rows show the encoder's limitations (constrained $\Phi$, SSL algorithm $\mathrm{A}_{\Phi}$, finite unlabeled data $U$). Risk components (colored) are the differences between risks in two settings.
Figure 4: The major SSL improvements came from usability, but probe generalization is now the largest source of error. The plot shows risk components of the best ImageNet-pretrained model published in a given year. Lower is better. In the appendix, we show similar trends for the average models.
Figure 5: Our estimated risk components are closely related to performance in different settings. (a) Usability error of the best $20\%$ of models increases as the number of training samples decreases, while probe generalization error decreases. (b) The performance predicted by our scaling law (x-axis) is close to the true performance (y-axis) for all data settings.
...and 27 more figures
