Evaluating Representations with Readout Model Switching

Yazhe Li, Jorg Bornschein, Marcus Hutter

TL;DR

This paper treats the evaluation of representations as a model selection problem and proposes to use the Minimum Description Length (MDL) principle to devise an evaluation metric that takes both model complexity and data efficiency into account.

Abstract

Although much of the success of Deep Learning builds on learning good representations, a rigorous method to evaluate their quality is lacking. In this paper, we treat the evaluation of representations as a model selection problem and propose to use the Minimum Description Length (MDL) principle to devise an evaluation metric. Contrary to the established practice of limiting the capacity of the readout model, we design a hybrid discrete and continuous-valued model space for the readout models and employ a switching strategy to combine their predictions. The MDL score takes both model complexity and data efficiency into account. As a result, the most appropriate model for the specific task and representation will be chosen, making it a unified measure for comparison. The proposed metric can be efficiently computed with an online method, and we present results for pre-trained vision encoders of various architectures (ResNet and ViT) and objective functions (supervised and self-supervised) on a range of downstream tasks. We compare our method with accuracy-based approaches and show that the latter are inconsistent when multiple readout models are used. Finally, we discuss important properties revealed by our evaluations, such as model scaling, preferred readout model, and data efficiency.
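
To make the idea of an online, switched description length concrete, the following is a minimal sketch. It is not the authors' implementation: it uses a generic fixed-share mixture with a constant switching rate, and the function name `prequential_codelength`, the parameter `alpha`, and the toy probabilities are all hypothetical. Each entry `step_probs[t][k]` is assumed to be the probability that readout model `k`, trained only on examples seen before step `t`, assigned to the label observed at step `t`.

```python
import math


def prequential_codelength(step_probs, alpha=0.05):
    """Cumulative log-loss (in nats) of a fixed-share mixture of K predictors.

    step_probs[t][k]: probability model k gave to the label observed at step t.
    alpha: assumed constant switching rate (hypothetical choice for this sketch).
    """
    num_models = len(step_probs[0])
    weights = [1.0 / num_models] * num_models  # uniform prior over models
    total_nats = 0.0
    for probs in step_probs:
        # Mixture prediction for the observed label, then pay its log-loss.
        mix = sum(w * p for w, p in zip(weights, probs))
        total_nats += -math.log(mix)
        # Bayesian posterior over which model is currently "active".
        posterior = [w * p / mix for w, p in zip(weights, probs)]
        # Fixed-share step: keep most mass, redistribute alpha uniformly.
        weights = [(1.0 - alpha) * v + alpha / num_models for v in posterior]
    return total_nats


# Toy data: model 0 predicts well early, model 1 predicts well later.
toy = [[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 5
switched = prequential_codelength(toy)
single = [sum(-math.log(step[k]) for step in toy) for k in range(2)]
print(switched, min(single))
```

On this toy sequence the switched codelength is shorter than that of either single model, because the mixture can follow whichever model predicts best at each stage. This is the behaviour the description-length score is meant to reward.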

Paper Structure (35 sections, 3 theorems, 22 equations, 10 figures, 11 tables, 2 algorithms)

This paper contains 35 sections, 3 theorems, 22 equations, 10 figures, 11 tables, 2 algorithms.

Key Result

Theorem 1

Let a sequence $x^N = (x_i)_{i=1}^N$ be sampled i.i.d. from distribution $P$. Let readout models $M_1, \dots, M_K$ be in the exponential family. $\mu^*_k=\mathbb{E}_P[T_k(X)]$, where $T_k(\cdot)$ is the sufficient statistic, is an element of the mean value parameter space of model $M_k$. Denote $\hat{\the

Figures (10)

  • Figure 1: Illustration of switching between models of different complexity: Depending on the number of training examples, either $A$, $B$, or $C$ has the best generalization performance. An optimally switched model will have the best performance at each point and thus the lowest prequential description length (= area under the curve).
  • Figure 2: Readout model that most often has the highest probability $\mathop{\mathrm{arg\,max}}\limits_k \frac{1}{T} {\sum_{t=1}^T} p(\xi_t=k|\mathcal{D}_{<t})$ on 19 VTAB datasets. 0 is linear readout; 1 to 7 are MLPs with 1 to 7 hidden layers.
  • Figure 3: Data efficiency: We plot the cumulative next-step log-loss for a range of readouts with a ResNet-50 backbone as a function of (downstream) sample size (in nats, lower is better).
  • Figure 4: Comparison of train and test accuracies on the original and reshuffled VTAB datasets.
  • Figure 5: Regret of the Bayesian mixture, elementwise mixture, and switching distribution compared to the fixed share baseline on ImageNet. The pretrained model is BYOL with a ResNet-50 backbone.
  • ...and 5 more figures

Theorems & Definitions (6)

  • Theorem 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • proof: Proof of Theorem 1