Predicting the Performance of Foundation Models via Agreement-on-the-Line

Rahul Saxena, Taeyoun Kim, Aman Mehra, Christina Baek, Zico Kolter, Aditi Raghunathan

TL;DR

This work demonstrates that when lightly finetuning multiple runs from a single foundation model, the choice of randomness during training can lead to drastically different levels of agreement-on-the-line in the resulting ensemble, and that only random head initialization reliably induces agreement-on-the-line in finetuned foundation models across vision and language benchmarks.

Abstract

Estimating out-of-distribution (OOD) performance in regimes where labels are scarce is critical to safely deploying foundation models. Recently, it was shown that ensembles of neural networks exhibit the phenomenon of "agreement-on-the-line", which can be leveraged to reliably predict OOD performance without labels. However, in contrast to classical neural networks that are trained on in-distribution (ID) data from scratch for numerous epochs, foundation models undergo minimal finetuning from heavily pretrained weights, which may reduce the ensemble diversity needed to observe agreement-on-the-line. In our work, we first demonstrate that when lightly finetuning multiple runs from a single foundation model, the choice of randomness during training (linear head initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. Surprisingly, only random head initialization is able to reliably induce agreement-on-the-line in finetuned foundation models across vision and language benchmarks. Second, we demonstrate that ensembles of multiple foundation models pretrained on different datasets but finetuned on the same task can also show agreement-on-the-line. Overall, by carefully constructing a diverse ensemble, we can use agreement-on-the-line-based methods to predict the OOD performance of foundation models with high precision.
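
To make the label-free prediction step concrete, here is a minimal sketch of an agreement-based estimator. It is not the authors' implementation: the function names are hypothetical, and the probit scaling of rates is an assumption carried over from earlier accuracy-on-the-line and agreement-on-the-line work rather than something stated in this summary. The sketch fits a line to pairwise ID vs OOD agreement (which needs no labels) and then assumes accuracy follows the same line.

```python
import itertools
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_probit = np.vectorize(_STD_NORMAL.inv_cdf)    # inverse standard normal CDF
_normal_cdf = np.vectorize(_STD_NORMAL.cdf)    # standard normal CDF


def probit(p, eps=1e-6):
    """Map rates in [0, 1] to probit scale, clipping to avoid infinities."""
    return _probit(np.clip(p, eps, 1.0 - eps))


def pairwise_agreement(preds):
    """Fraction of samples on which each pair of models predicts the same label.

    preds: int array of shape (n_models, n_samples). No labels are needed.
    """
    pairs = itertools.combinations(range(len(preds)), 2)
    return np.array([np.mean(preds[i] == preds[j]) for i, j in pairs])


def estimate_ood_accuracy(id_preds, ood_preds, id_labels):
    """Estimate each model's OOD accuracy without using OOD labels."""
    # Fit the agreement line on probit-scaled ID vs OOD agreement.
    slope, bias = np.polyfit(
        probit(pairwise_agreement(id_preds)),
        probit(pairwise_agreement(ood_preds)),
        deg=1,
    )
    # Assume accuracy lies on the same line; map labeled ID accuracy through it.
    id_acc = np.mean(id_preds == id_labels, axis=1)
    return _normal_cdf(slope * probit(id_acc) + bias)
```

The estimate is only as good as the assumption that the agreement line matches the accuracy line, which is exactly what depends on how the ensemble's diversity is generated.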

Paper Structure (58 sections, 8 equations, 38 figures, 18 tables)

Figures (38)

  • Figure 1: The ID vs OOD lines for accuracy (blue) and agreement (orange) for various datasets and fine-tuned ensembles. Each blue dot corresponds to a member of the ensemble and represents the ID (x) and OOD (y) accuracy. Each orange dot corresponds to a pair of these members and represents the ID (x) and OOD (y) agreement. From CIFAR10 to CIFAR10C "Pixelate" in linear-probed CLIP, MNLI to SNLI in fully fine-tuned OPT, and SQuAD to SQuAD-Shifts "Amazon" in fully fine-tuned GPT2, we observe that, among the diversity sources used to generate ensembles (columns), randomly initializing the head yields the agreement linear fit closest to that of accuracy.
  • Figure 2: In ensembles with diverse random initializations, accuracy-on-the-line (ACL) and agreement-on-the-line (AGL) hold across benchmarks in linear-probed CLIP models. As in Baek et al. (2022), neither ACL nor AGL holds for Camelyon17-WILDS.
  • Figure 3: AGL can be observed between models finetuned from different base models (Llama, GPT, OPT), both in F1 score for the question-answering shift (SQuAD to SQuAD-Shifts) and in accuracy for text classification (MNLI-Matched to MNLI-Mismatched and SNLI). SQuAD-Shifts New Wiki, SQuAD-Shifts NYT, and MNLI-Mismatched show little drop in OOD performance because the shift from the corresponding ID dataset is small. Nonetheless, we observe that AGL holds regardless of the degree of distribution shift.
  • Figure 4: ID vs OOD accuracy and agreement of linear probed CLIP models on OfficeHome Art (top row), Product (middle row), and Real World (bottom row). The figure title is the OOD domain.
  • Figure 5: AGL and ACL for all CIFAR10C shifts when finetuning with random head initialization.
  • ...and 33 more figures
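
The diversity sources compared in Figure 1 (random head initialization, data ordering, and data subsetting) can be illustrated with a small sketch. Everything here is hypothetical scaffolding rather than the paper's setup: the `make_run` helper, the 0.01 initialization scale, the 50% subset fraction, and the seed offsets are assumptions chosen only to show that each source varies exactly one piece of randomness across ensemble members while the rest stays shared. The sketch does not perform any finetuning.

```python
import numpy as np

SOURCES = ("head_init", "data_order", "data_subset")


def make_run(source, seed, n_train, n_features, n_classes, subset_frac=0.5):
    """Return (head, order) for one ensemble member; only `source` varies with `seed`."""
    if source not in SOURCES:
        raise ValueError(f"unknown diversity source: {source!r}")
    member_rng = np.random.default_rng(1000 + seed)  # member-specific randomness
    # Randomness that is NOT the chosen source comes from fixed, shared seeds.
    head_rng = member_rng if source == "head_init" else np.random.default_rng(0)
    order_rng = member_rng if source == "data_order" else np.random.default_rng(1)

    # Linear head placed on top of the shared pretrained features.
    head = head_rng.normal(scale=0.01, size=(n_features, n_classes))
    # Order in which training examples are visited.
    order = order_rng.permutation(n_train)
    if source == "data_subset":
        # Each member trains on its own random subset, visited in the shared order.
        keep = member_rng.choice(
            n_train, size=int(subset_frac * n_train), replace=False
        )
        order = order[np.isin(order, keep)]
    return head, order
```

For example, `make_run("head_init", seed, ...)` called with different seeds returns different heads but an identical data order, so any diversity in the resulting ensemble can be attributed to the head initialization alone.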