Comparing Foundation Models using Data Kernels

Brandon Duderstadt, Hayden S. Helm, Carey E. Priebe

TL;DR

This work addresses the challenge of comparing foundation models without committing to a single downstream metric by focusing on the geometry of embedding spaces. It builds a data-kernel $A = \text{TOP}_k(YY^{\top})$ and models it as a Random Dot Product Graph $A \sim \text{RDPG}(ZZ^{\top})$, enabling consistent latent-position estimation via adjacency spectral embedding up to orthogonal transformations. A joint omnibus embedding aligns multiple data kernels to enable per-datum hypothesis testing through bootstrap, and an ablation study demonstrates its capacity to surface representation changes due to data interventions. Extending to population-level analysis, the paper defines a model-manifold distance based on aligned latent positions, showing that manifold distance correlates with downstream metrics like classifier agreement and pseudo-perplexity, thereby supporting a taxonomic view of foundation-model families and suggesting avenues for model selection and privacy-aware analysis.
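The first two steps of the pipeline summarized above, building the data kernel $A = \text{TOP}_k(YY^{\top})$ and estimating latent positions by adjacency spectral embedding, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `data_kernel` and `adjacency_spectral_embedding` are hypothetical, $\text{TOP}_k$ is read as "keep each row's $k$ largest off-diagonal inner products", and the kernel is symmetrized so it can be treated as an undirected graph.

```python
import numpy as np


def data_kernel(Y, k):
    """Binary top-k neighbour graph built from the Gram matrix Y @ Y.T.

    Y is an (n, p) matrix of embeddings, one row per datum.
    Assumptions: self-similarities are ignored and the result is
    symmetrized (an edge exists if either endpoint selects the other).
    """
    S = Y @ Y.T
    np.fill_diagonal(S, -np.inf)  # a datum is never its own neighbour
    n = S.shape[0]
    # Column indices of the k largest entries in each row (order irrelevant).
    idx = np.argpartition(-S, k, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(A, A.T)


def adjacency_spectral_embedding(A, d):
    """Estimate d-dimensional latent positions from a symmetric adjacency matrix.

    Uses the d algebraically largest eigenpairs and scales each eigenvector
    by the square root of its eigenvalue (clipped at zero). The estimate is
    only identifiable up to an orthogonal transformation.
    """
    vals, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
    vals, vecs = vals[-d:][::-1], vecs[:, -d:][:, ::-1]
    return vecs * np.sqrt(np.clip(vals, 0.0, None))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(200, 16))  # stand-in for model embeddings
    A = data_kernel(Y, k=10)
    Z_hat = adjacency_spectral_embedding(A, d=4)
    print(A.shape, Z_hat.shape)
```

Because the latent positions are recovered only up to rotation, two separately embedded kernels are not directly comparable, which is why the approach relies on a joint embedding to align them.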

Abstract

Recent advances in self-supervised learning and neural network scaling have enabled the creation of large models, known as foundation models, which can be easily adapted to a wide range of downstream tasks. The current paradigm for comparing foundation models involves evaluating them with aggregate metrics on various benchmark datasets. This method of model comparison is heavily dependent on the chosen evaluation metric, which makes it unsuitable for situations where the ideal metric is either not obvious or unavailable. In this work, we present a methodology for directly comparing the embedding space geometry of foundation models, which facilitates model comparison without the need for an explicit evaluation metric. Our methodology is grounded in random graph theory and enables valid hypothesis testing of embedding similarity on a per-datum basis. Further, we demonstrate how our methodology can be extended to facilitate population level model comparison. In particular, we show how our framework can induce a manifold of models equipped with a distance function that correlates strongly with several downstream metrics. We remark on the utility of this population level model comparison as a first step towards a taxonomic science of foundation models.
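To make the per-datum hypothesis test concrete, the sketch below jointly embeds two data kernels through an omnibus matrix and compares each datum's two latent positions against a bootstrap null. It is one plausible reading of the procedure, not the authors' code: the helper names (`ase`, `omnibus_distances`, `sample_rdpg`, `per_datum_pvalues`), the choice to estimate the null from the first kernel alone, and the number of bootstrap draws are all assumptions made for illustration.

```python
import numpy as np


def ase(A, d):
    """Adjacency spectral embedding from the d largest eigenpairs of A."""
    vals, vecs = np.linalg.eigh(A)
    vals, vecs = vals[-d:][::-1], vecs[:, -d:][:, ::-1]
    return vecs * np.sqrt(np.clip(vals, 0.0, None))


def omnibus_distances(A1, A2, d):
    """Jointly embed two graphs on the same n data; return per-datum distances.

    The omnibus matrix places A1 and A2 on the diagonal blocks and their
    average on the off-diagonal blocks, so both sets of latent positions
    land in one shared coordinate system.
    """
    n = A1.shape[0]
    avg = (A1 + A2) / 2.0
    M = np.block([[A1, avg], [avg, A2]])
    Z = ase(M, d)  # rows 0..n-1: graph 1, rows n..2n-1: graph 2
    return np.linalg.norm(Z[:n] - Z[n:], axis=1)


def sample_rdpg(P, rng):
    """Draw a symmetric, hollow adjacency matrix with edge probabilities P."""
    upper = np.triu(rng.random(P.shape) < P, k=1).astype(float)
    return upper + upper.T


def per_datum_pvalues(A1, A2, d, n_boot=50, seed=0):
    """Bootstrap p-value per datum for 'both kernels share a latent position'.

    Assumption: the null is simulated by drawing pairs of graphs from the
    RDPG fitted to A1 and recomputing the omnibus distances.
    """
    rng = np.random.default_rng(seed)
    observed = omnibus_distances(A1, A2, d)
    Z1 = ase(A1, d)
    P_hat = np.clip(Z1 @ Z1.T, 0.0, 1.0)
    null = np.stack([
        omnibus_distances(sample_rdpg(P_hat, rng), sample_rdpg(P_hat, rng), d)
        for _ in range(n_boot)
    ])
    # Add-one correction keeps the estimated p-values strictly positive.
    return (1 + (null >= observed).sum(axis=0)) / (1 + n_boot)
```

A small p-value for a datum suggests that the two models place it differently relative to the rest of the corpus, beyond what sampling noise in the kernels would explain.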

Paper Structure (19 sections, 5 equations, 3 figures, 2 algorithms)


Figures (3)

  • Figure 1: The UMAP projections of the adjacency spectral embeddings of the BERT data kernel (left), a data kernel sampled from RDPG-BERT (center-left), and the joint embeddings (center-right) for a random subset of 10,000 English Wikipedia articles. Each article is represented once in the left and center-left figures and twice in the center-right figure. The right panel shows the joint embeddings corresponding to the BERT data kernel with a random set of 10 articles emphasized. The concentric circles around the emphasized articles have radii equal to the 68th, 90th, and 99th percentiles of the bootstrap null distribution described in the hypothesis-testing section.
  • Figure 2: Individual embeddings (top left), aligned embeddings (bottom left), plant-highlighted (top center), comparison (bottom center), and bootstrapped hypothesis test (right) comparing the representations of language models trained on different corpora: one baseline and one plant-ablated. The data kernels under study are defined on a random subset of 10,000 documents from the DBPedia14 evaluation set.
  • Figure 3: Theoretical (the 2-d simplex) and empirical manifolds induced via multi-dimensional scaling of low-dimensional representations of data kernels (left). The manifold distance correlates strongly with classifier similarity to a landmark model (right) and pseudo-perplexity on a language modeling task (center).