Consistent estimation of generative model representations in the data kernel perspective space

Aranyak Acharyya; Michael W. Trosset; Carey E. Priebe; Hayden S. Helm

TL;DR

The paper addresses how to consistently estimate a low-dimensional data kernel perspective space (DKPS) for a collection of generative models from their responses to a set of queries. It uses raw-stress multidimensional scaling (MDS) to embed pairwise discrepancies between model mean responses into $\mathbb{R}^d$, yielding an estimate $\hat{\boldsymbol{\psi}}=\mathrm{mds}(\mathbf{D})$ of the population DKPS $\boldsymbol{\psi}=\mathrm{mds}(\boldsymbol{\Delta})$, with $D_{ii'}=\frac{1}{m}\|\bar{\mathbf{X}}_{i}-\bar{\mathbf{X}}_{i'}\|_F$ and $\Delta_{ii'}=\frac{1}{m}\|\boldsymbol{\mu}_i-\boldsymbol{\mu}_{i'}\|_F$. The authors establish sufficient conditions for consistency under three regimes: fixed models and fixed queries, fixed models and growing queries, and growing models and growing queries. In each regime, convergence is tied to the behavior of the replication count $r$, the query count $m$, and the model count $n$. The results are validated empirically on language models and text-to-image models. The main contributions are the theoretical results (theorems and corollaries) that guarantee convergence of the estimated DKPS to the population DKPS up to affine transformations, along with practical guidance on how replication, query, and model growth influence estimation quality. The work provides a principled, low-dimensional embedding framework for comparing generative models and tracking their evolution, with implications for model safety, data mixture inference, and benchmarking.
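The definitions of $\mathbf{D}$ and of the raw-stress objective can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the nested-list layout `all_responses[i][j][k]` (model, query, replicate), the function names, and the toy numbers are all hypothetical, and the raw stress is written in its standard form, the sum over pairs of squared differences between embedded distances and dissimilarities.

```python
import math

def mean_response(replicates):
    """Average the r replicate embeddings (each a list of p floats) of one query."""
    r = len(replicates)
    p = len(replicates[0])
    return [sum(rep[k] for rep in replicates) / r for k in range(p)]

def frobenius_distance(A, B):
    """Frobenius norm of A - B for two m x p matrices stored as nested lists."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def dissimilarity_matrix(all_responses):
    """D[i][h] = (1/m) * ||Xbar_i - Xbar_h||_F, with Xbar_i the m x p matrix of mean responses."""
    means = [[mean_response(reps) for reps in model] for model in all_responses]
    n, m = len(means), len(means[0])
    return [[frobenius_distance(means[i], means[h]) / m for h in range(n)]
            for i in range(n)]

def raw_stress(Z, D):
    """Raw stress of a configuration Z (n points in R^d) against dissimilarities D."""
    n = len(Z)
    return sum((math.dist(Z[i], Z[h]) - D[i][h]) ** 2
               for i in range(n) for h in range(i + 1, n))

# Toy data (hypothetical): n = 2 models, m = 2 queries, r = 2 replicates, p = 1.
toy = [
    [[[0.0], [2.0]], [[1.0], [1.0]]],  # model 0: mean responses [1.0], [1.0]
    [[[4.0], [4.0]], [[5.0], [5.0]]],  # model 1: mean responses [4.0], [5.0]
]
D_toy = dissimilarity_matrix(toy)      # off-diagonal entry: sqrt(9 + 16) / 2 = 2.5
```

A configuration that reproduces the dissimilarities exactly, such as `[[0.0], [2.5]]` for the toy data, has zero raw stress; an MDS embedding is any minimizer of this objective, which is why the embedding is determined only up to distance-preserving transformations.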

Abstract

Generative models, such as large language models and text-to-image diffusion models, produce relevant information when presented with a query. Different models may produce different information when presented with the same query. As the landscape of generative models evolves, it is important to develop techniques to study and analyze differences in model behaviour. In this paper we present novel theoretical results for embedding-based representations of generative models in the context of a set of queries. In particular, we establish sufficient conditions for the consistent estimation of the model embeddings in situations where the query set and the number of models grow.

Paper Structure (18 sections, 8 theorems, 28 equations, 3 figures)

Introduction
Preliminaries
Notations
Multidimensional scaling by raw stress embedding
Description of the setting
Consistency of estimated model embeddings
Fixed set of models and fixed set of queries
Fixed set of models and growing set of queries
Growing set of models and growing set of queries
Numerical experiments
Fixed collection of models and fixed set of queries
Fixed collection of models and growing set of queries
Growing collection of models and growing set of queries
Discussion
Proofs of lemmas and theorems
...and 3 more sections

Key Result

Theorem 1

Let $\widehat{\boldsymbol{\psi}} \in \mathrm{MDS}(\mathbf{D}) \subset \mathbb{R}^{n \times d}$. Then there exists a subsequence $\lbrace r_u \rbrace_{u=1}^{\infty}$ of $\lbrace r \rbrace_{r=1}^{\infty}$ such that for all pairs $(i,i') \in [n]^2$, … where $\boldsymbol{\psi}= [\boldsymbol{\psi}_1|\boldsymbol{\psi}_2|\dots | \boldsymbol{\psi}_n]^T \in \mathrm{MDS}(\boldsymbol{\Delta}) \subset \mathbb{R}^{n \times d}$ …

Figures (3)

Figure 1: Numerical evidence of the consistency of $\widehat{\boldsymbol{\psi}}$ to $\boldsymbol{\psi}$ for fixed $n$ and fixed $m$ for a collection of language models (left) and a collection of text-to-image models (right). The black dots in the left figure of each pair are the 2-d perspectives of models induced with randomly selected queries, $R$ replicates each, and a domain-specific embedding function. The red circles have radius equal to the average $L_{2}$ distance between $\widehat{\psi}^{*}$ and the model representations estimated with $r$ replicates for each query. The right figure of each pair shows the distribution of the average $L_{2}$ norm in the DKPS across models for various $r$. More replicates improve estimation quality.
Figure 2: Numerical evidence of the consistency of $\widehat{\boldsymbol{\psi}}$ to $\boldsymbol{\psi}$ for fixed $n$ and growing $m$. The black dots in the left of each pair of figures are the 2-d perspectives of $n$ models induced with $M$ queries and $R$ replicates each. The red circles have radius equal to the average $L_{2}$ distance between the "ground truth" and the model representations estimated for selected $(m, r)$ pairs. The right figures show the average maximum row $L_{2}$ norm for various $(m, r)$ pairs. More replicates and more queries both improve estimation quality, and the number of queries appears to have the larger effect.
Figure 3: Numerical evidence of the consistency of $\widehat{\boldsymbol{\psi}}$ to $\boldsymbol{\psi}$ for growing $n$ and growing $m$. The two target DKPS $\widehat{\psi}$ were estimated using $N$, $M$, and $R$.
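
The qualitative effect of replication reported in these figures can be reproduced on synthetic data. The sketch below is not the paper's experiment: the population means, the Gaussian response noise, the sizes `n`, `m`, `p`, and the function name `simulate_max_error` are all hypothetical choices made for illustration. It draws fixed population means $\boldsymbol{\mu}_i$, averages $r$ noisy replicates per query to form $\bar{\mathbf{X}}_i$, and reports the largest entrywise gap between $\mathbf{D}$ and $\boldsymbol{\Delta}$.

```python
import math
import random

def simulate_max_error(r, n=4, m=5, p=3, noise=1.0, seed=0):
    """Largest |D[i][h] - Delta[i][h]| when each query is replicated r times.

    Assumption: replicate k of model i on query j is mu[i][j] plus i.i.d.
    Gaussian noise. The paper's setting is more general than this toy model.
    """
    rng = random.Random(seed)
    # Population mean responses; identical across calls that share a seed,
    # because they are drawn before any replicate noise.
    mu = [[[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
          for _ in range(n)]
    # Sample means over r noisy replicates per (model, query) pair.
    xbar = [[[sum(mu[i][j][k] + rng.gauss(0.0, noise) for _ in range(r)) / r
              for k in range(p)] for j in range(m)] for i in range(n)]

    def scaled_frobenius(A, B):
        sq = sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))
        return math.sqrt(sq) / m

    return max(abs(scaled_frobenius(xbar[i], xbar[h]) - scaled_frobenius(mu[i], mu[h]))
               for i in range(n) for h in range(i + 1, n))

err_few = simulate_max_error(r=1)     # one replicate per query
err_many = simulate_max_error(r=400)  # many replicates per query
```

Under these assumptions the sample means concentrate around the population means at rate $1/\sqrt{r}$, so `err_many` should be far smaller than `err_few`. Since the raw-stress embedding depends on the data only through the dissimilarity matrix, a shrinking gap between $\mathbf{D}$ and $\boldsymbol{\Delta}$ is the mechanism behind the improvement with $r$ seen in the figures.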

Theorems & Definitions (12)

Remark 1
Remark 2
Remark 3
Remark 4
Theorem 1
Lemma 1
Theorem 2
Theorem 3
Corollary 1
Lemma 2
...and 2 more