Consistent estimation of generative model representations in the data kernel perspective space
Aranyak Acharyya, Michael W. Trosset, Carey E. Priebe, Hayden S. Helm
TL;DR
The paper addresses how to consistently estimate a low-dimensional data kernel perspective space (DKPS) for a collection of generative models from their responses to a set of queries. It uses raw-stress multidimensional scaling (MDS) to embed pairwise discrepancies between model mean responses into $\mathbb{R}^d$, yielding an estimate $\hat{\boldsymbol{\psi}}=\mathrm{mds}(\mathbf{D})$ of the population DKPS $\boldsymbol{\psi}=\mathrm{mds}(\boldsymbol{\Delta})$, with $D_{ii'}=\frac{1}{m}\|\bar{\mathbf{X}}_{i}-\bar{\mathbf{X}}_{i'}\|_F$ and $\Delta_{ii'}=\frac{1}{m}\|\boldsymbol{\mu}_i-\boldsymbol{\mu}_{i'}\|_F$. The authors establish sufficient conditions for consistency under three growth regimes of models and queries, linking convergence to the behavior of the replication count $r$, the query count $m$, and the model count $n$, and they provide empirical validation on language models and text-to-image models. The main contributions are the theoretical results (theorems and corollaries) that guarantee convergence of the estimated DKPS to the population DKPS up to affine transformations, along with practical guidance on how replication, query, and model growth influence estimation quality. This work provides a principled foundation for comparing generative models and tracking their evolution through a low-dimensional embedding framework, with implications for model safety, data mixture inference, and benchmarking.
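To make the construction of $\mathbf{D}$ and $\hat{\boldsymbol{\psi}}$ concrete, the following is a minimal sketch, not the authors' implementation. It assumes the embedded responses are stored in a hypothetical array of shape `(n, r, m, p)` (models, replicates, queries, embedding dimension), and it minimizes raw stress with a plain SMACOF (Guttman transform) iteration, which is one standard way to do so; the function names, the random initialization, and the iteration count are all illustrative choices.

```python
import numpy as np


def discrepancy_matrix(responses):
    """D[i, i'] = (1/m) * ||Xbar_i - Xbar_i'||_F for responses of shape (n, r, m, p)."""
    m = responses.shape[2]
    means = responses.mean(axis=1)                    # Xbar_i, shape (n, m, p)
    diff = means[:, None] - means[None, :]            # shape (n, n, m, p)
    return np.sqrt((diff ** 2).sum(axis=(2, 3))) / m


def pairwise_dist(Z):
    """Euclidean distances between the rows of Z."""
    diff = Z[:, None] - Z[None, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def raw_stress(Z, D):
    """Raw stress: sum over pairs i < i' of (||z_i - z_i'|| - D[i, i'])^2."""
    return 0.5 * ((pairwise_dist(Z) - D) ** 2).sum()


def raw_stress_mds(D, d=2, n_iter=300, seed=0):
    """Approximate mds(D) in R^d via SMACOF iterations (assumed solver)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Z = rng.normal(size=(n, d))                       # random starting configuration
    for _ in range(n_iter):
        dist = pairwise_dist(Z)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(dist > 0, D / dist, 0.0)
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)     # rows of B sum to zero
        Z = B @ Z / n                                 # Guttman transform
    return Z


# Synthetic example: 5 models, 10 replicates, 4 queries, 3-dimensional embeddings.
rng = np.random.default_rng(1)
responses = rng.normal(size=(5, 10, 4, 3))
D = discrepancy_matrix(responses)
psi_hat = raw_stress_mds(D, d=2)
```

Any configuration returned this way is determined only up to rotation, reflection, and translation, and SMACOF can stop at a local minimum of the raw stress, so in practice several random starts may be worth comparing.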
Abstract
Generative models, such as large language models and text-to-image diffusion models, produce relevant information when presented with a query. Different models may produce different information when presented with the same query. As the landscape of generative models evolves, it is important to develop techniques to study and analyze differences in model behaviour. In this paper we present novel theoretical results for embedding-based representations of generative models in the context of a set of queries. In particular, we establish sufficient conditions for the consistent estimation of the model embeddings in situations where the query set and the number of models grow.
