Concentration bounds on response-based vector embeddings of black-box generative models
Aranyak Acharyya, Joshua Agterberg, Youngser Park, Carey E. Priebe
TL;DR
This work develops finite-sample concentration bounds for response-based embeddings of a set of black-box generative models using Data Kernel Perspective Space (DKPS) embedding. It proves an entrywise concentration result for the doubly-centered dissimilarity matrix $\textbf{B}$ around its noisy counterpart $\hat{\textbf{B}}$, under bounded response variability and a growth regime $r=\omega(n^3)$, and then translates this into a high-probability bound on the estimated DKPS embeddings $\hat{\psi}$ up to an orthogonal transformation, via Weyl and Davis–Kahan perturbation theory. The main convergence rate is expressed as a cubic polynomial in $(n^3/r)^{1/2-\boldsymbol{\theta}}$, enabling practitioners to choose a sufficient replicate size $r$ for a target accuracy. Numerical simulations and real-data experiments with large language models validate the bounds, demonstrating meaningful finite-sample guarantees for comparing and inferring properties of black-box generative systems. The framework also points to extensions to classical MDS under noisy dissimilarities and guides future work on relaxing assumptions and improving bound sharpness for broader applicability.
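To make the step from an entrywise bound to a spectral one concrete, the following is a sketch of one standard route, written under the usual classical MDS conventions. It is not necessarily the exact argument of the paper. The symbols $\mathbf{H}$, $\mathbf{D}^{(2)}$, and $\lambda_k$ are assumptions introduced here for illustration.

```latex
% Assumed conventions: D^{(2)} is the entrywise square of the n x n
% dissimilarity matrix, and \hat{D}^{(2)} is its noisy (sample) version.
\begin{align*}
\mathbf{H} &= \mathbf{I}_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top}, \\
\mathbf{B} &= -\tfrac{1}{2}\,\mathbf{H}\,\mathbf{D}^{(2)}\,\mathbf{H}, \qquad
\hat{\mathbf{B}} = -\tfrac{1}{2}\,\mathbf{H}\,\hat{\mathbf{D}}^{(2)}\,\mathbf{H}, \\
\|\hat{\mathbf{B}} - \mathbf{B}\|_{2}
  &\le \|\hat{\mathbf{B}} - \mathbf{B}\|_{F}
  \le n \max_{i,j} \bigl|\hat{B}_{ij} - B_{ij}\bigr|, \\
\bigl|\lambda_k(\hat{\mathbf{B}}) - \lambda_k(\mathbf{B})\bigr|
  &\le \|\hat{\mathbf{B}} - \mathbf{B}\|_{2}
  \qquad \text{(Weyl's inequality)}.
\end{align*}
```

Under these conventions, an entrywise bound on $\hat{\textbf{B}} - \textbf{B}$ controls the eigenvalues directly. Given an eigengap, a Davis–Kahan-type bound then controls the leading eigenvectors, and hence the embedding, up to an orthogonal transformation.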
Abstract
Generative models, such as large language models or text-to-image diffusion models, can generate relevant responses to user-given queries. Response-based vector embeddings of generative models facilitate statistical analysis and inference on a given collection of black-box generative models. The Data Kernel Perspective Space embedding, previously introduced in the literature, is one method of obtaining such response-based vector embeddings. In this paper, under appropriate regularity conditions, we establish high-probability concentration bounds on the sample vector embeddings of a given set of generative models obtained through Data Kernel Perspective Space embedding. Our results give the number of sample responses required to approximate the population-level vector embeddings to a desired level of accuracy. The algebraic tools used to establish our results can also be used to establish concentration bounds on Classical Multidimensional Scaling embeddings in general, when the dissimilarities are observed with noise.
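The role of the number of sample responses can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a dissimilarity of the form $D_{ij} = \lVert \bar{X}_i - \bar{X}_j \rVert_F / m$ between mean embedded responses, bounded uniform noise, and mean responses driven by hypothetical $d$-dimensional latent positions. All function and parameter names are illustrative.

```python
import numpy as np


def pairwise_dissimilarity(X):
    """D[i, j] = ||X[i] - X[j]||_F / m for X of shape (n, m, p) (assumed form)."""
    n, m, _ = X.shape
    flat = X.reshape(n, -1)
    diff = flat[:, None, :] - flat[None, :, :]
    return np.linalg.norm(diff, axis=2) / m


def classical_mds(D, d):
    """Classical MDS: return the n x d embedding and the doubly centered matrix B."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * H @ (D ** 2) @ H                # doubly centered squared dissimilarities
    evals, evecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:d]          # indices of the d largest eigenvalues
    lam = np.clip(evals[top], 0.0, None)       # guard against small negative values
    return evecs[:, top] * np.sqrt(lam), B


def procrustes_error(X_hat, X):
    """Largest row-wise distance after the best orthogonal alignment of X_hat to X."""
    U, _, Vt = np.linalg.svd(X_hat.T @ X)
    W = U @ Vt                                 # minimizes ||X_hat W - X||_F over orthogonal W
    return np.max(np.linalg.norm(X_hat @ W - X, axis=1))


def simulate_errors(n=5, m=10, p=3, d=2, r_values=(10, 100, 1000, 10000), seed=0):
    """Embedding error versus replicate count r under bounded uniform noise."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, d))                # hypothetical latent model positions
    A = rng.normal(size=(d, m * p))            # hypothetical map to mean responses
    mu = (z @ A).reshape(n, m, p)              # population mean responses
    psi, _ = classical_mds(pairwise_dissimilarity(mu), d)
    errors = {}
    for r in r_values:
        noise = rng.uniform(-1.0, 1.0, size=(r, n, m, p))   # bounded response noise
        X_bar = mu + noise.mean(axis=0)        # average over the r replicates
        psi_hat, _ = classical_mds(pairwise_dissimilarity(X_bar), d)
        errors[r] = procrustes_error(psi_hat, psi)
    return errors


if __name__ == "__main__":
    for r, err in simulate_errors().items():
        print(f"r = {r:>6d}   max row-wise embedding error = {err:.4f}")
```

Errors are measured after orthogonal Procrustes alignment because the embeddings are identifiable only up to an orthogonal transformation. In this toy setting the error should shrink as `r` grows, but the simulation says nothing about the constants or rates in the formal bounds.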
