Concentration bounds on response-based vector embeddings of black-box generative models

Aranyak Acharyya, Joshua Agterberg, Youngser Park, Carey E. Priebe

TL;DR

This work develops finite-sample concentration bounds for response-based embeddings of a set of black-box generative models using Data Kernel Perspective Space (DKPS) embedding. It proves an entrywise concentration result for the doubly-centered dissimilarity matrix $\mathbf{B}$ around its noisy counterpart $\hat{\mathbf{B}}$, under bounded response variability and a growth regime $r=\omega(n^3)$, and then translates this into a high-probability bound on the estimated DKPS embeddings $\hat{\boldsymbol{\psi}}$ up to an orthogonal transformation, via Weyl and Davis–Kahan perturbation theory. The main convergence rate is expressed as a cubic polynomial in $(n^3/r)^{1/2-\boldsymbol{\theta}}$, enabling practitioners to choose a sufficient replicate size $r$ for a target accuracy. Numerical simulations and real-data experiments with large language models validate the bounds, demonstrating meaningful finite-sample guarantees for comparing and inferring properties of black-box generative systems. The framework also points to extensions to classical MDS under noisy dissimilarities and guides future work on relaxing assumptions and improving bound sharpness for broader applicability.

Abstract

Generative models, such as large language models or text-to-image diffusion models, can generate relevant responses to user-given queries. Response-based vector embeddings of generative models facilitate statistical analysis and inference on a given collection of black-box generative models. The Data Kernel Perspective Space embedding is one particular method of obtaining response-based vector embeddings for a given set of generative models, already discussed in the literature. In this paper, under appropriate regularity conditions, we establish high probability concentration bounds on the sample vector embeddings for a given set of generative models, obtained through the method of Data Kernel Perspective Space embedding. Our results tell us the required number of sample responses needed in order to approximate the population-level vector embeddings with a desired level of accuracy. The algebraic tools used to establish our results can be used further for establishing concentration bounds on Classical Multidimensional Scaling embeddings in general, when the dissimilarities are observed with noise.
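To make the setting concrete, the following is a minimal sketch of Classical Multidimensional Scaling applied to dissimilarities that are observed with noise. It is an illustration under stated assumptions, not the paper's implementation: the Gaussian response model, the use of the Frobenius norm between mean responses as the dissimilarity, and all names (`classical_mds`, `pairwise_frobenius`, `mu`, `mu_hat`) are hypothetical choices made here.

```python
import numpy as np

def classical_mds(D, d):
    """Embed n points in R^d from an n x n matrix of pairwise dissimilarities."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * H @ (D ** 2) @ H              # doubly centered squared dissimilarities
    evals, evecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:d]        # indices of the d largest eigenvalues
    lam = np.clip(evals[top], 0.0, None)     # guard against tiny negative values
    return evecs[:, top] * np.sqrt(lam)

def pairwise_frobenius(M):
    """Frobenius distance between every pair of (m x p) matrices in M of shape (n, m, p)."""
    diff = M[:, None] - M[None, :]
    return np.sqrt((diff ** 2).sum(axis=(-2, -1)))

rng = np.random.default_rng(0)
n, m, p, d, r = 5, 3, 4, 2, 2000             # models, queries, response dim, embed dim, replicates

# Assumption: each model has a fixed mean response per query (population level).
mu = rng.normal(size=(n, m, p))
# Assumption: observed responses are the mean plus i.i.d. Gaussian noise.
samples = mu[:, None] + rng.normal(scale=0.5, size=(n, r, m, p))
mu_hat = samples.mean(axis=1)                # sample means over r replicates

D = pairwise_frobenius(mu)                   # population dissimilarities
D_hat = pairwise_frobenius(mu_hat)           # noisy (sample) dissimilarities

psi = classical_mds(D, d)                    # population-level embedding
psi_hat = classical_mds(D_hat, d)            # sample embedding
max_entry_gap = np.abs(D - D_hat).max()
print("largest entrywise gap between D and D_hat:", max_entry_gap)
```

Increasing `r` in this sketch shrinks the gap between `D_hat` and `D`, which is the quantity the concentration bounds control. The two embeddings are only comparable after an orthogonal alignment, since classical MDS determines them up to rotation and reflection.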

Paper Structure

This paper contains 12 sections, 5 theorems, 66 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 1

In our setting, suppose $\boldsymbol{\Sigma}_{ij} \in \mathbb{R}^{p \times p}$ is the covariance matrix associated with the distribution $F_{ij}$, and let $\gamma_{ij}=\mathrm{trace}(\boldsymbol{\Sigma}_{ij})$. Then, for any $\epsilon>0$,

Figures (1)

  • Figure 1: Violin plot (left panel) and histogram (right panel) of the estimation error $y=\min_{\mathbf{W} \in \mathcal{O}(d)} \left\lVert \hat{\boldsymbol{\psi}} \mathbf{W}- \boldsymbol{\psi} \right\rVert$ over $100$ Monte Carlo samples for each of $n=4,6,8,10,12$. For each value of $n$, every generative model provides $r=n^5$ i.i.d. responses to every query in each of the $100$ Monte Carlo samples. Both panels show that as $n$ increases, the estimation error concentrates more tightly near zero.
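The error in Figure 1 is a minimum over orthogonal matrices $\mathbf{W}$. A minimal sketch of how such a quantity can be computed follows, assuming the Frobenius norm (the caption does not say which norm is used), in which case the minimizer has the closed-form orthogonal Procrustes solution. The function name `procrustes_error` and the toy data are hypothetical.

```python
import numpy as np

def procrustes_error(psi_hat, psi):
    """Return min over orthogonal W of ||psi_hat @ W - psi||_F, and the minimizing W."""
    # If psi_hat^T psi = U S V^T, the Frobenius-norm minimizer is W = U V^T.
    U, _, Vt = np.linalg.svd(psi_hat.T @ psi)
    W = U @ Vt
    return np.linalg.norm(psi_hat @ W - psi), W

rng = np.random.default_rng(1)
psi = rng.normal(size=(8, 2))                # toy "population" embedding, n = 8, d = 2

# A rotated copy of psi should align back exactly (up to floating-point error).
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
psi_rot = psi @ Q
err_rot, W = procrustes_error(psi_rot, psi)

# A rotated and slightly perturbed copy leaves a small residual after alignment.
psi_noisy = psi_rot + 0.01 * rng.normal(size=psi.shape)
err_noisy, _ = procrustes_error(psi_noisy, psi)

print("error after aligning a pure rotation:", err_rot)
print("error after aligning a noisy rotation:", err_noisy)
```

Aligning before measuring the error matters because the embedding is identifiable only up to an orthogonal transformation, so an unaligned difference would mix estimation error with an arbitrary rotation.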

Theorems & Definitions (6)

  • Remark 1
  • Theorem 1
  • Corollary 1
  • Proposition 1
  • Theorem 2
  • Corollary 2