Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models
Michael Browder, Kevin Duh, J. David Harris, Vince Lyzinski, Paul McNamee, Youngser Park, Carey E. Priebe, Peter Viechnicki
TL;DR
The paper tackles the challenge of predicting and guaranteeing the quality of synthetic data produced by transformer models under data scarcity. It introduces Data Kernel Perspective Space (DKPS), a formalism that embeds multiple model outputs into a low-dimensional Euclidean space by applying multidimensional scaling (MDS) to a distance matrix derived from mean embeddings over a query set, enabling statistical guarantees on bias and variance. The authors apply DKPS to machine translation and Contrastive Preference Optimization (CPO) to show how synthetic data generated in batch vs. sequential modes, and in in-sample vs. out-of-sample scenarios, differ in geometry and uncertainty. The resulting analysis provides diagnostic tools for data provenance and debiasing. The framework offers a principled preprocessing and evaluation approach for synthetic data pipelines in NLP, with potential extensions to nonlinear embeddings and integration into downstream optimization schemes such as CPO.
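The embedding step described above (mean embeddings over a query set, a model-by-model distance matrix, then MDS) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input layout `(n_models, n_queries, n_replicates, dim)`, the Frobenius-norm distance scaled by the square root of the number of queries, the choice of classical (eigendecomposition-based) MDS, and all function names are hypothetical choices made here for illustration.

```python
import numpy as np


def mean_embeddings(responses):
    """Average replicate embeddings per (model, query) pair.

    responses: array of shape (n_models, n_queries, n_replicates, dim),
    an assumed layout for pre-computed response embeddings.
    """
    return responses.mean(axis=2)


def model_distance_matrix(means):
    """Pairwise distances between models.

    Assumption: distance is the Frobenius norm of the difference between
    two models' (n_queries x dim) mean-embedding matrices, scaled by
    1 / sqrt(n_queries).
    """
    n_models, n_queries, _ = means.shape
    flat = means.reshape(n_models, -1)
    diff = flat[:, None, :] - flat[None, :, :]
    return np.linalg.norm(diff, axis=-1) / np.sqrt(n_queries)


def classical_mds(dist, n_components=2):
    """Classical MDS: double-center squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives from round-off.
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


def dkps_embed(responses, n_components=2):
    """Map each model to one point in a low-dimensional Euclidean space."""
    dist = model_distance_matrix(mean_embeddings(responses))
    return classical_mds(dist, n_components)


# Toy usage with random stand-in embeddings (not real model outputs):
# 5 models, 20 queries, 10 sampled responses per query, 16-dim embeddings.
rng = np.random.default_rng(0)
toy_responses = rng.normal(size=(5, 20, 10, 16))
toy_coords = dkps_embed(toy_responses, n_components=2)
```

In this sketch each row of `toy_coords` is one model's position, so models whose average responses to the query set are similar land close together. A different distance or a nonlinear embedding could be substituted without changing the overall shape of the pipeline.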
Abstract
Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.
