Logits of API-Protected LLMs Leak Proprietary Information
Matthew Finlayson, Xiang Ren, Swabha Swayamdipta
TL;DR
The paper shows that API-protected LLMs inherently expose a low-dimensional image of their output space because of the softmax bottleneck, which enables efficient extraction of full-vocabulary outputs, the embedding size, and model-origin signatures. It introduces practical algorithms for reconstructing full next-token distributions from logit-biased API responses, including biased top-$k$ responses, with numerically stable and stochastic-case adaptations, and demonstrates a fast $O(d)$-query method that uses the LLM image as a basis. It further shows how the embedding size can be estimated (about 4096 for gpt-3.5-turbo) and how the image can be used to attribute outputs to specific models, detect updates (including LoRA changes), and support other auditing applications. The work also discusses mitigations, arguing that complete defenses are costly or degrade API utility, and frames the image as a potential tool for transparency and trust between providers and clients rather than a purely adversarial vulnerability.
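To make the idea of reconstructing a full distribution from logit-biased responses concrete, here is a minimal sketch. It is not the paper's algorithm, only the basic inversion such a method can rest on. It assumes a hypothetical API, `mock_api_top1`, that adds a bias to one token's logit and returns only the top-1 token and its probability. The names `mock_api_top1`, `unbias`, and `recover_distribution` are illustrative. If a token with true probability $p$ receives bias $b$, its observed probability is $p' = p e^{b} / (1 + p(e^{b} - 1))$, which can be solved for $p$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mock_api_top1(logits, bias_token, bias):
    """Hypothetical restricted API (an assumption for illustration):
    adds `bias` to one token's logit and returns only the top-1
    token index and its probability."""
    biased = logits.copy()
    biased[bias_token] += bias
    p = softmax(biased)
    top = int(np.argmax(p))
    return top, float(p[top])

def unbias(p_biased, bias):
    """Invert p' = p e^b / (1 + p (e^b - 1)) to get the unbiased p."""
    eb = np.exp(bias)
    return p_biased / (eb - p_biased * (eb - 1.0))

def recover_distribution(logits, bias=10.0):
    """Recover every token's probability with one biased query per token."""
    v = len(logits)
    recovered = np.empty(v)
    for i in range(v):
        tok, p = mock_api_top1(logits, i, bias)
        if tok != i:
            raise ValueError("bias too small to push token into the top-1 slot")
        recovered[i] = unbias(p, bias)
    return recovered

# Toy "model output" over a 5-token vocabulary.
toy_logits = np.array([2.0, 0.5, -1.0, 0.0, 1.5])
print(recover_distribution(toy_logits))
print(softmax(toy_logits))
```

This naive version needs one query per vocabulary token and loses precision when the bias is very large. It does not reproduce the numerically stable, stochastic-case, or $O(d)$-query variants that the summary mentions.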
Abstract
Large language model (LLM) providers often hide the architectural details and parameters of their proprietary models by restricting public access to a limited API. In this work we show that, with only a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1000 USD for OpenAI's gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We exploit this fact to unlock several capabilities, including (but not limited to) obtaining cheap full-vocabulary outputs, auditing for specific types of model updates, identifying the source LLM given a single full LLM output, and even efficiently discovering the LLM's hidden size. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.
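The softmax bottleneck observation can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not the authors' experimental code: a random matrix stands in for the model's final projection, the helper names `toy_logprobs` and `estimate_hidden_size` are hypothetical, and the outputs are noise-free. Because logits are computed as $Wh$ with $W \in \mathbb{R}^{v \times d}$ and $d \ll v$, full-vocabulary outputs collected from more than $d$ prompts span only about $d$ dimensions, so the numerical rank of the collected (centered) log-probabilities reveals the hidden size.

```python
import numpy as np

def toy_logprobs(v=200, d=16, n=64, seed=0):
    """Simulate n full-vocabulary log-probability outputs from a toy model
    with vocabulary size v and hidden size d (all values are assumptions)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(v, d))   # stand-in for the output projection matrix
    H = rng.normal(size=(d, n))   # one hidden state per prompt
    logits = W @ H                # shape (v, n), rank at most d
    m = logits.max(axis=0)
    lse = m + np.log(np.exp(logits - m).sum(axis=0))  # log-sum-exp per column
    return logits - lse           # log-softmax, one column per prompt

def estimate_hidden_size(logprobs, rel_tol=1e-8):
    """Estimate d as the numerical rank of the centered log-probabilities.

    Subtracting each output's mean over the vocabulary removes the
    normalization constant, which would otherwise add one extra dimension.
    """
    centered = logprobs - logprobs.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

outputs = toy_logprobs(v=200, d=16, n=64)
print(estimate_hidden_size(outputs))
```

In this idealized setting the singular values fall off sharply after the $d$-th one. Outputs recovered from a real API carry numerical noise, so a fixed relative tolerance like `rel_tol` would likely need to be replaced by inspecting where the singular values drop.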
