Logits of API-Protected LLMs Leak Proprietary Information

Matthew Finlayson; Xiang Ren; Swabha Swayamdipta

TL;DR

The paper shows that API-protected LLMs inherently expose a low-dimensional image of the output space due to the softmax bottleneck, enabling efficient extraction of full-vocabulary outputs, the embedding size, and model-origin signatures. It introduces practical algorithms to reconstruct full next-token distributions from biased or biased-top-$k$ API responses, including numerically stable and stochastic-case adaptations, and demonstrates a fast $O(d)$-query method using the LLM image as a basis. It further demonstrates how the embedding size can be estimated (around 4096 for gpt-3.5-turbo) and how the image can be used to attribute outputs to specific models, detect updates (including LoRA changes), and support additional auditing applications. The work also discusses mitigations, arguing that complete defenses are costly or degrade API utility, and frames the image as a potential tool for transparency and trust between providers and clients rather than a purely adversarial vulnerability.

Abstract

Large language model (LLM) providers often hide the architectural details and parameters of their proprietary models by restricting public access to a limited API. In this work we show that, with only a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1000 USD for OpenAI's gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We exploit this fact to unlock several capabilities, including (but not limited to) obtaining cheap full-vocabulary outputs, auditing for specific types of model updates, identifying the source LLM given a single full LLM output, and even efficiently discovering the LLM's hidden size. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.

Paper Structure (21 sections, 2 theorems, 24 equations, 6 figures, 2 tables, 2 algorithms)

Introduction
LLM outputs are restricted to a low-dimensional linear space
Obtaining full outputs from API-protected LLMs
Full outputs from APIs with logprobs
Numerically stable full outputs from APIs
Full outputs from stochastic APIs
Fast, full outputs using the LLM image
Discovering the embedding size of API-protected LLMs
Attributing model outputs and auditing model updates
More Applications
Detecting LoRA updates
Finding unargmaxable tokens
Recovering the softmax matrix from outputs
Basis-aware sampling
Mitigations
...and 6 more sections

Key Result

Theorem 1

LLM logits lie on a $d$-dimensional subspace of $\mathbb{R}^v$.
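The statement follows from the architecture described in the Figure 2 caption. A short sketch, assuming the softmax matrix has shape $v \times d$ with $d < v$ (the shape convention is an assumption here, not stated on this page):

$$
\boldsymbol\ell = \boldsymbol{W}\boldsymbol{h}, \quad \boldsymbol{W} \in \mathbb{R}^{v \times d},\ \boldsymbol{h} \in \mathbb{R}^{d}
\;\Longrightarrow\;
\boldsymbol\ell \in \operatorname{col}(\boldsymbol{W}), \quad \dim \operatorname{col}(\boldsymbol{W}) = \operatorname{rank}(\boldsymbol{W}) \le d .
$$

Every logit vector is a linear combination of the $d$ columns of $\boldsymbol{W}$, so the set of reachable logits has dimension at most $d$, and exactly $d$ when $\boldsymbol{W}$ has full column rank.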

Figures (6)

Figure 1: LLM outputs are constrained to a low-dimensional subspace of the full output space. We can use this fact to glean information about API-protected LLMs by analyzing their outputs. Here we show how a toy LLM's low-dimensional embeddings in $\mathbb{R}^d$ (illustrated here as a 1D space) are transformed linearly into logits in $\mathbb{R}^v$ (here, a 3D space) via the softmax matrix $\boldsymbol{W}$. The resulting outputs lie within a ($d=1$)-dimensional subspace of the output space. We call this low-dimensional subspace the image of the model. We can obtain a basis for the image of an API-protected LLM by collecting $d$ of its outputs. The LLM's image can reveal non-public information, such as the LLM's embedding size, but it can also be used for accountability, such as verifying which LLM an API is serving.
Figure 2: A typical language model architecture. After the input is processed by a neural network, usually a transformer (Vaswani et al., 2017), into a low-dimensional embedding $\boldsymbol{h}$, the embedding is multiplied by the softmax matrix $\boldsymbol{W}$, projecting it linearly from $\mathbb{R}^d$ into $\mathbb{R}^v$ to obtain the logit vector $\boldsymbol\ell$. The softmax function is then applied to the logit vector to obtain a valid probability distribution $\boldsymbol{p}$ over next-token candidates.
Figure 3: Points in the logit space $\mathbb{R}^v$ (far left) are mapped via the softmax function to points (probability distributions) on the simplex $\Delta_v$ (middle left). Crucially, the softmax maps all points that lie on the same diagonal (shown as points of the same color) to the same probability distribution. For numerical stability, these values are often stored as log-probabilities (middle right). The clr transform returns probability distributions to points in a subspace $U_v$ of the logit space (far right). The softmax function and clr transform are inverses of one another, and form an isomorphism between $U_v$ and $\Delta_v$.
Figure 4: The singular values of outputs from LLMs with various known and unknown embedding sizes $d$. For each model with known embedding size, there is a clear drop in magnitude at singular value index $d$, indicating the embedding size of the model. Using this observation, we can guess the embedding size of gpt-3.5-turbo to be 4096.
Figure 5: Residuals of the least-squares solution of $\boldsymbol{L}\boldsymbol{x}=\boldsymbol\ell$ for an output $\boldsymbol\ell$ from the pythia-70m checkpoint at training step 120000, and output matrices $\boldsymbol{L}$ from various Pythia model checkpoints. High residuals indicate that the output is not in a model's image.
...and 1 more figures
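
The ideas in the captions for Figures 3 to 5 can be illustrated on a toy model. The following is a minimal sketch, not the authors' implementation: the toy sizes, the helper names (`clr`, `estimate_embedding_size`, `image_residual`, `toy_outputs`), and the "largest ratio between consecutive singular values" heuristic are all assumptions made for illustration. It checks that softmax and clr round-trip, that the singular values of collected outputs drop sharply at index $d$, and that the least-squares residual separates outputs inside a model's image from outputs of a different model.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def clr(probs):
    """Centered log-ratio transform: log-probabilities minus their row mean."""
    logp = np.log(probs)
    return logp - logp.mean(axis=-1, keepdims=True)

def estimate_embedding_size(outputs):
    """Hypothetical heuristic: index of the largest drop in singular values."""
    s = np.linalg.svd(outputs, compute_uv=False)
    # Floor numerically-zero values so ratios between noise terms stay small.
    s = np.maximum(s, s[0] * np.finfo(float).eps)
    ratios = s[:-1] / s[1:]
    return int(np.argmax(ratios)) + 1

def image_residual(basis_outputs, output):
    """Norm of the least-squares residual of `output` against the span of
    the rows of `basis_outputs` (rows are collected model outputs)."""
    coeffs, *_ = np.linalg.lstsq(basis_outputs.T, output, rcond=None)
    return float(np.linalg.norm(basis_outputs.T @ coeffs - output))

def toy_outputs(W, n, rng):
    """Sample n random embeddings and return the toy model's distributions."""
    H = rng.standard_normal((n, W.shape[1]))
    return softmax(H @ W.T)

# Toy sizes (assumptions): vocabulary v, embedding size d, n collected outputs.
rng = np.random.default_rng(0)
v, d, n = 50, 8, 30
W = rng.standard_normal((v, d))          # toy softmax matrix
probs = toy_outputs(W, n, rng)           # what a full-output API would return
L = clr(probs)                           # outputs mapped back into logit space

estimated_d = estimate_embedding_size(L)

# One fresh output from the same toy model and one from a different model.
same_model = clr(toy_outputs(W, 1, rng))[0]
other_W = rng.standard_normal((v, d))
other_model = clr(toy_outputs(other_W, 1, rng))[0]

in_image_residual = image_residual(L, same_model)
out_of_image_residual = image_residual(L, other_model)
```

In this noiseless toy setting the drop in singular values is many orders of magnitude. With real API outputs the drop would be less clean, so any threshold or heuristic for locating it should be treated as a judgment call.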

Theorems & Definitions (7)

Theorem 1: Low-rank logits
proof
Theorem 2: Low-rank probabilities
proof
proof
proof
proof
