Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test

Xiaoyuan Zhu, Yaowen Ye, Tianyi Qiu, Hanlin Zhu, Sijun Tan, Ajraf Mannan, Jonathan Michala, Raluca Ada Popa, Willie Neiswanger

Abstract

As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, showing that it consistently achieves superior statistical power over prior methods under constrained query budgets.
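
The abstract names the idea but not its mechanics, so the following is only a minimal sketch of how a rank-based uniformity test can work in general. If the API and the local model behave identically, the rank of an API response's score among scores of local samples for the same prompt is uniformly distributed, and a departure from uniformity is evidence of substitution. Everything specific here is an assumption for illustration: the names `randomized_rank`, `cramer_von_mises`, and `audit` are hypothetical, the Gaussian samplers stand in for a real score function applied to model outputs, and the Cramér-von Mises statistic is one common choice of uniformity statistic, not necessarily the one the paper uses.

```python
import random

def randomized_rank(api_score, local_scores, rng):
    """Map api_score into (0, 1) via its rank among local_scores.

    Ties are broken at random, so the result is Uniform(0, 1) whenever the
    API score and the local scores come from the same distribution.
    """
    below = sum(s < api_score for s in local_scores)
    ties = sum(s == api_score for s in local_scores)
    slot = below + rng.randint(0, ties)  # random position among tied values
    return (slot + rng.random()) / (len(local_scores) + 1)

def cramer_von_mises(u):
    """Cramér-von Mises distance between the sample u and Uniform(0, 1)."""
    u = sorted(u)
    n = len(u)
    return 1.0 / (12 * n) + sum(
        (x - (2 * i + 1) / (2 * n)) ** 2 for i, x in enumerate(u)
    )

def audit(api_sampler, local_sampler, n_prompts=200, m=20, seed=0):
    """Return (statistic, reject) for a hypothetical audit.

    api_sampler and local_sampler are stand-ins (assumptions): each takes a
    random.Random and returns one scalar score for one sampled response.
    """
    rng = random.Random(seed)
    u = []
    for _ in range(n_prompts):
        api_score = api_sampler(rng)                      # one API response
        local_scores = [local_sampler(rng) for _ in range(m)]  # m local samples
        u.append(randomized_rank(api_score, local_scores, rng))
    w2 = cramer_von_mises(u)
    # 0.461 is the asymptotic 5% critical value of the Cramér-von Mises statistic.
    return w2, w2 > 0.461

# Synthetic stand-ins for score distributions (assumed, not from the paper).
same = lambda rng: rng.gauss(0.0, 1.0)
shifted = lambda rng: rng.gauss(2.0, 1.0)

w_null, reject_null = audit(same, same)    # API matches the local model
w_alt, reject_alt = audit(shifted, same)   # API serves a different model
print(f"matched: W2={w_null:.3f}  substituted: W2={w_alt:.3f}")
```

Because each prompt contributes only one API query and the prompts themselves need not be unusual, a test of this shape does not require a distinctive query pattern, which is consistent with the robustness claim above.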

Paper Structure

This paper contains 29 sections, 4 theorems, 44 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Lemma C.1

Let the rank of a token be a random variable $K$ on $\mathbb{Z}^+$ with probability mass function $p(k) = k^{-\alpha}/\zeta(\alpha)$ for $\alpha > 1$. Let $x = \log k$. The survival function $S(x) = P(\log K \ge x)$, for large $k=e^x$, has the asymptotic form
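
The lemma's displayed expression does not appear in this listing. As a hedged sketch of what a standard integral comparison yields under the stated assumptions (this is a textbook tail estimate for the zeta distribution, not a quotation of the paper's formula), bound the tail sum by integrals:

$$
\int_k^\infty t^{-\alpha}\,dt \;\le\; \sum_{j \ge k} j^{-\alpha} \;\le\; k^{-\alpha} + \int_k^\infty t^{-\alpha}\,dt,
\qquad
\int_k^\infty t^{-\alpha}\,dt = \frac{k^{1-\alpha}}{\alpha - 1}.
$$

Dividing by $\zeta(\alpha)$ and substituting $k = e^x$ suggests

$$
S(x) = P(K \ge e^x) \approx \frac{e^{-(\alpha - 1)x}}{(\alpha - 1)\,\zeta(\alpha)} \quad \text{for large } x,
$$

that is, $\log K$ has an approximately exponential tail with rate $\alpha - 1$.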

Figures (5)

  • Figure 1: Statistical power of different methods in detecting substitution of Gemma-2-9b-it with its 4-bit quantized variant, as the proportion of API responses from the quantized model increases. Our method significantly outperforms MMD (Gao et al., 2025) and the Kolmogorov–Smirnov (KS) baseline.
  • Figure 2: Distribution of AUROC scores for five candidate score functions across 500 trials comparing Gemma-2-9b-it and its 4-bit quantized variant. Log-rank achieves the most separable distribution from the random level $0.5$, indicating superior power in distinguishing different models.
  • Figure 3: AUC of SFT checkpoints across epochs.
  • Figure 4: AUC for detecting full model replacement. Each cell shows the AUC score between a reference and a target model. Diagonal values represent self-comparisons.
  • Figure 5: Statistical power AUC for detecting decoding parameter mismatches (temperature, top-$p$) across models and datasets. Each cell compares outputs under a specific decoding configuration against the default $(0.5, 1.0)$; higher values indicate stronger detectability.

Theorems & Definitions (8)

  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • Theorem C.1
  • proof
  • Theorem C.2
  • proof