Model Equality Testing: Which Model Is This API Serving?

Irena Gao, Percy Liang, Carlos Guestrin

TL;DR

This work formalizes auditing black-box LLM inference as Model Equality Testing, recasting it as a two-sample distribution test between a reference distribution $P$ and an API distribution $Q$. It shows that kernel-based MMD tests, particularly with a fast Hamming-based string kernel, provide strong power with small sample sizes (e.g., ~10 samples per prompt) across various distortions, including quantization, watermarking, and finetuning. The authors apply the test to commercial endpoints serving four Llama models, flagging 11 of 31 endpoints as deviating from the reference weights, and characterize inter-model distances and their relationship to task accuracy. They also release open-source tooling and a dataset to enable reproducible API auditing, offering a practical pathway for users to assess API faithfulness and stability over time. The work bears on reproducibility, transparency, and governance in AI deployment, and motivates further methodological improvements and extensions to other modalities.

Abstract

Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution -- possibly without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs from Summer 2024 for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
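To make the testing procedure in the abstract concrete, the following is a minimal sketch of an MMD two-sample test with a Hamming-style string kernel. It is an illustration under stated assumptions, not the authors' released implementation: it handles completions for a single prompt only, the kernel is taken to be the fraction of matching positions after padding to a fixed length, and the p-value comes from a standard permutation procedure. The function names (`hamming_kernel`, `mmd2_from_gram`, `mmd_permutation_test`) and the padding token are hypothetical. The statistic is the usual unbiased estimate of $\mathrm{MMD}^2(P, Q) = \mathbb{E}[k(x, x')] + \mathbb{E}[k(y, y')] - 2\,\mathbb{E}[k(x, y)]$ with $x, x' \sim P$ and $y, y' \sim Q$.

```python
import random

PAD = None  # hypothetical padding token for completions shorter than `length`


def hamming_kernel(x, y, length):
    """Fraction of the first `length` positions at which x and y agree.

    x and y are sequences of tokens (or characters); shorter sequences are
    padded, and two padded positions count as a match.
    """
    xs = list(x[:length]) + [PAD] * max(0, length - len(x))
    ys = list(y[:length]) + [PAD] * max(0, length - len(y))
    return sum(a == b for a, b in zip(xs, ys)) / length


def mmd2_from_gram(gram, idx_x, idx_y):
    """Unbiased MMD^2 estimate from a precomputed Gram matrix."""
    n, m = len(idx_x), len(idx_y)
    xx = sum(gram[i][j] for i in idx_x for j in idx_x if i != j) / (n * (n - 1))
    yy = sum(gram[i][j] for i in idx_y for j in idx_y if i != j) / (m * (m - 1))
    xy = sum(gram[i][j] for i in idx_x for j in idx_y) / (n * m)
    return xx + yy - 2 * xy


def mmd_permutation_test(ref_samples, api_samples, length=8, n_perm=200, seed=0):
    """Return (MMD^2 estimate, permutation p-value) for two sample sets."""
    rng = random.Random(seed)
    pooled = list(ref_samples) + list(api_samples)
    n = len(ref_samples)
    # Compute the kernel once on the pooled samples; permutations only
    # reshuffle which indices are assigned to each group.
    gram = [[hamming_kernel(a, b, length) for b in pooled] for a in pooled]
    idx = list(range(len(pooled)))
    observed = mmd2_from_gram(gram, idx[:n], idx[n:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        if mmd2_from_gram(gram, idx[:n], idx[n:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + exceed) / (1 + n_perm)


# Toy usage: 10 tokenized completions per source for one prompt.
reference = ["the cat sat on the mat"] * 6 + ["the cat sat on a mat"] * 4
api = ["a cat sits on the mat"] * 10
stat, p = mmd_permutation_test(
    [s.split() for s in reference], [s.split() for s in api], length=6
)
print(f"MMD^2 estimate = {stat:.3f}, permutation p-value = {p:.3f}")
```

A user would reject the null hypothesis that the API matches the reference when the p-value falls below a chosen significance level. Precomputing the Gram matrix keeps each permutation cheap, which matters when the sample budget is spread across many prompts.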

Paper Structure

This paper contains 22 sections, 11 equations, 4 figures, 1 table, 2 algorithms.

Figures (4)

  • Figure 1: (Left) We formalize auditing black-box language model inference APIs as Model Equality Testing. This enables us to assess an API's faithfulness to a reference distribution and its stability over time. (Right) We evaluate candidate tests and apply the most powerful one to Llama model APIs from Summer 2024, finding that 11 of 31 endpoints deviate from reference weights released by Meta.
  • Figure 2: (Left) Sample complexity of tests. At an average of just 10 samples per prompt, the Hamming MMD test is able to detect quantization and watermarking with nontrivial power. Curves show median power across alternative distributions $Q$, averaged over language models and prompt distributions; shaded regions show standard errors. Results stratified by language model and alternative are in Appendix \ref{app:additional_results_4_1}. (Middle) While other tests rapidly degrade in power when the user is interested in longer completions, the Hamming MMD test maintains power best across completion lengths. (Right) Power of the Hamming MMD test, stratified by alternative distribution. The test is significantly less powerful against the fp16 alternative.
  • Figure 3: (Upper left) The Hamming MMD test is able to detect when Llama-3 8B has been finetuned on datasets of 1,000 samples, even after a single epoch. Power rises higher and earlier when the finetuning distribution is i.i.d. with the testing distribution. (Lower left) The Hamming MMD test can also detect when two models are different with near-perfect power. Standard errors are over prompt distributions. Full results are in Appendix \ref{app:higher_dimensional}. (Right) The MMD framework allows us to estimate statistical distance between any models from which we can draw samples. The cells show average estimated MMDs over 10 bootstraps. Rows are sorted using spectral clustering with two components. Models within a family are typically clustered together, suggesting that factors like training data, rather than scale, determine model similarity.
  • Figure 4: (Left) Average MMD (Hamming) between providers for each model. Amazon Bedrock's Llama-3 and -3.1 70B models are the most different from the other providers. (Right) Absolute difference in HumanEval average accuracy vs. the MMD (Hamming). There is a moderate positive correlation between the MMD and the absolute difference in task accuracy. Gray points indicate pairs where both distributions have accuracy below 10%. There are multiple ways to be wrong on a task, and the MMD captures these differences.