Model Equality Testing: Which Model Is This API Serving?
Irena Gao, Percy Liang, Carlos Guestrin
TL;DR
This work formalizes the auditing of black-box LLM inference APIs as Model Equality Testing: a two-sample test of whether a reference distribution $P$ and an API distribution $Q$ are the same. It shows that tests based on the kernel Maximum Mean Discrepancy (MMD), particularly with a fast Hamming-based string kernel, achieve strong power at small sample sizes (an average of about 10 samples per prompt) against distortions including quantization, watermarking, and finetuning. Applied to commercial endpoints for four Llama models, the test flags 11 of 31 endpoints as serving distributions that differ from the reference weights released by Meta. The authors also characterize distances between models and their relationship to task accuracy, and they release open-source tooling and a dataset for reproducible API auditing, giving users a practical way to check an API's faithfulness and stability over time. The results bear on reproducibility, transparency, and governance in AI deployment, and they motivate further methodological improvements and extensions to other modalities.
Abstract
Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution -- possibly without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs from Summer 2024 for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
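To make the testing procedure concrete, the following is a minimal sketch of an MMD two-sample permutation test with a Hamming-style string kernel, for completions drawn from a single prompt. It is not the authors' released tooling. The function names (`hamming_kernel`, `mmd2`, `permutation_test`), the choice of the biased MMD estimator, the normalization of the kernel by sequence length, and the assumption that completions are already tokenized and padded to a common length are all illustrative choices made here, not details taken from the paper.

```python
import numpy as np


def hamming_kernel(a, b):
    """Gram matrix of a Hamming-style kernel (assumed form, for illustration).

    a: (n, L) integer array, b: (m, L) integer array of token ids, padded to
    a common length L. Entry (i, j) is the fraction of positions at which
    sequence a[i] and sequence b[j] hold the same token.
    """
    return (a[:, None, :] == b[None, :, :]).mean(axis=2)


def mmd2(x, y):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        hamming_kernel(x, x).mean()
        + hamming_kernel(y, y).mean()
        - 2.0 * hamming_kernel(x, y).mean()
    )


def permutation_test(x, y, n_perm=500, seed=0):
    """Return (observed statistic, permutation p-value) for H0: P = Q.

    x: samples from the reference distribution, y: samples from the API.
    The null distribution is simulated by shuffling the pooled samples and
    re-splitting them into groups of the original sizes.
    """
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y)
    pooled = np.concatenate([x, y], axis=0)
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if mmd2(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic stand-ins for tokenized completions (hypothetical data).
    rng = np.random.default_rng(1)
    reference = rng.integers(0, 50, size=(10, 20))
    api = rng.integers(0, 50, size=(10, 20))
    stat, p = permutation_test(reference, api)
    print(f"MMD^2 = {stat:.4f}, p-value = {p:.3f}")
```

A user would reject the hypothesis that the API serves the reference distribution when the p-value falls below a chosen significance level. The test described in the abstract pools evidence across many prompts; this sketch handles one prompt only and would need to be extended to aggregate over a prompt distribution.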
