
Log Probability Tracking of LLM APIs

Timothée Chauvin, Erwan Le Merrer, François Taïani, Gilles Tredan

TL;DR

The paper introduces logprob tracking (LT), a low-cost method for auditing LLM API consistency by analyzing first-token logprob distributions via a two-sample permutation test. It defines the TinyChange benchmark to evaluate detection sensitivity across subtle model changes and demonstrates that LT detects changes as small as a single fine-tuning step while being about 1,000x cheaper than existing methods. Real-world deployment across hundreds of API endpoints reveals widespread undocumented shifts, underscoring the practical need for continuous, lightweight monitoring. LT provides a pragmatic first line of defense for reproducibility and integrity, and can be integrated into existing audit pipelines to trigger deeper investigations when changes are detected.

Abstract

When using an LLM through an API provider, users expect the served model to remain consistent over time, a property crucial for the reliability of downstream applications and the reproducibility of research. Existing audit methods are too costly to apply at regular time intervals to the wide range of available LLM APIs. This means that model updates are left largely unmonitored in practice. In this work, we show that while LLM log probabilities (logprobs) are usually non-deterministic, they can still be used as the basis for cost-effective continuous monitoring of LLM APIs. We apply a simple statistical test based on the average value of each token logprob, requesting only a single token of output. This is enough to detect changes as small as one step of fine-tuning, making this approach more sensitive than existing methods while being 1,000x cheaper. We introduce the TinyChange benchmark as a way to measure the sensitivity of audit methods in the context of small, realistic model changes.
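To make the idea of a test on average token logprobs concrete, the following is a minimal sketch of a two-sample permutation test on simulated data. It is not the authors' implementation: the statistic `mean_logprob_gap`, the helper `simulate`, the noise level, and the sample sizes are all assumptions chosen for illustration, and a real audit would replace the simulated samples with first-token logprobs collected from an API at two points in time.

```python
import random
from statistics import fmean


def mean_logprob_gap(a, b):
    """Hypothetical test statistic (an assumption, not taken from the paper):
    the average over tokens of the absolute difference between the
    per-token mean logprobs of the two samples."""
    n_tokens = len(a[0])
    return fmean(
        abs(fmean(row[t] for row in a) - fmean(row[t] for row in b))
        for t in range(n_tokens)
    )


def permutation_test(reference, current, n_permutations=2000, seed=0):
    """Two-sample permutation test. Each sample is a list of responses,
    and each response is a list of logprobs for a fixed set of tokens.
    Returns a p-value for the hypothesis that both samples come from
    the same distribution."""
    rng = random.Random(seed)
    observed = mean_logprob_gap(reference, current)
    pooled = list(reference) + list(current)
    n_ref = len(reference)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled responses
        if mean_logprob_gap(pooled[:n_ref], pooled[n_ref:]) >= observed:
            hits += 1
    # The +1 correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


def simulate(means, n, noise, rng):
    """Simulated stand-in for API responses: noisy logprobs around fixed means."""
    return [[m + rng.gauss(0.0, noise) for m in means] for _ in range(n)]


rng = random.Random(42)
before = simulate([-0.5, -1.8, -3.2], 30, 0.02, rng)
same = simulate([-0.5, -1.8, -3.2], 30, 0.02, rng)      # unchanged model
shifted = simulate([-0.55, -1.7, -3.2], 30, 0.02, rng)  # slightly changed model

p_same = permutation_test(before, same)
p_shifted = permutation_test(before, shifted)
print(f"p-value, unchanged model: {p_same:.4f}")
print(f"p-value, changed model:   {p_shifted:.4f}")
```

Because the test only compares averages under random relabeling, it tolerates run-to-run noise in the logprobs while still flagging a shift in their means. A low p-value for the changed sample would be the signal to trigger a deeper investigation.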

Paper Structure

This paper contains 41 sections, 8 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Logprob non-determinism in the wild: logprobs returned by the GPT-4.1 API, in August 2025.
  • Figure 2: Setup of our change detection method logprob tracking (LT).
  • Figure 3: Minimal impact of the prompt length on the performance of LT. Average AUC across models and variants, 2,000 bootstraps at the model and test statistics levels.
  • Figure 4: Average ROC AUC by difficulty level and method across LLMs (in each plot, difficulty increases from left to right). 95% CIs from 10,000 bootstraps at the model and test statistic levels.
  • Figure 5: Dates of changes across providers and endpoints, queried hourly with the prompt "x". Periods where the endpoints weren't tracked are greyed out. 2 of the 37 changes (devstral-small-2505:free and mistral-small-3.1-24b-instruct:free) are also present in the non-free endpoints.
  • ...and 10 more figures