Predicting the Performance of Black-box LLMs through Follow-up Queries

Dylan Sam, Marc Finzi, J. Zico Kolter

TL;DR

This work introduces QueRE, a black-box approach that predicts LLM behavior by asking the model follow-up questions and using the resulting yes/no response probabilities as features for a simple linear predictor. The method predicts instance-level correctness on QA and reasoning tasks, detects models that have been adversarially influenced through a system prompt, and distinguishes between different black-box LLMs, all without access to internal model states. Theoretical analysis shows that the sampling-based approximation converges when top-k probabilities are unavailable, and empirical results demonstrate transferability, tight generalization bounds, and favorable latency-accuracy trade-offs. Overall, QueRE offers a practical, model-agnostic tool for monitoring and auditing black-box LLMs in real-world deployments.

Abstract

Reliably predicting the behavior of language models -- such as whether their outputs are correct or have been adversarially manipulated -- is a fundamentally challenging task. This is often made even more difficult as frontier language models are offered only through closed-source APIs, providing only black-box access. In this paper, we predict the behavior of black-box language models by asking follow-up questions and taking the probabilities of responses as representations to train reliable predictors. We first demonstrate that training a linear model on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks. Surprisingly, this can even outperform white-box linear predictors that operate over model internals or activations. Furthermore, we demonstrate that these follow-up question responses can reliably distinguish between a clean version of an LLM and one that has been adversarially influenced via a system prompt to answer questions incorrectly or to introduce bugs into generated code. Finally, we show that they can also be used to differentiate between black-box LLMs, enabling the detection of misrepresented models provided through an API. Overall, our work shows promise in monitoring black-box language model behavior, supporting their deployment in larger, autonomous systems.
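
The pipeline described in the abstract (follow-up questions, response probabilities as features, a linear predictor on top) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the follow-up questions in `FOLLOW_UPS`, the `yes_prob` callback, and the `mock_yes_prob_factory` simulator are all hypothetical stand-ins for a real black-box API, and the logistic regression is a plain gradient-descent version written out in NumPy.

```python
import numpy as np

# Hypothetical follow-up questions; the paper's actual prompts may differ.
FOLLOW_UPS = [
    "Are you confident in your answer?",
    "Could your answer be wrong?",
    "Would an expert agree with your answer?",
]

def extract_features(yes_prob, question, answer):
    """Return one feature vector: P("yes") for each follow-up question.

    `yes_prob` is a caller-supplied function (an assumption here) that wraps
    a black-box model and returns the probability of a "yes" reply.
    """
    return np.array([yes_prob(question, answer, f) for f in FOLLOW_UPS])

def add_bias(X):
    """Append a constant column so the linear model has an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Logistic regression fitted by batch gradient descent."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted probability that the model's answer is correct."""
    return 1.0 / (1.0 + np.exp(-add_bias(X) @ w))

def mock_yes_prob_factory(rng, correct):
    """Simulated black box: yes-probabilities shift with answer correctness."""
    means = [0.85, 0.2, 0.8] if correct else [0.15, 0.8, 0.5]
    def yes_prob(question, answer, follow_up):
        mu = means[FOLLOW_UPS.index(follow_up)]
        return float(np.clip(mu + 0.1 * rng.standard_normal(), 0.0, 1.0))
    return yes_prob

# Toy experiment on synthetic data (no real LLM is queried).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)  # 1 = answer was correct
X = np.stack([
    extract_features(mock_yes_prob_factory(rng, bool(c)), "q", "a")
    for c in labels
])
y = labels.astype(float)

w = fit_logistic(X[:300], y[:300])
test_acc = np.mean((predict_proba(w, X[300:]) > 0.5) == (y[300:] == 1))
print(f"held-out accuracy on synthetic data: {test_acc:.2f}")
```

In a real setting, `yes_prob` would be replaced by a call that reads the "yes" token probability from the API response; the synthetic accuracy printed here says nothing about the results reported in the paper.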

Paper Structure (61 sections, 2 theorems, 20 equations, 11 figures, 13 tables)

Key Result

Proposition 1

Let $\hat{\beta}$ be the MLE for the logistic regression on the dataset $\{(x_i^j, y_i) \mid i = 1, \dots, n,\ j = 1, \dots, k\}$, where $x_i^j$ are independent samples from $\mathrm{Ber}(p_i)$. We assume there exists some unique optimal set of weights $\beta_0$ over inputs $p = (p_1, \dots, p_d)$, and we let $n, k \gg$ ...
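
The sampling setting behind this proposition (probabilities are not exposed, so only sampled yes/no replies are observed) can be illustrated with a small simulation. The sketch below only shows the basic Monte Carlo step, estimating a hidden "yes" probability from $k$ sampled replies; it does not reproduce the proposition itself, which concerns logistic regression trained on the individual samples. The function `estimate_yes_prob`, the `sample_reply` callback, and the value `true_p = 0.7` are assumptions made for illustration.

```python
import numpy as np

def estimate_yes_prob(sample_reply, k):
    """Estimate P("yes") as the fraction of "yes" among k sampled replies.

    `sample_reply` is a hypothetical zero-argument function that queries a
    black-box model once and returns the string "yes" or "no".
    """
    return sum(sample_reply() == "yes" for _ in range(k)) / k

# Simulated black box with a hidden yes-probability (assumed value).
rng = np.random.default_rng(0)
true_p = 0.7

def sample_reply():
    return "yes" if rng.random() < true_p else "no"

# The estimate is an average of k independent Bernoulli(true_p) draws,
# so its standard error shrinks like sqrt(true_p * (1 - true_p) / k).
estimates = {k: estimate_yes_prob(sample_reply, k) for k in (10, 100, 10_000)}
for k, p_hat in estimates.items():
    std_err = np.sqrt(true_p * (1 - true_p) / k)
    print(f"k={k:>6}  estimate={p_hat:.3f}  theoretical std. error={std_err:.3f}")
```

Each extra sample costs one more API call, so the choice of $k$ trades estimation error against query cost.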

Figures (11)

  • Figure 1: Our approach predicts LLM behavior using linear predictors trained on features derived from follow-up questions posed to the LLM. We show that responses to follow-up questions are highly predictive of correctness on downstream benchmarks, and are useful in distinguishing between black-box models and for detecting if models have been influenced by an adversary.
  • Figure 2: AUROC in predicting model performance on the open-ended QA benchmarks of Natural Questions (Top) and SQuAD (Bottom). Dashed bars represent white-box methods, which assume more access than QueRE. QueRE often best predicts model performance on open-ended QA tasks, even when compared to white-box methods.
  • Figure 3: AUROC in predicting model performance on closed-ended QA benchmarks of HaluEval, BoolQ, and DHate. Dashed bars represent white-box methods.
  • Figure 4: Accuracy in distinguishing representations from LLMs of different sizes on the BoolQ task.
  • Figure 5: Left: AUROC as we vary the number of random samples $k$ used to approximate LLM probabilities with GPT-3.5 on HaluEval over 5 random seeds. We observe that there is not a significant dropoff in performance when using approximations due to sampling. Right: AUROC on predicting LLaMA3-70B performance on BoolQ with QueRE as we increase the number of follow-up questions. The shaded area represents the standard error.
  • ...and 6 more figures

Theorems & Definitions (3)

  • Proposition 1: Estimator on Finite Samples from LLM
  • Proposition 1: Estimator on Finite Samples from LLM
  • proof