Predicting the Performance of Black-box LLMs through Follow-up Queries

Dylan Sam; Marc Finzi; J. Zico Kolter

TL;DR

This work introduces QueRE, a black-box approach that predicts LLM behavior by asking the model follow-up questions and using the probabilities of its yes/no responses as features for a simple linear predictor. The method achieves strong instance-level accuracy on QA and reasoning tasks, and it can detect adversarial prompts and distinguish between different black-box LLMs, all without access to internal model states. Theoretical analysis shows convergence of the sampling-based approximation when top-k probabilities are unavailable, and empirical results demonstrate transferability, tight generalization bounds, and favorable latency-accuracy trade-offs. Overall, QueRE offers a practical, model-agnostic tool for monitoring and auditing black-box LLMs in real-world deployments.

Abstract

Reliably predicting the behavior of language models -- such as whether their outputs are correct or have been adversarially manipulated -- is a fundamentally challenging task. This is often made even more difficult as frontier language models are offered only through closed-source APIs, providing only black-box access. In this paper, we predict the behavior of black-box language models by asking follow-up questions and taking the probabilities of responses as representations to train reliable predictors. We first demonstrate that training a linear model on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks. Surprisingly, this can even outperform white-box linear predictors that operate over model internals or activations. Furthermore, we demonstrate that these follow-up question responses can reliably distinguish between a clean version of an LLM and one that has been adversarially influenced via a system prompt to answer questions incorrectly or to introduce bugs into generated code. Finally, we show that they can also be used to differentiate between black-box LLMs, enabling the detection of misrepresented models provided through an API. Overall, our work shows promise in monitoring black-box language model behavior, supporting their deployment in larger, autonomous systems.
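
The pipeline described in the abstract can be illustrated with a minimal sketch, under loudly stated assumptions: the follow-up questions in `FOLLOW_UPS` are hypothetical (the paper's actual prompts are not reproduced here), `mock_yes_probability` is a synthetic stand-in for a real black-box API call, and the linear predictor is a plain logistic regression trained by gradient descent. It is meant only to show the shape of the idea (yes-probabilities as features, a linear model on top), not the authors' implementation.

```python
import math
import random

# Hypothetical follow-up questions; the paper's actual prompts may differ.
FOLLOW_UPS = [
    "Are you confident in your answer?",
    "Could your answer be wrong?",
    "Would an expert agree with your answer?",
]

def mock_yes_probability(is_correct, question_idx, rng):
    """Synthetic stand-in for an API call returning P("yes") for one follow-up."""
    base = 0.7 if is_correct else 0.4
    if question_idx == 1:  # "Could your answer be wrong?" points the other way
        base = 1.0 - base
    return min(1.0, max(0.0, base + rng.gauss(0.0, 0.15)))

def extract_features(is_correct, rng):
    """One feature per follow-up question: the probability of a "yes" response."""
    return [mock_yes_probability(is_correct, j, rng) for j in range(len(FOLLOW_UPS))]

def make_dataset(n, rng):
    """Labels are 1 if the (simulated) LLM answer was correct, else 0."""
    y = [rng.randint(0, 1) for _ in range(n)]
    X = [extract_features(bool(label), rng) for label in y]
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Predicted probability that the LLM's answer is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic_regression(X, y, lr=1.0, epochs=500):
    """Full-batch gradient descent on the logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, X, y):
    hits = sum((predict(w, b, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y))
    return hits / len(y)

rng = random.Random(0)
X_train, y_train = make_dataset(300, rng)
X_test, y_test = make_dataset(200, rng)
w, b = train_logistic_regression(X_train, y_train)
test_accuracy = accuracy(w, b, X_test, y_test)
print("held-out accuracy on synthetic data:", test_accuracy)
```

In a real setting, `mock_yes_probability` would be replaced by a query to the deployed model, and the labels would come from grading the model's answers on a benchmark. The accuracy printed here reflects only the synthetic data generator and says nothing about the paper's results.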

Paper Structure (61 sections, 2 theorems, 20 equations, 11 figures, 13 tables)

Introduction
Related Work
Predicting Model Performance
Uncertainty Quantification in LLMs
Extracting Features from Neural Networks
Predicting Performance with Follow-up Queries
Predictive Features through Follow-up Responses
Generating Follow-up Prompts
Theoretical Analysis of Sampling-based Approximations
Experiments
Baselines
Datasets and Models
Predicting Model Correctness on QA and Reasoning Tasks
Detecting Adversarially Influenced LLMs
Distinguishing Between Black-box LLMs
...and 46 more sections

Key Result

Proposition 1

Let $\hat{\beta}$ be the MLE for the logistic regression on the dataset $\{(x_i^j, y_i) \mid i = 1, \dots, n,\ j = 1, \dots, k\}$, where $x_i^j$ are independent samples from $\mathrm{Ber}(p_i)$. We assume there exists some unique optimal set of weights $\beta_0$ over inputs $p = (p_1, \dots, p_d)$, and we let $n, k \gg \dots$

Figures (11)

Figure 1: Our approach predicts LLM behavior using linear predictors trained on features derived from follow-up questions posed to the LLM. We show that responses to follow-up questions are highly predictive of correctness on downstream benchmarks, and are useful in distinguishing between black-box models and for detecting if models have been influenced by an adversary.
Figure 2: AUROC in predicting model performance on the open-ended QA benchmarks of Natural Questions (Top) and SQuAD (Bottom). Dashed bars represent white-box methods, which assume more access than QueRE. QueRE often best predicts model performance on open-ended QA tasks, even when compared to white-box methods.
Figure 3: AUROC in predicting model performance on closed-ended QA benchmarks of HaluEval, BoolQ, and DHate. Dashed bars represent white-box methods.
Figure 4: Accuracy in distinguishing representations from LLMs of different sizes on the BoolQ task.
Figure 5: Left: AUROC as we vary the number of random samples $k$ used to approximate LLM probabilities with GPT-3.5 on HaluEval over 5 random seeds. We observe that there is not a significant dropoff in performance when using approximations due to sampling. Right: AUROC on predicting LLaMA3-70B performance on BoolQ with QueRE as we increase the number of follow-up questions. The shaded area represents the standard error.
...and 6 more figures
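
The sampling-based approximation mentioned in the Figure 5 caption can be sketched as follows. When an API does not expose token probabilities, the yes-probability of a follow-up question can be estimated as the fraction of "yes" answers among $k$ sampled responses; for a Bernoulli variable with parameter $p$, the standard deviation of this estimate is $\sqrt{p(1-p)/k}$, so it shrinks as $k$ grows. The function `sample_yes_no` below is a hypothetical stand-in for one sampled API response, and `true_p = 0.8` is an arbitrary illustrative value, not a number from the paper.

```python
import random

def sample_yes_no(true_p, rng):
    """Hypothetical stand-in for one sampled response: 1 for "yes", 0 for "no"."""
    return 1 if rng.random() < true_p else 0

def estimate_yes_probability(true_p, k, rng):
    """Approximate P("yes") by the fraction of "yes" answers among k samples."""
    return sum(sample_yes_no(true_p, rng) for _ in range(k)) / k

def mean_abs_error(true_p, k, trials, rng):
    """Average absolute estimation error over repeated experiments."""
    total = 0.0
    for _ in range(trials):
        total += abs(estimate_yes_probability(true_p, k, rng) - true_p)
    return total / trials

rng = random.Random(0)
true_p = 0.8  # arbitrary illustrative value
errors = {k: mean_abs_error(true_p, k, 2000, rng) for k in (1, 10, 100)}
for k, err in errors.items():
    print(f"k={k:>3}  mean |p_hat - p| = {err:.3f}")
```

The error decreases roughly in proportion to $1/\sqrt{k}$, which is consistent with the caption's observation that a modest number of samples already gives features close to the true probabilities.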

Theorems & Definitions (3)

Proposition 1: Estimator on Finite Samples from LLM
Proposition 1: Estimator on Finite Samples from LLM
proof
