Predicting the Performance of Black-box LLMs through Follow-up Queries
Dylan Sam, Marc Finzi, J. Zico Kolter
TL;DR
This work introduces QueRE, a black-box approach that predicts LLM behavior by asking the model follow-up queries and using the probabilities of its yes/no responses as features for a simple linear predictor. The method achieves strong instance-level accuracy on QA and reasoning tasks, and it can detect adversarially influenced models and distinguish between different LLMs, all without access to internal model states. Theoretical analysis shows that a sampling-based approximation converges when top-k probabilities are unavailable, and empirical results demonstrate transferability, tight generalization bounds, and favorable latency-accuracy trade-offs. Overall, QueRE offers a practical, model-agnostic tool for monitoring and auditing black-box LLMs in real-world deployments.
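The sampling-based approximation mentioned above can be illustrated with a minimal sketch: when an API returns only sampled text, the probability of a "yes" response can be estimated as the fraction of sampled answers that are "yes". Everything here is an assumption for illustration. `make_simulated_model` is a hypothetical stand-in for repeated API calls, and the function names are not taken from the paper.

```python
import random


def estimate_yes_probability(sample_answer, num_samples):
    """Monte Carlo estimate of P("yes"): the fraction of sampled answers equal to "yes"."""
    yes_count = sum(1 for _ in range(num_samples) if sample_answer() == "yes")
    return yes_count / num_samples


def make_simulated_model(true_yes_prob, seed):
    """Hypothetical stand-in for an API that returns sampled text only (no probabilities)."""
    rng = random.Random(seed)

    def sample_answer():
        return "yes" if rng.random() < true_yes_prob else "no"

    return sample_answer


# The estimate approaches the true probability as the number of samples grows.
simulated = make_simulated_model(true_yes_prob=0.7, seed=0)
for k in (10, 100, 10_000):
    print(k, estimate_yes_probability(simulated, k))
```

Each additional sample costs one more API call, so the number of samples is the knob that trades latency for accuracy in this sketch.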
Abstract
Reliably predicting the behavior of language models, such as whether their outputs are correct or have been adversarially manipulated, is a fundamentally challenging task. This is often made even more difficult as frontier language models are offered only through closed-source APIs, providing only black-box access. In this paper, we predict the behavior of black-box language models by asking follow-up questions and taking the probabilities of responses as representations to train reliable predictors. We first demonstrate that training a linear model on these responses reliably and accurately predicts model correctness on question-answering and reasoning benchmarks. Surprisingly, this can even outperform white-box linear predictors that operate over model internals or activations. Furthermore, we demonstrate that these follow-up question responses can reliably distinguish between a clean version of an LLM and one that has been adversarially influenced via a system prompt to answer questions incorrectly or to introduce bugs into generated code. Finally, we show that they can also be used to differentiate between black-box LLMs, enabling the detection of misrepresented models provided through an API. Overall, our work shows promise in monitoring black-box language model behavior, supporting their deployment in larger, autonomous systems.
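The pipeline described in the abstract can be sketched roughly as follows: ask a fixed set of follow-up questions, record the probability of a "yes" response to each, and fit a linear (here, logistic) model on those probabilities to predict whether the original answer was correct. This is a minimal sketch under stated assumptions, not the authors' implementation. The follow-up questions in `FOLLOW_UPS`, the simulated `make_stub_model`, and the plain gradient-descent trainer are all hypothetical, and the data is synthetic.

```python
import math
import random

# Hypothetical follow-up questions; the paper's actual prompts are not reproduced here.
FOLLOW_UPS = [
    "Are you confident in your answer?",
    "Could your answer be wrong?",
    "Would an expert agree with your answer?",
]


def extract_features(yes_probability, question, answer):
    """One feature per follow-up question: the probability that the model replies "yes"."""
    return [yes_probability(f"{question}\n{answer}\n{f}") for f in FOLLOW_UPS]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_linear_predictor(features, labels, lr=0.5, epochs=500):
    """Logistic regression fitted with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_correct(weights, bias, x):
    """True if the predictor estimates that the original answer was correct."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


def make_stub_model(answer_is_correct, rng):
    """Simulated stand-in for a black-box API call; not a real LLM."""
    def yes_probability(prompt):
        base = 0.75 if answer_is_correct else 0.4
        if "wrong" in prompt:  # the doubt question points the other way
            base = 1.0 - base
        return min(1.0, max(0.0, base + rng.uniform(-0.15, 0.15)))
    return yes_probability


def run_demo(seed=0, n_train=100, n_test=100):
    """Train on synthetic examples and return held-out accuracy."""
    rng = random.Random(seed)

    def make_split(n):
        xs, ys = [], []
        for i in range(n):
            correct = rng.random() < 0.5
            model = make_stub_model(correct, rng)
            xs.append(extract_features(model, f"Question {i}", f"Answer {i}"))
            ys.append(1 if correct else 0)
        return xs, ys

    train_x, train_y = make_split(n_train)
    test_x, test_y = make_split(n_test)
    weights, bias = train_linear_predictor(train_x, train_y)
    hits = sum(predict_correct(weights, bias, x) == bool(y)
               for x, y in zip(test_x, test_y))
    return hits / n_test


if __name__ == "__main__":
    print("held-out accuracy on synthetic data:", run_demo())
```

In a real deployment, `yes_probability` would wrap an API call that returns token probabilities (or a sampled estimate of them), and the labels would come from a benchmark with known answers. The synthetic accuracy printed here says nothing about real models.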
