Discovering and Learning Probabilistic Models of Black-Box AI Capabilities
Daniel Bramblett, Rushang Karia, Adrian Ciotinga, Ruthvick Suresh, Pulkit Verma, YooJung Choi, Siddharth Srivastava
TL;DR
The paper tackles the challenge of safely operating black-box AI (BBAI) systems by learning interpretable, probabilistic capability models that describe which intents a BBAI can achieve, under which conditions, and with what likelihood of each outcome. It introduces Probabilistic Capability Model Learning (PCML), an active-learning framework that uses Monte Carlo tree search (MCTS) to synthesize informative queries and constructs pessimistic and optimistic models to bound and refine the learned capabilities. The authors provide formal guarantees (soundness, completeness, convergence) and evaluate PCML empirically across diverse agents and environments, showing that it can reveal surprising limits and side effects in BBAIs while enabling safer deployment. The work contributes a scalable approach to learning user-centric, high-level representations of BBAI behavior that can guide deployment, design, and safety verification.
Abstract
Black-box AI (BBAI) systems such as foundation models are increasingly being used for sequential decision making. To ensure that such systems are safe to operate and deploy, it is imperative to develop efficient methods that can provide a sound and interpretable representation of the BBAI's capabilities. This paper shows that PDDL-style representations can be used to efficiently learn and model an input BBAI's planning capabilities. It uses the Monte Carlo tree search paradigm to systematically create test tasks, acquire data, and prune the hypothesis space of possible symbolic models. Learned models describe a BBAI's capabilities, the conditions under which they can be executed, and the possible outcomes of executing them along with their associated probabilities. Theoretical results show soundness, completeness, and convergence of the learned models. Empirical results with multiple BBAI systems illustrate the scope, efficiency, and accuracy of the presented methods.
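To make concrete what a learned model contains according to the abstract (a capability, the conditions under which it can be executed, and its possible outcomes with probabilities), the following is a minimal Python sketch. It is illustrative only: the `Capability` class, its fields, and the example facts are hypothetical names, and the simple frequency estimate of outcome probabilities is an assumption made for this sketch, not PCML's learning procedure.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Capability:
    """Hypothetical record for one capability (not the paper's data structure)."""
    name: str
    precondition: frozenset  # facts that must hold before execution
    outcome_counts: Counter = field(default_factory=Counter)

    def applicable(self, state: frozenset) -> bool:
        # The capability can be executed only if every precondition fact holds.
        return self.precondition <= state

    def record(self, before: frozenset, after: frozenset) -> None:
        # Store one observed execution as an (added facts, removed facts) effect.
        effect = (frozenset(after - before), frozenset(before - after))
        self.outcome_counts[effect] += 1

    def outcome_probabilities(self) -> dict:
        # Assumption: plain relative frequencies over the observed effects.
        total = sum(self.outcome_counts.values())
        return {effect: n / total for effect, n in self.outcome_counts.items()}


# Hypothetical example: a pick-up capability observed four times.
pick = Capability("pick_up_cup", frozenset({"hand_empty", "cup_on_table"}))
start = frozenset({"hand_empty", "cup_on_table"})
success = frozenset({"holding_cup"})

for _ in range(3):
    pick.record(start, success)  # three successful executions
pick.record(start, start)        # one execution that changed nothing

probs = pick.outcome_probabilities()
```

Here `probs` maps each observed effect to its relative frequency, so a user can read off both the intended outcome and any alternative outcomes of the capability. The set-of-facts state encoding loosely mirrors the PDDL-style representation mentioned in the abstract, but the details are chosen only for brevity.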
