Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering
Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach
TL;DR
This work addresses the lack of statistical guarantees in generative model outputs by introducing SCOPE-Gen, a sequential conformal prediction framework that combines an i.i.d. generation stage with greedy filtering stages. The key idea is to exploit a Markov-chain factorization of admissibility across three steps, so that each stage can be calibrated independently with a one-dimensional conformal prediction procedure, which reduces the number of costly admissibility evaluations. Empirical results on natural language generation and molecular graph extension show substantial reductions in queries, time, and final set size compared to baselines such as CLM, while maintaining the desired admissibility level of $1-\alpha$. The approach is relevant to safety-critical applications where human oracle checks are expensive, offering a scalable way to obtain prediction sets from black-box generative models with a provable admissibility guarantee.
Abstract
Generative models lack rigorous statistical guarantees for their outputs and are therefore unreliable in safety-critical applications. In this work, we propose Sequential Conformal Prediction for Generative Models (SCOPE-Gen), a sequential conformal prediction method producing prediction sets that satisfy a rigorous statistical guarantee called conformal admissibility control. This guarantee states that with high probability, the prediction sets contain at least one admissible (or valid) example. To this end, our method first samples an initial set of i.i.d. examples from a black-box generative model. Then, this set is iteratively pruned via so-called greedy filters. As a consequence of the iterative generation procedure, admissibility of the final prediction set factorizes as a Markov chain. This factorization is crucial because it allows each factor to be controlled separately using conformal prediction. In comparison to prior work, our method demonstrates a large reduction in the number of admissibility evaluations during calibration. This reduction is important in safety-critical applications, where these evaluations must be conducted manually by domain experts and are therefore costly and time-consuming. We highlight the advantages of our method in terms of admissibility evaluations and cardinality of the prediction sets through experiments in natural language generation and molecular graph extension tasks.
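
To make the role of the factorization concrete, the following minimal sketch shows how a target level $1-\alpha$ could be divided among the stages and how a single stage could be calibrated. It rests on assumptions not stated in the abstract: the level is split equally across stages, and each stage uses the standard split conformal quantile of one-dimensional calibration scores. The function names `per_stage_level` and `conformal_quantile` are hypothetical and are not taken from the authors' implementation.

```python
import math


def per_stage_level(alpha: float, n_stages: int) -> float:
    """Level each stage must reach so that the product over all stages is 1 - alpha.

    Assumption: the target level is split equally across stages.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    return (1.0 - alpha) ** (1.0 / n_stages)


def conformal_quantile(scores, level: float) -> float:
    """Standard split conformal quantile of one-dimensional calibration scores.

    Returns the k-th smallest score with k = ceil((n + 1) * level), or
    infinity when the calibration set is too small to support the level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * level)
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


if __name__ == "__main__":
    # Three stages (generation plus two filters) at an overall level of 0.9.
    level = per_stage_level(alpha=0.1, n_stages=3)
    print(f"per-stage level: {level:.4f}")
    # Hypothetical calibration scores for one stage.
    print(conformal_quantile([0.2, 0.7, 0.1, 0.9, 0.4], level))
```

If every stage meets its per-stage level conditionally on the previous stage having retained an admissible example, the product of the conditional probabilities is at least $1-\alpha$, which is the reasoning the Markov-chain factorization supports. An unequal split across stages would work the same way as long as the product of the per-stage levels equals $1-\alpha$.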
