Consensus Sampling for Safer Generative AI
Adam Tauman Kalai, Yael Tauman Kalai, Or Zamir
TL;DR
This work introduces consensus sampling, a model-aggregation approach in which the safety of the aggregated output derives from agreement among $k$ models, of which an unknown subset of size $s$ is assumed safe. The aggregate outputs a sample $y$ only when there is sufficient agreement among the models and abstains otherwise, so abstention acts as a safety brake. The paper gives an efficient sampling algorithm with risk guarantees: the output risk is at most $R$ times the average risk of the safest $s$ models, and the abstention probability decays exponentially with $R$ under reasonable overlap conditions. The framework builds on information-theoretic and cryptographic ideas, introduces $R$-robustness, shows that the sampling distribution $q^*$ is near-optimal with respect to the tradeoff between safety and abstention, and bounds information leakage by $O(\log(R+1))$ bits. The approach does not replace safety training or supervision; it provides a provable, model-agnostic safety layer that uses output probabilities and overlap to mitigate risks that are hard to detect by inspection alone. Limitations include dependence on multiple safe models with sufficient overlap, potential information leakage across repeated uses, and the need to address broader societal harms beyond unsafe outputs. Future work includes increasing overlap, canonical distributions, and integrating the mechanism into broader safety pipelines.
Abstract
Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models, where $s$ is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
