Consensus Sampling for Safer Generative AI

Adam Tauman Kalai, Yael Tauman Kalai, Or Zamir

TL;DR

This work introduces consensus sampling, a model-aggregation approach in which the aggregated output inherits its safety from agreement across a set of $k$ models that contains an unknown safe subset of size $s$. It outputs a sample $y$ only when the safe models overlap sufficiently and abstains otherwise. It provides an efficient sampling algorithm with risk guarantees: the output risk is at most $R$ times the average risk of the safest $s$ models, and the abstention probability decays exponentially with $R$ under reasonable overlap conditions; the abstention mechanism serves as a safety brake when overlap is insufficient. The framework builds on information-theoretic ideas and cryptographic perspectives, introducing $R$-robustness and showing near-optimality of the sampling distribution $q^*$ with respect to the tradeoff between safety and abstention, while bounding information leakage to $O(\log(R+1))$ bits. The approach does not replace safety training or supervision but provides a provable, model-agnostic safety layer that leverages probabilities and overlap to mitigate risks that are hard to detect by inspection alone. Limitations include dependence on multiple safe models with sufficient overlap, potential information leakage across repeated uses, and the need to address broader societal harms beyond unsafe outputs; future work points to increasing overlap, canonical distributions, and integrating this mechanism into broader safety pipelines.

Abstract

Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models, where $s$ is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
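The following is a minimal Python sketch of one way such a sampler could look; it is an illustration under stated assumptions, not the paper's algorithm as written. The acceptance rule is an assumption: in each of at most $R$ rounds, a candidate $y$ is drawn from the uniform mixture of the $k$ models and accepted with probability equal to the mean of the $s$ smallest values $p_i(y)$ divided by the mean of all $k$ values. Under this rule each round emits $y$ with probability equal to the mean of the $s$ smallest $p_i(y)$, which is at most the average of $p_i(y)$ over any $s$ models, so a union bound over rounds gives risk at most $R$ times the average risk of the safest $s$ models. The names `consensus_sample`, `dists`, and `Distribution` are hypothetical, and distributions are represented as small explicit dictionaries rather than real generative models.

```python
import random
from typing import Dict, Hashable, List, Optional

# Hypothetical representation: maps each output y to its probability p_i(y).
Distribution = Dict[Hashable, float]


def consensus_sample(
    dists: List[Distribution],
    s: int,
    R: int,
    rng: Optional[random.Random] = None,
) -> Optional[Hashable]:
    """Return an output y, or None to abstain.

    ASSUMPTION (not quoted from the source): accept a candidate y with
    probability (mean of the s smallest p_i(y)) / (mean of all p_i(y)).
    """
    rng = rng or random.Random()
    k = len(dists)
    if not 1 <= s <= k:
        raise ValueError("s must satisfy 1 <= s <= k")

    for _ in range(R):
        # Draw y from the uniform mixture: pick a model, then sample from it.
        p = rng.choice(dists)
        support = [y for y, prob in p.items() if prob > 0.0]
        y = rng.choices(support, weights=[p[o] for o in support], k=1)[0]

        # Probabilities that every model assigns to y, in ascending order.
        probs = sorted(d.get(y, 0.0) for d in dists)
        low_mean = sum(probs[:s]) / s   # mean of the s smallest values
        all_mean = sum(probs) / k       # positive, since p[y] > 0

        # low_mean <= all_mean, so the ratio is a valid probability.
        if rng.random() < low_mean / all_mean:
            return y

    # No candidate was accepted within R rounds: abstain.
    return None
```

With two agreeing models and one adversarial model, for example `consensus_sample([{"ok": 1.0}, {"ok": 1.0}, {"bad": 1.0}], s=2, R=5)`, the output `"bad"` is always rejected in this sketch because fewer than two models assign it any probability. With three pairwise disjoint distributions the sketch always abstains.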

Paper Structure

This paper contains 28 sections, 12 theorems, 38 equations, 2 figures, 2 algorithms.

Key Result

Theorem 4.1

Let $S\subseteq [k]$ be of size $|S|=s>k/2$ with $\Delta(S)>0$. For any set of "unsafe" outputs $U\subseteq Y$ and any $p_1,p_2,\ldots, p_k \in \mathsf{Distr}\!\left(Y\right)$,

Figures (2)

  • Figure 1: Stylized examples of risks that may be provably hard to detect or remove from outputs alone: (a) steganographic encoding that could facilitate unauthorized information transfer; (b) security vulnerabilities embedded in generated code. Using output probabilities from multiple models may help mitigate some of these risks.
  • Figure 2: Suppose an adversarial model has a distribution uniform over unsafe responses, shown in red, while safe distributions are shown in silver. Left: with sufficient overlap among safe distributions, consensus sampling returns a point from the overlap region, which is mostly safe. Right: with no overlap between safe distributions, the algorithm abstains.

Theorems & Definitions (31)

  • Theorem 4.1: Median distribution safety
  • proof
  • Lemma 5.1: Efficiency
  • proof
  • Lemma 5.2
  • proof
  • Definition 6.1: Consensus robustness
  • Theorem 6.2: Consensus robustness
  • proof
  • Corollary 6.3: Adversarial robustness
  • ...and 21 more