Shh, don't say that! Domain Certification in LLMs

Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H. S. Torr, Adel Bibi

TL;DR

This work tackles the risk of LLM outputs venturing outside a deployer’s intended domain under adversarial prompts by introducing domain certification and the VALID algorithm. It formalizes atomic and domain-level guarantees tied to divergences between the deployed model and a domain-specific guide model trained on in-domain data, deriving provable upper bounds on the probability of out-of-domain outputs. VALID performs rejection sampling based on a length-normalized log-likelihood ratio, yielding a model $M_{L,G,k,T}$ that is $\big[2^{kN_{\bm{y}}} TG({\bm{y}})\big]$-AC and $\big[\max_{\bm{y}\in {\mathbb{F}}} 2^{kN_{\bm{y}}}TG({\bm{y}})\big]$-DC with respect to the forbidden set ${\mathbb{F}}$, enabling practical domain protection. Empirical results across Shakespeare, CS News, and MedicalQA demonstrate meaningful certificates and favorable trade-offs between OOD detection and certification, including certified benchmarking on PubMedQA that preserves substantial task performance under provable safety constraints. The approach offers a scalable, provable guardrail for domain-restricted LLM deployments that can be implemented with relatively lightweight domain-specific guide models.
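
As a rough illustration of how the atomic bound scales, the worked calculation below plugs hypothetical values into $2^{kN_{\bm{y}}} TG({\bm{y}})$. The numbers ($k = 2$, $N_{\bm{y}} = 10$, $T = 4$, $G({\bm{y}}) = 2^{-60}$) are assumptions chosen for easy arithmetic, not values reported in the paper.

```latex
\begin{align*}
2^{k N_{\bm{y}}} \, T \, G({\bm{y}})
  &= 2^{2 \cdot 10} \cdot 4 \cdot 2^{-60} \\
  &= 2^{20 + 2 - 60} \\
  &= 2^{-38} \approx 3.6 \times 10^{-12}.
\end{align*}
```

Under these assumed values, the certificate is small only because the guide model assigns the response a very low likelihood. A larger threshold $k$, a longer response, or more iterations $T$ each loosen the bound.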

Abstract

Large language models (LLMs) are often deployed to perform constrained tasks with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification, a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID, that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior.

Paper Structure

This paper contains 43 sections, 6 theorems, 28 equations, 46 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $L$ be an LLM and $G$ a guide model as described above. Rejection sampling as described in Algorithm 1 with rejection threshold $k$ and up to $T$ iterations defines model $M_{L,G,k,T}$ with $M_{L,G,k,T}({\bm{y}}|{\bm{x}})$ denoting the likelihood of ${\bm{y}}$ given ${\bm{x}}$. Let $N_{{\bm{y}}}$ be the length of ${\bm{y}}$. [...] Hence, $M_{L,G,k,T}$ is $[2^{kN_{{\bm{y}}}}TG({\bm{y}})]$-AC and, further, it is $[\max_{{\bm{y}} \in {\mathbb{F}}} 2^{kN_{{\bm{y}}}}TG({\bm{y}})]$-DC with respect to the forbidden set ${\mathbb{F}}$.
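
The rejection-sampling scheme behind this bound can be sketched in a few lines of Python. This is a minimal toy, not the authors' implementation. The four-token vocabulary, the uniform stand-in for $L$, the per-token guide probabilities, and the function names are all hypothetical. The acceptance rule (accept when $\log_2 L({\bm{y}}|{\bm{x}}) - \log_2 G({\bm{y}}) \le k N_{{\bm{y}}}$) is inferred from the form of the bound above and is an assumption about the algorithm's details.

```python
import math
import random

# Toy setup (all assumptions): tokens "a"/"b" are in-domain, "c"/"d" are not.
VOCAB = ["a", "b", "c", "d"]
G_TOKEN_P = {"a": 0.45, "b": 0.45, "c": 0.05, "d": 0.05}  # toy guide model G


def sample_from_L(x, rng, length=3):
    """Toy stand-in for L: ignores the prompt and samples tokens uniformly."""
    return tuple(rng.choice(VOCAB) for _ in range(length))


def log2_L(y, x):
    """log2 L(y|x) for the uniform toy model."""
    return len(y) * math.log2(1 / len(VOCAB))


def log2_G(y):
    """log2 G(y) for the toy guide model (independent tokens, no prompt)."""
    return sum(math.log2(G_TOKEN_P[t]) for t in y)


def valid_sample(x, k, T, rng):
    """Try up to T samples from L; return the first accepted one, else None."""
    for _ in range(T):
        y = sample_from_L(x, rng)
        log_ratio = log2_L(y, x) - log2_G(y)
        # Length-normalized test: the threshold scales with the length of y.
        if log_ratio <= k * len(y):
            return y
    return None  # abstain after T rejections


def log2_certificate(y, k, T):
    """log2 of the atomic bound 2^(k * N_y) * T * G(y)."""
    return k * len(y) + math.log2(T) + log2_G(y)


if __name__ == "__main__":
    rng = random.Random(0)
    response = valid_sample("some prompt", k=0.0, T=4, rng=rng)
    print("response:", response)
    print("log2 bound for ('c','c','c'):", log2_certificate(("c", "c", "c"), 0.0, 4))
```

In this toy, with $k = 0$ any response containing an out-of-domain token has a positive log-ratio and is rejected, so the sampler either returns an all-in-domain sequence or abstains. In practice both likelihoods would come from real language models, and $k$ and $T$ would be tuned to trade certificate tightness against refusals on in-domain prompts.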

Figures (46)

  • Figure 1: A user misappropriating an LLM system using an adversarial attack. We provide certificates to mitigate this risk.
  • Figure 2: Log likelihood ratios scale in the sequence length $N_{{\bm{y}}}$. Six artificial examples of sentences with length 1 to 10 are shown for the ID and OOD dataset. As log ratios scale, so should the decision boundary.
  • Figure 3:
  • Figure 4:
  • Figure 5:
  • ...and 41 more figures

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Theorem 1: Certificate
  • Theorem 1: Certificate
  • Lemma 1: Equivalence of Divergence
  • Lemma 2: Likelihood of $M$
  • Remark 1: Estimating likelihood
  • Lemma 3: Expected number of iterations in
  • Remark 2
  • Lemma 4: Adversary under Rejection Sampling