Shh, don't say that! Domain Certification in LLMs

Cornelius Emde; Alasdair Paren; Preetham Arvind; Maxime Kayser; Tom Rainforth; Thomas Lukasiewicz; Bernard Ghanem; Philip H. S. Torr; Adel Bibi

TL;DR

This work tackles the risk that adversarial prompts push LLM outputs outside a deployer's intended domain by introducing domain certification and the VALID algorithm. It formalizes atomic (per-output) and domain-level guarantees tied to divergences between the deployed model and a domain-specific guide model trained on in-domain data, deriving provable upper bounds on the probability of out-of-domain outputs. VALID performs rejection sampling based on a length-normalized log-likelihood ratio, yielding a model $M_{L,G,k,T}$ that is $\big[2^{kN_{\bm{y}}} T G(\bm{y})\big]$-AC for each output $\bm{y}$ and $\big[\max_{\bm{y}\in\mathbb{F}} 2^{kN_{\bm{y}}} T G(\bm{y})\big]$-DC with respect to the forbidden set $\mathbb{F}$, enabling practical domain protection. Empirical results across Shakespeare, CS News, and MedicalQA demonstrate meaningful certificates and favorable trade-offs between OOD detection and certification, including certified benchmarking on PubMedQA that preserves substantial task performance under provable safety constraints. The approach offers a scalable, provable guardrail for domain-restricted LLM deployments that can be implemented with relatively lightweight domain-specific guide models.
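The rejection-sampling step and the resulting bound can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy per-token distributions `L_PROBS` and `G_PROBS`, the helper names, and the exact acceptance rule (accept a candidate when its length-normalized log2 ratio between the deployed model and the guide is at most $k$) are assumptions chosen to be consistent with the bound $2^{kN_{\bm{y}}} T G(\bm{y})$ stated above. The toy deployed model also ignores the prompt, which a real LLM would not.

```python
import math
import random

# Toy per-token distributions (hypothetical, for illustration only).
L_PROBS = {"dose": 0.4, "drug": 0.4, "poem": 0.2}     # deployed model L
G_PROBS = {"dose": 0.49, "drug": 0.49, "poem": 0.02}  # in-domain guide G


def log2_prob(y, probs):
    """log2-likelihood of a token tuple under an i.i.d. toy model."""
    return sum(math.log2(probs[tok]) for tok in y)


def sample_from_L(prompt, rng, length=2):
    """Sample a fixed-length response from the toy L (the prompt is ignored)."""
    tokens = list(L_PROBS)
    weights = [L_PROBS[t] for t in tokens]
    return tuple(rng.choices(tokens, weights=weights, k=length))


def accepts(y, k):
    """Assumed acceptance rule: length-normalized log2 ratio L/G is at most k."""
    ratio = (log2_prob(y, L_PROBS) - log2_prob(y, G_PROBS)) / len(y)
    return ratio <= k


def valid_sample(prompt, k, T, rng, sampler=sample_from_L):
    """Try up to T candidates from L; return the first accepted one, else abstain."""
    for _ in range(T):
        y = sampler(prompt, rng)
        if accepts(y, k):
            return y
    return None  # abstain: no candidate passed within T attempts


def log2_certificate(y, k, T):
    """log2 of the per-output bound 2^(k * N_y) * T * G(y), computed in log space."""
    return k * len(y) + math.log2(T) + log2_prob(y, G_PROBS)


if __name__ == "__main__":
    rng = random.Random(0)
    print("sampled response:", valid_sample("hypothetical prompt", k=1.0, T=4, rng=rng))
    bound = 2 ** log2_certificate(("poem", "poem"), k=1.0, T=4)
    print("bound for ('poem', 'poem'):", bound)
```

In this toy setting with $k = 1$ and $T = 4$, the out-of-domain response `("poem", "poem")` has $G(\bm{y}) = 0.02^2 = 0.0004$, so its bound is $2^{2} \cdot 4 \cdot 0.0004 = 0.0064$, whatever the prompt. Working in log space avoids underflow when $G(\bm{y})$ is tiny for long responses.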

Abstract

Large language models (LLMs) are often deployed to perform constrained tasks within narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are susceptible to adversarial attacks and may generate outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification: a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID, that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which tightly bound the probability of out-of-domain samples with minimal penalty to refusal behavior.
