Trust The Typical
Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
TL;DR
The paper addresses the brittleness of current LLM safety mechanisms by reframing safety as a problem of typicality: learn the distribution of safe prompts and treat departures from it as potential threats. It introduces Trust The Typical (T3), a Forte-inspired, text-focused OOD framework that embeds each prompt with three encoders to form a multi-view representation, computes per-point PRDC metrics (Precision, Recall, Density, Coverage), and aggregates them into a joint representation $T(y_j) \in \mathbb{R}^{4K}$ (four metrics for each of the $K$ encoders). Anomaly scores are obtained by fitting a density model (GMM or OC-SVM) on safe data only and evaluating $s(y_j) = -\log p_T(T(y_j))$, which enables proactive, domain- and language-agnostic safety without training on harmful examples. Across 18 benchmarks, T3 achieves state-of-the-art AUROC and reduces FPR@95 by up to 40x relative to specialized safety models. It transfers to more than 14 languages without retraining and integrates with vLLM for real-time guardrailing at under 6% overhead. The work also provides a theoretical analysis of the PRDC metrics under in-distribution and out-of-distribution conditions, highlights the importance of careful in-distribution data curation for robust performance, and suggests hybrid approaches for near-boundary cases.
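To make the scoring rule $s(y_j) = -\log p_T(T(y_j))$ concrete, the following is a minimal sketch under strong simplifying assumptions, not the authors' implementation. It skips the encoders and the PRDC computation entirely and uses synthetic 12-dimensional vectors as stand-ins for $T(y_j)$ (4 metrics times 3 encoders). A single diagonal Gaussian replaces the GMM to keep the example dependency-free. The helper names `fit_diag_gaussian`, `anomaly_score`, and `fpr_at_95_tpr` are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def fit_diag_gaussian(rows, eps=1e-6):
    """Fit a diagonal Gaussian to safe feature vectors.

    Assumption: a one-component, diagonal-covariance stand-in for the GMM.
    """
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    variances = [pvariance(c, mu) + eps for c, mu in zip(cols, means)]
    return means, variances


def anomaly_score(x, model):
    """Return s(x) = -log p(x) under the fitted diagonal Gaussian."""
    means, variances = model
    nll = 0.0
    for xi, mu, var in zip(x, means, variances):
        nll += 0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
    return nll


def fpr_at_95_tpr(safe_scores, unsafe_scores):
    """Fraction of safe inputs flagged when at least 95% of unsafe inputs are flagged."""
    ranked = sorted(unsafe_scores, reverse=True)
    k = math.ceil(0.95 * len(ranked))
    threshold = ranked[k - 1]  # flag every input scoring >= threshold
    return sum(s >= threshold for s in safe_scores) / len(safe_scores)


# Synthetic stand-ins for T(y_j); real PRDC features are not Gaussian.
rng = random.Random(0)
DIM = 12  # 4 PRDC metrics x 3 encoders


def sample(center, n):
    return [[rng.gauss(center, 1.0) for _ in range(DIM)] for _ in range(n)]


safe_train = sample(0.0, 500)   # only safe data is used for fitting
safe_test = sample(0.0, 200)
unsafe_test = sample(4.0, 200)  # shifted cluster plays the role of OOD prompts

model = fit_diag_gaussian(safe_train)
safe_scores = [anomaly_score(x, model) for x in safe_test]
unsafe_scores = [anomaly_score(x, model) for x in unsafe_test]
fpr = fpr_at_95_tpr(safe_scores, unsafe_scores)
print(f"FPR@95 on synthetic data: {fpr:.3f}")
```

The point of the sketch is that the model is fitted on safe data only: the unsafe samples appear only at evaluation time, where they receive high scores because they are improbable under the safe distribution. The synthetic clusters are deliberately well separated, so the printed FPR@95 says nothing about performance on real prompts.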
Abstract
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
