Trust The Typical

Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary

TL;DR

The paper tackles the brittleness of current LLM safety by reframing safety as a problem of typicality: learning the distribution of safe prompts and detecting departures as potential threats. It introduces Trust The Typical (T3), a Forte-inspired, text-focused OOD framework that uses three encoders to form a multi-view embedding, computes per-point PRDC metrics (Precision, Recall, Density, Coverage), and aggregates them into a joint representation $T(y_j) \in \mathbb{R}^{4K}$. Anomaly scores are obtained by fitting density models (GMM or OC-SVM) on safe data and evaluating $s(y_j) = -\log p_T(T(y_j))$, enabling proactive, domain- and language-agnostic safety without harmful-training data. Across 18 benchmarks, T3 achieves state-of-the-art AUROC and dramatic reductions in $\text{FPR@}95$ (up to 40x) while transferring effectively to 14+ languages and integrating with vLLM for real-time guardrailing with under 6% overhead, demonstrating strong practical viability. The work provides theoretical analysis of PRDC metrics under in- and out-of-distribution conditions and highlights the importance of careful in-distribution data curation for robust performance, while suggesting hybrid approaches to handle near-boundary cases.
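The pipeline summarized above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the authors' implementation: the three "encoders" are hypothetical coordinate rescalings standing in for real text encoders, the per-point PRDC formulas are simplified k-nearest-neighbor versions that may differ from the paper's exact definitions, and a single regularized Gaussian (a one-component GMM) is swapped in for the density model. All function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    d2 = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(d2, 0.0))

def knn_radius(x, k):
    """Distance from each row of x to its k-th nearest neighbor in x (self excluded)."""
    d = pairwise_dist(x, x)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, k - 1]

def prdc_per_point(ref, test, k=5):
    """Simplified per-point PRDC features (an assumption, not the paper's exact form)."""
    r_ref, r_test = knn_radius(ref, k), knn_radius(test, k)
    d = pairwise_dist(test, ref)                     # shape (n_test, n_ref)
    in_ref_ball = d <= r_ref[None, :]                # test point inside a reference k-NN ball
    in_test_ball = d <= r_test[:, None]              # reference point inside the test k-NN ball
    precision = in_ref_ball.any(axis=1).astype(float)
    recall = in_test_ball.mean(axis=1)
    density = in_ref_ball.sum(axis=1) / k
    coverage = (d.min(axis=1) <= r_test).astype(float)
    return np.stack([precision, recall, density, coverage], axis=1)

def joint_features(ref_raw, test_raw, encoders, k=5):
    """Concatenate the four PRDC features of every view into T(y) with 4K columns."""
    return np.hstack([prdc_per_point(enc(ref_raw), enc(test_raw), k) for enc in encoders])

def fit_gaussian(feats, reg=1e-3):
    """Fit a regularized Gaussian to safe features (stand-in for the GMM)."""
    cov = np.cov(feats, rowvar=False) + reg * np.eye(feats.shape[1])
    return feats.mean(axis=0), cov

def anomaly_score(feats, mu, cov):
    """s(y) = -log p_T(T(y)) under the fitted Gaussian."""
    diff = feats - mu
    maha2 = (diff * np.linalg.solve(cov, diff.T).T).sum(axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (maha2 + logdet + feats.shape[1] * np.log(2.0 * np.pi))

# Hypothetical stand-ins for K = 3 text encoders: fixed per-coordinate rescalings.
d, n = 8, 200
scales = rng.uniform(0.5, 1.5, size=(3, d))
encoders = [lambda x, s=s: x * s for s in scales]

ref = rng.standard_normal((n, d))              # safe reference set
safe_fit = rng.standard_normal((n, d))         # held-out safe data for the density fit
safe_test = rng.standard_normal((n, d))        # unseen safe batch
ood_test = rng.standard_normal((n, d)) + 8.0   # synthetic "atypical" batch

mu, cov = fit_gaussian(joint_features(ref, safe_fit, encoders))
s_safe = anomaly_score(joint_features(ref, safe_test, encoders), mu, cov)
s_ood = anomaly_score(joint_features(ref, ood_test, encoders), mu, cov)
print("mean score, safe batch:", s_safe.mean())
print("mean score, OOD batch: ", s_ood.mean())
```

Only safe data is used to fit the density model; the shifted batch is seen only at scoring time, which mirrors the claim that no harmful examples are needed for training.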

Abstract

Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
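The false positive rates mentioned above are typically reported at a fixed true positive rate. As a rough sketch of that metric, assuming harmful prompts are the positive class and that a higher anomaly score means "more atypical" (conventions differ between papers, and the helper name `fpr_at_tpr` is illustrative), FPR@95 can be computed as follows on synthetic scores:

```python
import numpy as np

def fpr_at_tpr(safe_scores, harmful_scores, tpr=0.95):
    """Fraction of safe inputs flagged when the threshold catches `tpr` of harmful inputs."""
    # Threshold below which only (1 - tpr) of harmful scores fall.
    threshold = np.quantile(harmful_scores, 1.0 - tpr)
    return float(np.mean(safe_scores >= threshold))

rng = np.random.default_rng(0)
safe_scores = rng.normal(0.0, 1.0, 20000)     # synthetic scores, not paper data
harmful_scores = rng.normal(3.0, 1.0, 20000)  # synthetic scores, not paper data

fpr95 = fpr_at_tpr(safe_scores, harmful_scores)
print("FPR@95 on synthetic scores:", fpr95)
```

A lower value means fewer benign prompts are wrongly refused at the same detection rate, which is the quantity behind the over-refusal comparison.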

Paper Structure

This paper contains 42 sections, 5 theorems, 22 equations, 4 figures, 10 tables.

Key Result

Theorem 3.1

When test and reference samples are drawn from the same distribution:

Figures (4)

  • Figure 1: Geometric concentration of safe text embeddings in high-dimensional space. The distribution of Euclidean distances from the mean for 10,000 safe embeddings (Alpaca, d=1024) empirically validates the concentration of measure phenomenon. (a, d) The distances closely follow a theoretical $\chi_{1024}$ distribution, confirmed by a Q-Q plot ($R^2 > 0.99$). (b, c, f) This results in a concentrated "typical set" where 90% of data forms an annulus ("hollow sphere") around the mean, a structure visible even in 2D PCA projections. (e) As predicted by theory, this concentration tightens relative to the dimension ($O(d^{-1/2})$).
  • Figure 2: Distinguishing safe vs. toxic text using geometric typicality. This figure compares simple Euclidean and Mahalanobis distances for separating 10,000 safe and 2,000 toxic embeddings. (a, b) Mahalanobis distance, which accounts for the safe data's covariance, provides far better separation between safe (green) and toxic (red) distributions. (c, d) This superiority is quantified by a significantly higher ROC AUC (0.944 vs. 0.733) and confirmed by box plots. (e) A 2D PCA projection visually confirms that toxic samples fall predominantly outside the 95% typical set boundary of safe data.
  • Figure 3: T3 is highly sample-efficient, avoiding the cold start problem. T3's detection performance (AUROC) rapidly converges to $\approx90\%$ with as few as 1000 in-distribution training samples, demonstrating its ability to learn the manifold of safe usage from a small, curated dataset.
  • Figure 4: NVIDIA Nsight Systems profiling of vLLM baseline vs. vLLM+T3. (a) Full execution timeline comparison. (b) Zoomed-in view showing kernel concurrency and reduced GPU bubbles in vLLM+T3. (c) Conceptual illustration of overlapping inference kernels (Worker Processes) with T3 prediction kernels (Main Process). The integration reduces idle GPU periods between consecutive generations, improving utilization while preserving low-latency inference.
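The geometric ideas behind Figures 1 and 2 can be illustrated with a small synthetic experiment. This is a sketch under stated assumptions, not a reproduction of the figures: the "embeddings" are Gaussian samples rather than Alpaca or toxic text embeddings, the "toxic" set is simply shifted along low-variance directions, and the AUROC helper uses the rank-sum (Mann-Whitney) identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Part 1 (cf. Figure 1): norms of isotropic Gaussian vectors concentrate near sqrt(d),
# so the relative spread std/mean shrinks like O(d^{-1/2}).
def relative_spread(d, n=2000):
    norms = np.linalg.norm(rng.standard_normal((n, d)), axis=1)
    return norms.std() / norms.mean()

spread_16, spread_1024 = relative_spread(16), relative_spread(1024)
print("relative spread, d=16:  ", spread_16)
print("relative spread, d=1024:", spread_1024)

# Part 2 (cf. Figure 2): Euclidean vs. Mahalanobis distance as an anomaly score.
def auroc(neg, pos):
    """AUROC via the rank-sum identity (assumes no tied scores)."""
    scores = np.concatenate([neg, pos])
    ranks = scores.argsort().argsort() + 1
    n_neg, n_pos = len(neg), len(pos)
    return (ranks[n_neg:].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

d, n_safe, n_toxic = 32, 5000, 1000
# Anisotropic "safe" data: 16 high-variance and 16 low-variance directions.
std = np.concatenate([np.full(16, 3.0), np.full(16, 0.2)])
safe = rng.standard_normal((n_safe, d)) * std
# Synthetic "toxic" data: same shape, shifted only along the low-variance directions.
shift = np.concatenate([np.zeros(16), np.ones(16)])
toxic = rng.standard_normal((n_toxic, d)) * std + shift

mu = safe.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(safe, rowvar=False))

def euclidean(x):
    return np.linalg.norm(x - mu, axis=1)

def mahalanobis(x):
    diff = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

auc_eucl = auroc(euclidean(safe), euclidean(toxic))
auc_maha = auroc(mahalanobis(safe), mahalanobis(toxic))
print("AUROC, Euclidean:  ", auc_eucl)
print("AUROC, Mahalanobis:", auc_maha)
```

In this toy setup the shift is small in Euclidean terms but large relative to the safe data's variance in those directions, which is the situation where accounting for covariance helps most.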

Theorems & Definitions (9)

  • Theorem 3.1: Expected Values under the null hypothesis
  • Theorem A.1
  • proof
  • Theorem A.2
  • proof
  • Definition A.3: Schilling's $T_{k,N}$ Statistic
  • Theorem A.4: Asymptotics of $T_{k,N}$ (Schilling, 1986, Thm. 3.1 and 3.4)
  • Lemma A.5
  • proof