Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations

Swapnaja Achintalwar; Adriana Alvarado Garcia; Ateret Anaby-Tavor; Ioana Baldini; Sara E. Berger; Bishwaranjan Bhattacharjee; Djallel Bouneffouf; Subhajit Chaudhury; Pin-Yu Chen; Lamogha Chiazor; Elizabeth M. Daly; Kirushikesh DB; Rogério Abreu de Paula; Pierre Dognin; Eitan Farchi; Soumya Ghosh; Michael Hind; Raya Horesh; George Kour; Ja Young Lee; Nishtha Madaan; Sameep Mehta; Erik Miehling; Keerthiram Murugesan; Manish Nagireddy; Inkit Padhi; David Piorkowski; Ambrish Rawat; Orna Raz; Prasanna Sattigeri; Hendrik Strobelt; Sarathkrishna Swaminathan; Christoph Tillmann; Aashka Trivedi; Kush R. Varshney; Dennis Wei; Shalisha Witherspoon; Marcel Zalmanovici

TL;DR

The paper addresses the challenge of safely deploying LLMs by developing compact auxiliary detectors that label harms in prompts and outputs. It details an end-to-end pipeline (taxonomy, synthetic data augmentation, real-world evaluation, human-in-the-loop interfaces, and uncertainty calibration) for building robust detectors that function across the LLM lifecycle. Detectors are presented as multipurpose tools for guardrails, evaluation benchmarks, RLHF alignment, data filtering, and governance. The authors also discuss inherent societal and methodological challenges, including stigma, context, and annotation bias, and propose future directions such as multi-turn detection and better interpretability.

Abstract

Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
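To make the guardrail use concrete, the following is a minimal sketch of a detector wrapped around a generation call. It is not the authors' implementation: `toy_detector`, `guarded_generate`, the blocklist, the 0.2 threshold, and the refusal string are all hypothetical stand-ins, and a real detector would be a trained classifier rather than a word list.

```python
from typing import Callable

# Hypothetical detector signature: text in, harm score in [0, 1] out.
Detector = Callable[[str], float]


def toy_detector(text: str) -> float:
    """Stand-in for a trained classifier: fraction of words on a toy blocklist."""
    blocklist = {"idiot", "stupid"}  # illustrative only
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(word.strip(".,!?") in blocklist for word in words)
    return hits / len(words)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    detector: Detector,
    threshold: float = 0.2,  # assumed cutoff, not taken from the paper
    refusal: str = "[blocked by guardrail]",
) -> str:
    """Screen the prompt, call the model, then screen the response."""
    if detector(prompt) >= threshold:
        return refusal  # the model is never called on a flagged prompt
    response = generate(prompt)
    if detector(response) >= threshold:
        return refusal
    return response
```

Because the detector sits outside the model, the same wrapper applies to any `generate` callable, including one that only has API access to the underlying LLM.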

Paper Structure (31 sections, 5 figures, 2 tables)

This paper contains 31 sections, 5 figures, 2 tables.

Introduction
Development of the Detectors
Use of synthetic data generation
Evaluating detectors on real-world data
Interface design for human input
Reliable uncertainties
Uses of Detectors
Guardrails
Red-Teaming
Evaluation
Reliability and Efficiency
Automated Benchmarking
Other aspects of LLM governance
Inherent Challenges
A closer look into the stigma detector
...and 16 more sections

Figures (5)

Figure 1: The role of the detectors in the LLM life-cycle. Apart from acting as guardrails, the evaluation provided by the detectors is used to refine both the pre-processing (including data curation) and tuning steps (including fine-tuning, reprogramming, prompt-tuning, and post-processing).
Figure 2: Red Teaming + Guardrails UI (see full figure in Appendix \ref{appendix:UI}, Figure \ref{fig:sys-full}). A user interface that encourages interactive probing of both generative models and the detectors themselves. More details in \ref{sec:system}.
Figure 3: Examples of synthetic data with associated questions, gaps, and assumptions.
Figure 4: Various detector modes. In the single-turn setting, detectors can either monitor the (a) prompt, (b) response, or (c) the prompt and response. The multi-turn setting (d) describes monitoring of a given response subject to the context provided by the history of prompts and past responses.
Figure 5: Red Teaming + Guardrails UI: A user interface that encourages interactive probing of both generative models and the detectors themselves. More details in \ref{sec:system}.
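
The four detector modes in the Figure 4 caption differ only in which text the detector sees. A rough sketch of that input construction follows; the function name `build_detector_input`, the mode strings, and the "User:"/"Assistant:" formatting are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Optional, Tuple

Turn = Tuple[str, str]  # one past (prompt, response) pair


def build_detector_input(
    mode: str,
    prompt: str = "",
    response: str = "",
    history: Optional[List[Turn]] = None,
) -> str:
    """Assemble the text a detector would score under each mode."""
    if mode == "prompt":  # (a) monitor the prompt only
        return prompt
    if mode == "response":  # (b) monitor the response only
        return response
    if mode == "prompt_response":  # (c) monitor prompt and response together
        return f"User: {prompt}\nAssistant: {response}"
    if mode == "multi_turn":  # (d) response in the context of the history
        lines = []
        for past_prompt, past_response in history or []:
            lines.append(f"User: {past_prompt}")
            lines.append(f"Assistant: {past_response}")
        lines.append(f"User: {prompt}")
        lines.append(f"Assistant: {response}")
        return "\n".join(lines)
    raise ValueError(f"unknown mode: {mode}")
```

In the multi-turn case the detector still judges the latest response; the history is included only as context, which is why it is prepended rather than scored turn by turn.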
