Safeguarding Large Language Models: A Survey

Yi Dong; Ronghui Mu; Yanghao Zhang; Siqi Sun; Tianle Zhang; Changshun Wu; Gaojie Jin; Yi Qi; Jinwei Hu; Jie Meng; Saddek Bensalem; Xiaowei Huang

TL;DR

The paper surveys safety mechanisms for large language models, covering guardrail frameworks, evaluation methods, and attacks and defenses. It provides a taxonomy of guardrail properties (hallucination, fairness, privacy, robustness, toxicity, legality, OOD, uncertainty) and reviews a suite of design tools (Llama Guard, NeMo Guardrails, Guardrails AI, TruLens, Guidance AI, LMQL) and Python packages. It also surveys white-, black-, and gray-box jailbreaks, RAG-based risks, and both detection- and mitigation-based defenses, arguing for a cohesive guardrail design. The authors advocate a multidisciplinary, neural-symbolic approach implemented via a rigorous SDLC, and they discuss safeguards for autonomous LLM agents as a critical future direction. Overall, the work highlights the complexity of safeguarding LLMs and outlines practical pathways toward a comprehensive, auditable, and domain-aware guardrail ecosystem.

Abstract

In the burgeoning field of Large Language Models (LLMs), developing a robust safety mechanism, colloquially known as "safeguards" or "guardrails", has become imperative to ensure the ethical use of LLMs within prescribed boundaries. This article provides a systematic literature review on the current status of this critical mechanism. It discusses its major challenges and how it can be enhanced into a comprehensive mechanism dealing with ethical issues in various contexts. First, the paper elucidates the current landscape of safeguarding mechanisms that major LLM service providers and the open-source community employ. This is followed by techniques to evaluate, analyze, and enhance the (un)desirable properties that a guardrail might want to enforce, such as hallucinations, fairness, and privacy. Building on these, we review techniques to circumvent these controls (i.e., attacks), to defend against these attacks, and to reinforce the guardrails. While the techniques mentioned above represent the current status and the active research trends, we also discuss several challenges that these methods cannot easily address, and we present our vision of how to implement a comprehensive guardrail through full consideration of a multidisciplinary approach, neural-symbolic methods, and the systems development lifecycle.
