A Framework for Real-time Safeguarding the Text Generation of Large Language Model
Ximing Dong, Dayi Lin, Shaowei Wang, Ahmed E. Hassan
TL;DR
This paper addresses the risk of harmful outputs from large language models by introducing LLMSafeGuard, a lightweight real-time safeguarding framework that integrates a similarity-based external validator into decoding and uses a context-wise timing strategy to validate candidate tokens only when needed. Unlike prior methods that require training external discriminators and tightly coupled control models, LLMSafeGuard relies on demonstration examples to define unsafe content and filters candidates by their cosine similarity to those examples, which makes new constraints easy to add. Empirical results on detoxification and copyright safeguarding show a reduction in toxic output of at least 38.6% and, through context-wise timing selection, a reduction in inference time of at least 24.2%, while preserving linguistic quality. The work demonstrates a practical, flexible approach to real-time text safeguarding with tunable parameters and open resources for further research and deployment.
Abstract
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. Existing methods have limitations, including the need to train specific control models and to intervene proactively during text generation, which lead to quality degradation and increased computational overhead. To mitigate these limitations, we propose LLMSafeGuard, a lightweight real-time framework that integrates an external validator into decoding, rejecting unsafe outputs while allowing valid ones. We introduce a similarity-based validation approach, which simplifies the introduction of new constraints and eliminates the need to train a control model. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening in the LLM's generation only when necessary. We evaluate LLMSafeGuard on detoxification and copyright safeguarding, demonstrating its superiority over SOTA baselines. In detoxification, LLMSafeGuard reduces toxic output by at least 38.6% while preserving linguistic quality. Additionally, its context-wise timing selection cuts inference time by at least 24.2% without compromising effectiveness.
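To make the similarity-based validation idea concrete, the following is a minimal illustrative sketch, not the LLMSafeGuard implementation. It assumes a toy bag-of-words embedding in place of a real sentence encoder, and the names `embed`, `cosine_similarity`, `filter_candidates`, and the `threshold` parameter are hypothetical. It shows only the filtering step: a candidate continuation is rejected when its cosine similarity to any unsafe demonstration example reaches the threshold. The context-wise timing strategy is not modeled here.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(candidates, unsafe_demos, threshold=0.7):
    """Keep candidates whose highest similarity to any unsafe demo is below threshold.

    `threshold` is a hypothetical tunable parameter, not a value from the paper.
    """
    demo_vectors = [embed(demo) for demo in unsafe_demos]
    kept = []
    for candidate in candidates:
        vector = embed(candidate)
        # Score a candidate by its closest unsafe demonstration example.
        score = max(
            (cosine_similarity(vector, demo) for demo in demo_vectors),
            default=0.0,
        )
        if score < threshold:
            kept.append(candidate)
    return kept
```

In this sketch, adding a new constraint amounts to appending demonstration examples to `unsafe_demos`; no discriminator or control model is trained. During decoding, a function like `filter_candidates` would be applied to the candidate continuations at the steps where validation is triggered.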
