Table of Contents
Fetching ...

Legilimens: Practical and Unified Content Moderation for Large Language Model Services

Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, Wenyuan Xu

TL;DR

It is revealed for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation.

Abstract

Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.

Legilimens: Practical and Unified Content Moderation for Large Language Model Services

TL;DR

It is revealed for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation.

Abstract

Given the societal impact of unsafe content generated by large language models (LLMs), ensuring that LLM services comply with safety standards is a crucial concern for LLM service providers. Common content moderation methods are limited by an effectiveness-and-efficiency dilemma, where simple models are fragile while sophisticated models consume excessive computational resources. In this paper, we reveal for the first time that effective and efficient content moderation can be achieved by extracting conceptual features from chat-oriented LLMs, despite their initial fine-tuning for conversation rather than content moderation. We propose a practical and unified content moderation framework for LLM services, named Legilimens, which features both effectiveness and efficiency. Our red-team model-based data augmentation enhances the robustness of Legilimens against state-of-the-art jailbreaking. Additionally, we develop a framework to theoretically analyze the cost-effectiveness of Legilimens compared to other methods. We have conducted extensive experiments on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify the effectiveness, efficiency, and robustness of Legilimens against normal and adaptive adversaries. A comparison of Legilimens with both commercial and academic baselines demonstrates the superior performance of Legilimens. Furthermore, we confirm that Legilimens can be applied to few-shot scenarios and extended to multi-label classification tasks.
Paper Structure (54 sections, 8 equations, 4 figures, 24 tables)

This paper contains 54 sections, 8 equations, 4 figures, 24 tables.

Figures (4)

  • Figure 1: Legilimens is a unified content moderation framework for almost all LLM services, which features both effectiveness and efficiency. By monitoring conceptual features generated in the regular inference process of LLMs, Legilimens achieves lightweight moderation.
  • Figure 2: Design of Legilimens. Legilimens utilizes the conceptual features during the decoding of the first token to accomplish $I$-moderation, and the conceptual features during the decoding of the last token to achieve $O$- and $IO$-moderation.
  • Figure 3: Performance in few-shot scenarios ($IO$-Moderation). Legilimens performs well even when only 1,000 samples are available in the training phase.
  • Figure 4: The time complexity of Legilimens compared with three baselines. Legilimens exhibits a constant complexity of $O(1)$, while the other methods exhibits an approximately linear complexity of $O(n)$.