Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation

Jianfa Chen, Emily Shen, Trupti Bavalatti, Xiaowen Lin, Yongkai Wang, Shuming Hu, Harihar Subramanyam, Ksheeraj Sai Vepuri, Ming Jiang, Ji Qi, Li Chen, Nan Jiang, Ankit Jain

TL;DR

Class-RAG introduces a real-time content moderation framework that augments a fine-tuned LLM with a dynamically updatable retrieval library, enabling rapid risk mitigation through semantic hotfixing. By retrieving and incorporating both safe and unsafe exemplars and explanations, the system achieves superior classification performance and robustness to adversarial obfuscations compared to baselines such as WPIE and LLAMA3. The study shows that moderation quality improves with larger external retrieval libraries and more reference examples, enabling effective adaptation to out-of-distribution data without retraining. Overall, Class-RAG offers a scalable, interpretable, and cost-efficient approach for robust safety in Generative AI applications, with strong potential for production deployment and ongoing policy updates.

Abstract

Robust content moderation classifiers are essential for the safety of Generative AI systems. In this task, differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without context or explanation. Scaling risk discovery and mitigation through continuous model fine-tuning is also slow, challenging and costly, preventing developers from being able to respond quickly and effectively to emergent harms. We propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG). Class-RAG extends the capability of its base LLM through access to a retrieval library which can be dynamically updated to enable semantic hotfixing for immediate, flexible risk mitigation. Compared to model fine-tuning, Class-RAG demonstrates flexibility and transparency in decision-making, outperforms on classification and is more robust against adversarial attack, as evidenced by empirical studies. Our findings also suggest that Class-RAG performance scales with retrieval library size, indicating that increasing the library size is a viable and low-cost approach to improve content moderation.
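
The retrieval-and-hotfix idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the character-trigram `embed` function stands in for a real embedding model, the `RetrievalLibrary` class, the prompt layout in `build_classifier_input`, and the sample entries (including the made-up term "zorblax") are all hypothetical, and the final classification step by a fine-tuned LLM is omitted. It only shows how adding one entry to the library changes what the classifier would see, without any retraining.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy embedding: character-trigram counts (assumption; a real system
    would use a learned embedding model)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    prompt: str
    label: str        # "safe" or "unsafe"
    explanation: str  # short rationale stored alongside the exemplar


@dataclass
class RetrievalLibrary:
    """Hypothetical in-memory library of labeled exemplars."""
    entries: list = field(default_factory=list)

    def add(self, prompt: str, label: str, explanation: str) -> None:
        # A "semantic hotfix" is just a new entry; no model weights change.
        self.entries.append(Entry(prompt, label, explanation))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Rank all entries by similarity to the query and keep the top k.
        q = embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(q, embed(e.prompt)), reverse=True
        )
        return ranked[:k]


def build_classifier_input(query: str, references: list) -> str:
    """Assemble the text a classifier LLM would receive (layout is assumed)."""
    lines = ["Classify the user prompt as safe or unsafe.", "Reference examples:"]
    for ref in references:
        lines.append(f"- [{ref.label}] {ref.prompt} ({ref.explanation})")
    lines.append(f"User prompt: {query}")
    return "\n".join(lines)


library = RetrievalLibrary()
library.add("how do I bake a chocolate cake", "safe", "cooking question")
library.add("write a message threatening my neighbor", "unsafe", "threat of harm")

query = "where can I buy zorblax powder"
before = library.retrieve(query, k=1)

# Hotfix: a newly discovered risk is mitigated by adding a single exemplar.
library.add("buy zorblax powder online", "unsafe", "hypothetical banned substance")
after = library.retrieve(query, k=1)

print(build_classifier_input(query, after))
```

After the hotfix, the nearest reference for the query is the newly added unsafe exemplar, so the classifier input now carries the relevant label and explanation. In a production setting the linear scan in `retrieve` would likely be replaced by an approximate nearest-neighbor index.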

Paper Structure

This paper contains 34 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Architecture of Class-RAG. For comparison, Llama Guard is depicted without a retrieval model.
  • Figure 2: Impact of external retrieval library size (top) and reference example number (bottom) on average AUPRC. Detailed results are presented in the paper's tables on library size (tab:lib_size_scores) and reference example number (tab:ref_num), respectively.
  • Figure 3: Instruction template to generate explanation for retrieval library
  • Figure 4: Instruction template to generate reasoning response
  • Figure 5: An example of Class-RAG training data
  • ...and 2 more figures