AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts

Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien

TL;DR

This work tackles the safety risks of large language models by proposing a comprehensive online adaptive moderation framework, Aegis. It introduces a broad 13+9-category safety taxonomy, a large human-annotated dataset (AegisSafetyDataset), and a diverse ensemble of safety experts (AegisSafetyExperts) trained with LoRA on the dataset. The core novelty lies in deploying a no-regret online adaptation framework (Aegis) that dynamically selects among safety experts, improving robustness to distribution shifts and jailbreak attempts. Empirical results show strong cross-dataset performance, jailbreak resilience, and compatibility with alignment objectives, with plans to release datasets and models for community research and deployment guidance.

Abstract

As Large Language Models (LLMs) and generative AI become more widespread, the content safety risks associated with their use also increase. We find a notable deficiency in high-quality content safety datasets and benchmarks that comprehensively cover a wide range of critical safety areas. To address this, we define a broad content safety risk taxonomy, comprising 13 critical risk and 9 sparse risk categories. Additionally, we curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances, complete with human annotations adhering to the taxonomy. We plan to release this dataset to the community to further research and to help benchmark LLMs for safety. To demonstrate the effectiveness of the dataset, we instruction-tune multiple LLM-based safety models. We show that our models (named AEGISSAFETYEXPERTS) not only surpass or perform competitively with the state-of-the-art LLM-based safety models and general-purpose LLMs, but also exhibit robustness across multiple jailbreak attack categories. We also show that using AEGISSAFETYDATASET during the LLM alignment phase does not negatively impact the performance of the aligned models on MT Bench scores. Furthermore, we propose AEGIS, a novel application of a no-regret online adaptation framework with strong theoretical guarantees, to perform content moderation with an ensemble of LLM content safety experts in deployment.

Paper Structure

This paper contains 24 sections, 2 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Aegis: online adaptive safety content moderation
  • Figure 2: Aegis learns to choose the best expert over the time horizon. The EW algorithm is shown on the left and the perturbed EW version on the right. EW lets the learner latch on to the best expert from the start. If that expert later starts performing poorly, EW may stay with it and adapt only very slowly to the current best-performing expert over the time horizon. With the perturbed version, the learner can switch between experts with some randomness.
  • Figure 3: EW with perturbation averaged over $20$ trials
  • Figure 4: Heatmaps showing model prediction categories versus the ground truth critical risk categories of the OpenAI Moderation Dataset.
  • Figure 5: Performance on SimpleSafetyTests benchmark across HarmType and Elicitation Types
  • ...and 2 more figures
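
The behaviour described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that EW refers to the standard exponential-weights (Hedge) update, and that the perturbation mixes the sampling distribution with a uniform one. The paper's actual perturbation scheme and loss definition are not given in this section, so both are assumptions here. The names exponential_weights, eta, and perturb are hypothetical and chosen only for illustration.

```python
import math
import random


def exponential_weights(losses, eta=0.5, perturb=0.0, seed=0):
    """Pick one expert per round with exponential weights.

    losses  : list of rounds; each round is a list of per-expert losses in [0, 1]
    eta     : learning rate (assumed value, not taken from the paper)
    perturb : probability mass mixed in from a uniform distribution
              (an assumed form of perturbation; 0.0 gives plain EW sampling)
    Returns (chosen expert index per round, final normalized weights).
    """
    rng = random.Random(seed)
    n_experts = len(losses[0])
    weights = [1.0 / n_experts] * n_experts
    picks = []
    for round_losses in losses:
        # Sampling distribution: current weights, optionally mixed with uniform.
        probs = [(1.0 - perturb) * w + perturb / n_experts for w in weights]
        picks.append(rng.choices(range(n_experts), weights=probs)[0])
        # Multiplicative update: experts with higher loss lose weight.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, round_losses)]
        # Renormalize every round to avoid numerical underflow.
        total = sum(weights)
        weights = [w / total for w in weights]
    return picks, weights


if __name__ == "__main__":
    # Expert 0 is best for 30 rounds, then expert 1 becomes best.
    rounds = [[0.0, 1.0]] * 30 + [[1.0, 0.0]] * 30
    plain_picks, _ = exponential_weights(rounds, perturb=0.0)
    mixed_picks, _ = exponential_weights(rounds, perturb=0.2)
    print("plain EW picks of expert 1 in rounds 31-40:", sum(plain_picks[30:40]))
    print("perturbed picks of expert 1 in rounds 31-40:", sum(mixed_picks[30:40]))
```

In this sketch the weight update is identical in both variants; only the sampling step differs. Plain EW keeps almost all probability on the expert that was best early on until the accumulated losses even out, while the uniform mixing keeps a fixed chance of trying other experts right after a shift.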