A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content
Chaima Njeh, Haïfa Nakouri, Fehmi Jaafar
TL;DR
The paper addresses safety concerns in large language models (LLMs) with a data-centric, post-generation correction approach built around the BART-Corrective Model. A two-stage framework first detects unsafe outputs with LangKit and then detoxifies them with a fine-tuned BART model; a threshold parameter $\tau$ governs when corrections are applied, based on the toxicity scores $s_i^{LLM}$ and $s_i^{BART}$. Trained on the hh-rlhf dataset, the system reduces mean toxicity and jail-breaking scores across GPT-4, PaLM2, Mistral-7B, and Gemma-2b-it: by 15% and 21% for GPT-4, 28% and 5% for PaLM2, approximately 26% and 23% for Mistral-7B, and 11.1% and 19% for Gemma-2b-it. A comparison with a data-centric projection filter (ProFS) indicates that post-generation detoxification is more compatible with API-based LLMs and offers robust, scalable improvements, while ProFS provides complementary insight into embedding-level toxicity mitigation. Overall, the work presents a practical, scalable method for improving LLM safety in real-world applications while reducing reliance on costly or brittle model fine-tuning and prompt engineering.
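The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact decision rule, so the sketch assumes that correction is triggered when $s_i^{LLM}$ exceeds $\tau$ and that the corrected text is kept only when $s_i^{BART}$ is lower than $s_i^{LLM}$. The names `correct_if_unsafe`, `toy_score`, and `toy_corrector` are hypothetical; the toy scorer and corrector merely stand in for LangKit and the fine-tuned BART model and do not call either library.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical word list used only by the toy stand-ins below.
BLOCKLIST = {"idiot", "stupid"}


@dataclass
class Result:
    text: str        # text that is finally returned to the user
    score: float     # toxicity score of the returned text
    corrected: bool  # True if the corrector's output was accepted


def correct_if_unsafe(
    response: str,
    score_fn: Callable[[str], float],
    corrector_fn: Callable[[str], str],
    tau: float = 0.5,
) -> Result:
    """Assumed rule: correct only when s_llm > tau, accept only if s_bart < s_llm."""
    s_llm = score_fn(response)            # stage 1: detection
    if s_llm <= tau:
        return Result(response, s_llm, False)
    candidate = corrector_fn(response)    # stage 2: detoxification
    s_bart = score_fn(candidate)
    if s_bart < s_llm:
        return Result(candidate, s_bart, True)
    return Result(response, s_llm, False)  # correction did not help; keep original


def toy_score(text: str) -> float:
    """Toy stand-in for a toxicity scorer: fraction of blocklisted words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!?") in BLOCKLIST for w in words) / len(words)


def toy_corrector(text: str) -> str:
    """Toy stand-in for a corrective model: drops blocklisted words."""
    return " ".join(
        w for w in text.split() if w.strip(".,!?").lower() not in BLOCKLIST
    )


if __name__ == "__main__":
    print(correct_if_unsafe("You are stupid", toy_score, toy_corrector, tau=0.2))
    print(correct_if_unsafe("Have a nice day", toy_score, toy_corrector, tau=0.2))
```

Because the scorer and corrector are passed in as plain callables, the same wrapper applies to any LLM whose output is available as text, which matches the summary's point that post-generation correction suits API-based models.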
Abstract
Large Language Models (LLMs) have made remarkable progress, but concerns about potential biases and harmful content persist. To address these concerns, we introduce a practical solution for ensuring the safe and ethical use of LLMs. Our novel approach focuses on a post-generation correction mechanism, the BART-Corrective Model, which adjusts generated content to ensure safety and security. Rather than relying solely on model fine-tuning or prompt engineering, our method provides a robust data-centric alternative for mitigating harmful content. We demonstrate the effectiveness of our approach through experiments on multiple toxic datasets, which show a significant reduction in mean toxicity and jail-breaking scores after integration. Specifically, our results show a reduction of 15% and 21% in mean toxicity and jail-breaking scores with GPT-4, a substantial reduction of 28% and 5% with PaLM2, a reduction of approximately 26% and 23% with Mistral-7B, and a reduction of 11.1% and 19% with Gemma-2b-it. These results demonstrate the potential of our approach to improve the safety and security of LLMs, making them more suitable for real-world applications.
