Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations
Ryan Wong, Hosea David Yu Fei Ng, Dhananjai Sharma, Glenn Jun Jie Ng, Kavishvaran Srinivasan
TL;DR
This paper tackles the vulnerability of large language models to jailbreak exploits by proposing a structured taxonomy of defenses along the LLM pipeline and three complementary strategies that embed safety into prompting, modeling, and training. It introduces a Prompt-Level Defense Framework, a Logit-Based Steering Defense, and a MetaGPT-based Domain-Specific Agent Defense, and evaluates them on jailbreak benchmarks with aligned and unaligned models. The results show substantial reductions in jailbreak success, including full mitigation under the agent-based domain-defense pipeline, while highlighting trade-offs in safety, performance, and scalability. The work advances Responsible AI by identifying concrete intervention points and practical defense designs for safer real-world LLM deployments.
Abstract
Large Language Models (LLMs) remain susceptible to jailbreak exploits that bypass safety filters and induce harmful or unethical behavior. This work presents a systematic taxonomy of existing jailbreak defenses across prompt-level, model-level, and training-time interventions, followed by three proposed defense strategies. First, a Prompt-Level Defense Framework detects and neutralizes adversarial inputs through sanitization, paraphrasing, and adaptive system guarding. Second, a Logit-Based Steering Defense reinforces refusal behavior through inference-time vector steering in safety-sensitive layers. Third, a Domain-Specific Agent Defense employs the MetaGPT framework to enforce structured, role-based collaboration and domain adherence. Experiments on benchmark datasets show substantial reductions in attack success rate, with the agent-based defense achieving full mitigation. Overall, this study highlights that jailbreaks pose a significant security threat to LLMs and identifies key intervention points for prevention, while noting that defense strategies often involve trade-offs between safety, performance, and scalability. Code is available at: https://github.com/Kuro0911/CS5446-Project
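To make the idea of inference-time vector steering concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a difference-of-means steering direction computed from hidden activations on refusing versus complying examples, which is then added to a hidden state with a scaling factor. The names `steering_vector`, `steer`, and `alpha`, as well as the toy two-dimensional activations, are hypothetical and chosen only for illustration; the abstract does not specify how the steering vector is derived or which layers are treated as safety-sensitive.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(refusal_acts: List[Vector],
                    compliance_acts: List[Vector]) -> Vector:
    """Assumed recipe: mean refusal activation minus mean compliance activation."""
    r = mean_vector(refusal_acts)
    c = mean_vector(compliance_acts)
    return [ri - ci for ri, ci in zip(r, c)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction: h' = h + alpha * v."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, not taken from any model).
refusal_acts = [[1.0, 0.0], [3.0, 2.0]]     # mean is [2.0, 1.0]
compliance_acts = [[0.0, 1.0], [2.0, 1.0]]  # mean is [1.0, 1.0]

direction = steering_vector(refusal_acts, compliance_acts)
steered = steer([0.5, 0.5], direction, alpha=2.0)
```

In a real model, the shift would be applied to the residual stream of selected layers during the forward pass, and `alpha` would control the trade-off between stronger refusal behavior and degraded responses to benign prompts.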
