LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs
Piyush Jha, Arnav Arora, Vijay Ganesh
TL;DR
This work automates the generation of adversarial suffixes for jailbreaking safety-trained LLMs by means of a reinforcement learning loop. LLMStinger fine-tunes an attacker LLM (Gemma) using PPO with vector-valued, token-level feedback from a string similarity checker and a judgment model, enabling rapid discovery of new suffixes with only black-box access. Evaluated on HarmBench across open and closed models, the method reports substantial gains in attack success rate (ASR) (e.g., +57.2% on LLaMA2-7B-chat, +50.3% on Claude 2; 94.97% on GPT-3.5; 99.4% on Gemma-2B-it), outperforming 15 baseline red-teaming techniques. The results show that automated red-teaming can scale, which has implications for safety defenses and underlines the need for robust evaluation against evolving adversarial strategies in real-world deployments.
Abstract
We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open and closed-source models.
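To make the reported metric concrete, the following minimal sketch shows how an Attack Success Rate could be computed from per-prompt judgments. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `attack_success_rate` and `asr_gain_points` are hypothetical, the labels are made-up toy values, and the section does not state whether the reported gains (e.g., +57.2%) are absolute or relative, so the second helper computes an absolute difference purely as one possible reading.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return the ASR as a percentage: the share of attempts judged successful.

    Each entry is assumed to be a judge's yes/no verdict for one test prompt.
    """
    if not judgments:
        raise ValueError("need at least one judged attempt")
    return 100.0 * sum(judgments) / len(judgments)


def asr_gain_points(new_asr: float, baseline_asr: float) -> float:
    """Absolute difference between two ASR values, in percentage points.

    Assumption: gains are read as absolute differences; the section does not
    say whether its reported gains are absolute or relative.
    """
    return new_asr - baseline_asr


# Toy, made-up verdicts (not data from the paper).
toy_judgments = [True, False, True, True]
toy_asr = attack_success_rate(toy_judgments)
```

Reporting ASR as a percentage over a fixed prompt set keeps results comparable across target models, provided the same judge and the same prompts are used for every method being compared.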
