Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models?

Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad

TL;DR

The paper investigates the persistence of harmful outputs from aligned LLMs and the difficulty of jailbreaking black-box models. It introduces a reinforcement-learning framework that optimizes adversarial triggers using only inference API access and a small surrogate model, guided by a BERTScore-based reward. Key contributions include a two-phase training protocol, a BERTScore-based objective, and demonstrated transferability of triggers to a previously untested black-box model (Mistral) with improved attack success rates. The work highlights persistent vulnerabilities in LLM safety and motivates development of stronger defenses, safety monitoring, and robust prompting strategies for real-world deployments.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks, but their safety and morality remain contentious due to their training on internet text corpora. To address these concerns, alignment techniques have been developed to improve the public usability and safety of LLMs. Yet, the potential for generating harmful content through these models seems to persist. This paper explores the concept of jailbreaking LLMs, that is, reversing their alignment through adversarial triggers. Previous methods, such as soft embedding prompts, manually crafted prompts, and gradient-based automatic prompts, have had limited success on black-box models: they either require access to the model or yield only a low variety of manually crafted prompts, which makes them susceptible to being blocked. This paper introduces a novel approach using reinforcement learning to optimize adversarial triggers, requiring only inference API access to the target model and a small surrogate model. Our method, which leverages a BERTScore-based reward function, enhances the transferability and effectiveness of adversarial triggers on new black-box models. We demonstrate that this approach improves the performance of adversarial triggers on a previously untested language model.

Paper Structure (18 sections, 4 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Overall architecture of our method. The surrogate model is first initialized in a supervised fine-tuning setup and is then further fine-tuned toward the target model using reward signals. BERTScore serves as the Semantic Similarity Function: it compares the generation produced under the current adversarial trigger with the desired target output, and the resulting score is used to reward the surrogate model.
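
To make the Semantic Similarity Function in Figure 1 more concrete, here is a minimal sketch of the greedy-matching F1 computation that BERTScore is built on. It is not the authors' implementation. The function names (`cosine`, `bertscore_f1`) are hypothetical, and the inputs are toy per-token vectors rather than real contextual embeddings from a BERT model. The sketch also omits the optional idf weighting and baseline rescaling of the full metric.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bertscore_f1(candidate: Sequence[Vector], reference: Sequence[Vector]) -> float:
    """Greedy-matching F1 over token embeddings (hypothetical helper).

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its
    most similar candidate token. F1 is their harmonic mean.
    """
    if not candidate or not reference:
        return 0.0
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    # Guard for this sketch only: avoid dividing by a non-positive sum.
    if precision + recall <= 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy 2-dimensional "token embeddings" (assumed values, for illustration).
generated = [[1.0, 0.0], [0.0, 1.0]]   # stands in for the model's generation
desired = [[1.0, 0.0]]                 # stands in for the desired target output
reward = bertscore_f1(generated, desired)
```

In a setup like the one in Figure 1, a score of this kind would be computed between the target model's response and the desired output, then passed to the reinforcement-learning update as a scalar reward. A continuous similarity score gives a graded signal even when the response only partially matches the target.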