
Generalization Limits of Reinforcement Learning Alignment

Haruhi Shida, Koo Imai, Keigo Kansa

Abstract

The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not create new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose "compound jailbreaks" targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques, each individually defended against, to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3% with individual methods to 71.4% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.
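To make the reported metric concrete, here is a minimal sketch of how an attack success rate is computed as successes over trials. The per-condition trial count of 7 is an assumption for illustration (the abstract reports only the percentages; 14.3% and 71.4% are consistent with 1/7 and 5/7 successes).

```python
# Minimal sketch: attack success rate (ASR) as successes / trials.
# The 7-trial-per-condition counts below are hypothetical; the paper
# reports only the percentages 14.3% and 71.4%.

def asr(successes: int, trials: int) -> float:
    """Attack success rate as a percentage, rounded to one decimal place."""
    return round(100.0 * successes / trials, 1)

# Hypothetical counts consistent with the reported figures:
individual = asr(1, 7)  # single attack element -> 14.3
compound = asr(5, 7)    # all elements combined -> 71.4

print(individual, compound)  # prints "14.3 71.4"
```

Under this reading, the jump from individual to compound attacks corresponds to four additional successes out of seven attempts per condition.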

Paper Structure

This paper contains 22 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Compound Jailbreak framework. Three attack elements (contrastive structure, authoritative persona, self-assessment demand) are combined to saturate cognitive resources and bypass safety mechanisms.
  • Figure 2: Relationship between the number of combined attack elements and attack success rate. The horizontal axis shows the number of combined attack elements (1: individual method, 2: two-element combination, 3: all elements combined), and the vertical axis shows the attack success rate. ASR increased with the number of combined elements.