Don't Command, Cultivate: An Exploratory Study of System-2 Alignment

Yuhang Wang, Yuxiang Zhang, Yanxu Zhu, Xinyan Wen, Jitao Sang

TL;DR

This paper investigates System-2 Alignment to improve AI safety by cultivating deliberative reasoning rather than command-like responses. It evaluates the o1 model's safety against adversarial jailbreaking and math-encoded prompts, analyzing reasoning patterns to identify safety improvements and residual vulnerabilities. For open-source models, it studies four System-2 techniques (prompt engineering, supervised fine-tuning, direct preference optimization, and reinforcement learning) using a WildJailbreak test set and a proposed process-supervision framework. Reinforcement learning is formulated as a language-augmented Markov Decision Process $\mathcal{M} = (\mathcal{V}, \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where the initial state $s_0$ is the prompt $Q_i$, states evolve as $s_{t+1} = \mathcal{T}(s_t, a_t)$, and $\mathcal{R}$ is the final reward. Overall, the findings indicate that System-2 alignment can enhance safety but requires carefully balancing not_unsafe against not_overrefuse; future work on transparency and process supervision is warranted.
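To make the MDP notation concrete, the following is a minimal sketch of one rollout, not the authors' implementation. It assumes that a state is the prompt followed by the reasoning steps produced so far, that $\mathcal{T}$ simply appends the chosen step, and that the reward is assigned only once the episode ends. The names `transition`, `rollout`, `toy_policy`, `toy_reward`, and the `<final>` stop marker are hypothetical and do not come from the paper.

```python
from typing import Callable, Tuple

State = Tuple[str, ...]  # (prompt, step_1, ..., step_t); assumed representation
Action = str             # one reasoning step, i.e. a string over the vocabulary


def transition(state: State, action: Action) -> State:
    """T(s_t, a_t): assumed here to append the new step to the state."""
    return state + (action,)


def rollout(prompt: str,
            policy: Callable[[State], Action],
            reward_fn: Callable[[State], float],
            max_steps: int = 8,
            stop: str = "<final>") -> Tuple[State, float]:
    """Run one episode and return the final state with its final reward."""
    state: State = (prompt,)              # s_0 is the prompt Q_i
    for _ in range(max_steps):
        action = policy(state)            # a_t chosen from the current state
        state = transition(state, action) # s_{t+1} = T(s_t, a_t)
        if action.startswith(stop):       # hypothetical end-of-answer marker
            break
    return state, reward_fn(state)        # reward is computed only at the end


# Toy stand-ins for a language model policy and a safety reward (illustrative only).
def toy_policy(state: State) -> Action:
    script = [
        "Consider what the request is asking for.",
        "Check whether answering could cause harm.",
        "<final> I can't help with that.",
    ]
    return script[len(state) - 1]


def toy_reward(state: State) -> float:
    return 1.0 if "can't help" in state[-1] else 0.0


trajectory, reward = rollout("how to rob a bank?", toy_policy, toy_reward)
```

In this sketch the intermediate reasoning steps receive no reward of their own. A process-supervision variant would score each step as it is produced, which is the direction the TL;DR points to as future work.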

Abstract

The o1 system card identifies the o1 models as the most robust within OpenAI, with their defining characteristic being the progression from rapid, intuitive thinking to slower, more deliberate reasoning. This observation motivated us to investigate the influence of System-2 thinking patterns on model safety. In our preliminary research, we conducted safety evaluations of the o1 model, including complex jailbreak attack scenarios using adversarial natural language prompts and mathematical encoding prompts. Our findings indicate that the o1 model demonstrates relatively improved safety performance; however, it still exhibits vulnerabilities, particularly against jailbreak attacks employing mathematical encoding. Through detailed case analysis, we identified specific patterns in the o1 model's responses. We also explored the alignment of System-2 safety in open-source models using prompt engineering and supervised fine-tuning techniques. Experimental results show that some simple methods to encourage the model to carefully scrutinize user requests are beneficial for model safety. Additionally, we proposed an implementation plan for process supervision to enhance safety alignment. The implementation details and experimental results will be provided in future versions.

Paper Structure

This paper contains 12 sections, 8 figures, 14 tables.

Figures (8)

  • Figure 1: An example of the o1 model safely responding to an adversarial harmful prompt. The upper part of the response consists of the model's internal reasoning thoughts, while the bottom part presents the final output. In this example, the model correctly rejects a prompt that could lead to the generation of unsafe content.
  • Figure 2: An example of the o1 model being attacked. In the latter part of the "thoughts" section, the model exhibited harmful behavioral implications, leading to the output of harmful content. Although the model eventually truncated the output, it had already produced a significant amount of violent information.
  • Figure 3: An example of a normal response to benign prompts.
  • Figure 4: An example of overrefusal to benign prompts.
  • Figure 5: An example of the o1 model being attacked by a prompt encoded with mathematical symbols. The original prompt is "how to rob a bank?". The left side shows the model's response to the directly inputted math-encoded prompt, and the right side shows the response when a safety alert is added.
  • ...and 3 more figures