Don't Command, Cultivate: An Exploratory Study of System-2 Alignment
Yuhang Wang, Yuxiang Zhang, Yanxu Zhu, Xinyan Wen, Jitao Sang
TL;DR
This paper investigates System-2 Alignment to improve AI safety by cultivating deliberative reasoning rather than command-like responses. It evaluates the o1 model's safety against adversarial jailbreaking and math-encoded prompts, analyzing reasoning patterns to identify safety improvements and residual vulnerabilities. For open-source models, it studies four System-2 techniques (prompt engineering, supervised fine-tuning, direct preference optimization, and reinforcement learning) using a WildJailbreak test set and a proposed process-supervision framework. Reinforcement learning is formulated as a language-augmented Markov Decision Process $\mathcal{M} = (\mathcal{V}, \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where the initial state $s_0$ is the prompt $Q_i$, states evolve as $s_{t+1} = \mathcal{T}(s_t, a_t)$, and $\mathcal{R}$ is the final reward. Overall, the findings indicate that System-2 alignment can enhance safety but requires carefully balancing not_unsafe against not_overrefuse; future work on transparency and process supervision is warranted.
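To make the MDP notation concrete, the following is a minimal Python sketch of a single rollout, not the authors' implementation. It assumes that a state is the prompt plus the reasoning steps generated so far and that the transition $\mathcal{T}$ simply appends the new step; the names `transition`, `rollout`, `toy_policy`, `toy_reward`, and the `<eos>` stop marker are hypothetical and chosen only for illustration.

```python
from typing import Callable, Tuple

# A state is the prompt Q_i followed by the reasoning steps taken so far.
State = Tuple[str, ...]


def transition(state: State, action: str) -> State:
    """T(s_t, a_t): assumed here to append the new step to the state."""
    return state + (action,)


def rollout(
    prompt: str,
    policy: Callable[[State], str],
    reward_fn: Callable[[State], float],
    max_steps: int = 8,
    stop: str = "<eos>",
) -> Tuple[State, float]:
    """Run one episode from s_0 = prompt and score only the final state."""
    state: State = (prompt,)  # s_0 is the prompt Q_i
    for _ in range(max_steps):
        action = policy(state)              # a_t chosen by the policy
        state = transition(state, action)   # s_{t+1} = T(s_t, a_t)
        if action == stop:
            break
    return state, reward_fn(state)          # final reward R


# Toy stand-ins for a language model policy and a safety reward (assumptions).
SCRIPT = ["analyze the request", "check for harmful intent", "<eos>"]


def toy_policy(state: State) -> str:
    # Emit the next scripted step; len(state) - 1 steps have been taken so far.
    return SCRIPT[len(state) - 1]


def toy_reward(state: State) -> float:
    # Reward the episode only if it contains an explicit safety check.
    return 1.0 if "check for harmful intent" in state else 0.0


final_state, reward = rollout("Q_i", toy_policy, toy_reward)
```

In a real setup the policy would be a language model and the reward would come from a safety judge; the scripted policy here only shows how states, actions, and the final reward fit together.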
Abstract
The o1 system card identifies the o1 models as the most robust of OpenAI's models, with their defining characteristic being the progression from rapid, intuitive thinking to slower, more deliberate reasoning. This observation motivated us to investigate the influence of System-2 thinking patterns on model safety. In our preliminary research, we conducted safety evaluations of the o1 model, including complex jailbreak attack scenarios using adversarial natural language prompts and mathematical encoding prompts. Our findings indicate that the o1 model demonstrates relatively improved safety performance; however, it still exhibits vulnerabilities, particularly against jailbreak attacks employing mathematical encoding. Through detailed case analysis, we identified specific patterns in the o1 model's responses. We also explored the alignment of System-2 safety in open-source models using prompt engineering and supervised fine-tuning techniques. Experimental results show that some simple methods that encourage the model to carefully scrutinize user requests are beneficial for model safety. Additionally, we proposed an implementation plan for process supervision to enhance safety alignment. The implementation details and experimental results will be provided in future versions.
