Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions

Jingxin Xu, Guoshun Nan, Sheng Guan, Sicong Leng, Yilian Liu, Zixiao Wang, Yuyang Ma, Zhili Zhou, Yanzhao Hou, Xiaofeng Tao

TL;DR

PT-ALIGN introduces a safety self-alignment framework that uses minimal human input by automatically refining polarized positive and toxic samples and applying fine-grained dual instruction tuning. The method combines MLE on positive samples with token-level unlikelihood training on severely toxic negatives, guided by self-constraints and red-teaming, to decouple safety from effectiveness. Empirical results across nine open-source LLMs show substantial safety gains with little to no loss in helpfulness or general performance, and improved resistance to jailbreak attacks. The approach leverages polarized supervisory signals to enhance safety learning while remaining practical for smaller models and scalable data generation. This work suggests a viable path toward cost-efficient, robust safety alignment in LLMs through self-guided data synthesis and token-level dual optimization.

Abstract

Recent AI agents, such as ChatGPT and LLaMA, primarily rely on instruction tuning and reinforcement learning to calibrate the output of large language models (LLMs) with human intentions, ensuring the outputs are harmless and helpful. Existing methods depend heavily on the manual annotation of high-quality positive samples, while contending with issues such as noisy labels and minimal distinctions between preferred and dispreferred response data. However, readily available toxic samples with clear safety distinctions are often filtered out, removing valuable negative references that could aid LLMs in safety alignment. In response, we propose PT-ALIGN, a novel safety self-alignment approach that minimizes human supervision by automatically refining positive and toxic samples and performing fine-grained dual instruction tuning. Positive samples are harmless responses, while toxic samples deliberately contain extremely harmful content, serving as a new supervisory signal. Specifically, we use the LLM itself to iteratively generate and refine training instances, relying on fewer than 50 human annotations. We then employ two losses, i.e., maximum likelihood estimation (MLE) and fine-grained unlikelihood training (UT), to jointly enhance the LLM's safety. The MLE loss encourages an LLM to maximize the generation of harmless content based on positive samples. Conversely, the fine-grained UT loss guides the LLM to minimize the output of harmful words based on negative samples at the token level, thereby guiding the model to decouple safety from effectiveness, directing it toward safer fine-tuning objectives, and increasing the likelihood of generating helpful and reliable content. Experiments on 9 popular open-source LLMs demonstrate the effectiveness of PT-ALIGN for safety alignment, while maintaining comparable levels of helpfulness and usefulness.
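To make the dual objective concrete, the following is a minimal sketch of how an MLE term on a positive sample and a token-level unlikelihood term on a toxic sample could be combined. It uses the commonly cited unlikelihood form, -log(1 - p(token)), which may differ from the paper's exact equations. The function names, the `harmful_mask` argument, and the `ut_weight` parameter are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one step's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mle_loss(step_logits, target_ids):
    """Mean negative log-likelihood of the target tokens (positive sample)."""
    losses = []
    for logits, tok in zip(step_logits, target_ids):
        losses.append(-math.log(softmax(logits)[tok]))
    return sum(losses) / len(losses)

def unlikelihood_loss(step_logits, target_ids, harmful_mask, eps=1e-8):
    """Mean of -log(1 - p(token)) over tokens flagged as harmful.

    harmful_mask is a hypothetical per-token flag; how harmful tokens are
    identified is not specified here and is an assumption of this sketch.
    """
    losses = []
    for logits, tok, flagged in zip(step_logits, target_ids, harmful_mask):
        if flagged:
            p = softmax(logits)[tok]
            losses.append(-math.log(max(1.0 - p, eps)))
    return sum(losses) / len(losses) if losses else 0.0

def dual_loss(positive, toxic, ut_weight=1.0):
    """Combine both terms; ut_weight is an assumed balancing coefficient.

    positive = (step_logits, target_ids)
    toxic    = (step_logits, target_ids, harmful_mask)
    """
    return mle_loss(*positive) + ut_weight * unlikelihood_loss(*toxic)

# Toy usage: a 4-token vocabulary and two decoding steps per sample.
pos = ([[2.0, 0.1, 0.1, 0.1], [0.1, 2.0, 0.1, 0.1]], [0, 1])
tox = ([[0.1, 0.1, 2.0, 0.1], [0.1, 0.1, 0.1, 2.0]], [2, 3], [True, False])
total = dual_loss(pos, tox, ut_weight=0.5)
```

Restricting the unlikelihood term to flagged tokens is one way to read "fine-grained": benign tokens in a toxic response (function words, punctuation) are not penalized, so the penalty targets harmful content instead of fluent language in general.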

Paper Structure

This paper contains 33 sections, 4 equations, 10 figures, 10 tables, 2 algorithms.

Figures (10)

  • Figure 1: Real cases in which LLMs, including Vicuna-13B-Chat, LLaMA-13B, and DeepSeek-V3 [liu2024deepseek], generate harmful content in response to malicious instructions. These LLMs are tuned on general-purpose datasets such as Alpaca [taori2023stanford] and ShareGPT [chiang2023vicuna]. However, the toxic outputs generated by these models are often discarded, which prevents the models from learning from their mistakes.
  • Figure 2: Overview of the proposed PT-ALIGN: an illustration of the essential pipeline across its three processes. The unaligned LLM acts as both the Red Team and the annotator.
  • Figure 3: Pipeline for synthesizing a safety alignment dataset. Steps 1 to 3: The model subdivides a large number of safety topics based on the given safety domains, then synthesizes a large volume of instructions using these topics and ten seed examples. Steps 4 and 5: The model continues from a manually written constraint to generate a complete self-constraint for use as ICL text (a similar process for negative constraints). Step 6: The self-constraint prompt and three seeds equipped with inner thoughts collectively guide the model in annotating the instructions.
  • Figure 4: The impact of the number of positive and toxic samples on our PT-ALIGN. The vertical axis represents the percentage of accuracy for the HHH Evaluation, and the horizontal axis represents the number of samples. The dashed line represents the baseline’s original performance. The green solid line illustrates the variation in the harmless metric, while the blue solid line represents changes in the helpfulness metric.
  • Figure 5: Visualization using 2D PCA: Comparison between positive & toxic samples and preference-based samples.
  • ...and 5 more figures