PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment

Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang

TL;DR

This paper proposes PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels. It introduces six contrasting strategies, none of which requires an additional feedback labeling procedure.

Abstract

Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to jailbreaking attacks. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do not require additional feedback labeling procedures. Regarding RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, leading to more comprehensive alignment.

Paper Structure

This paper contains 39 sections, 1 equation, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Illustration of the effects of alignment considering the contrasting patterns. $\pi_{ref}^i$ denotes the distribution of the reference model under pattern $i$. $\pi_{dpo_i}$ denotes the overall distribution of the model after DPO alignment on pattern $i$.
  • Figure 2: The workflow of PopAlign. PopAlign involves three kinds of contrasting strategies: (1) Prompt Contrast, comprising Prefix Contrast, Demon Contrast (i.e., Demonstration Contrast), and Elicitive Contrast; (2) Model Contrast, comprising NParam (number of parameters) Contrast and Leaderboard Contrast; and (3) Pipeline Contrast, comprising Refinement Contrast. By mixing the preference data synthesized with these diverse contrasting strategies and conducting DPO alignment training on the mixture, the LLM can be aligned without human annotation or reward labeling.
  • Figure 3: The Cumulative Effect of Different Contrasting Strategies. Starting with Prefix Contrast, new contrasting strategies are incrementally added to assess their cumulative effects.
  • Figure 4: The impact of each contrasting strategy.
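
The workflow in Figure 2 (synthesize contrastive pairs, mix them across strategies, then train with DPO) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: the two prefix strings, the `generate` callable standing in for an LLM, the `prompt`/`chosen`/`rejected` record layout, and the helper names. Only Prefix Contrast is shown, and `dpo_loss` is the standard per-pair DPO objective computed from sequence log-probabilities, not the authors' training code.

```python
import math
import random
from typing import Callable, Dict, List

# Hypothetical prefixes; the paper's actual prompts are not given in this summary.
GOOD_PREFIX = "You are a helpful and harmless assistant.\n"
BAD_PREFIX = "You are an unhelpful and careless assistant.\n"

Pair = Dict[str, str]


def prefix_contrast_pair(instruction: str, generate: Callable[[str], str]) -> Pair:
    """Build one preference pair by prompting the same model with contrasting prefixes.

    The stored prompt is the bare instruction, so no preference label is needed:
    the response elicited by the good prefix is simply treated as 'chosen'.
    """
    return {
        "prompt": instruction,
        "chosen": generate(GOOD_PREFIX + instruction),
        "rejected": generate(BAD_PREFIX + instruction),
        "strategy": "prefix",
    }


def mix_preference_data(per_strategy: Dict[str, List[Pair]], seed: int = 0) -> List[Pair]:
    """Concatenate the pairs from every strategy and shuffle them into one training set."""
    mixed = [pair for pairs in per_strategy.values() for pair in pairs]
    random.Random(seed).shuffle(mixed)
    return mixed


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for a single pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    x = -margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In use, `generate` would wrap a real model call, and the other five strategies would each contribute their own list of pairs to `mix_preference_data` before training. The loss falls below `log 2` once the policy favors the chosen response more strongly than the reference model does.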