SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization

Yue Huang, Xiangqi Wang, Xiangliang Zhang

TL;DR

This work introduces Priority Alignment, a lexicographic framework that requires a primary safety objective $G_a(\theta)$ to meet a threshold before a secondary objective $G_b(\theta)$, such as helpfulness, is optimized. It then presents Self-Priority Alignment (SPA), a fully unsupervised pipeline that generates diverse responses, self-evaluates them on both objectives, denoises the data, constructs lexicographic preference pairs, and updates the model with an uncertainty-weighted preference loss. SPA's three core components are (i) diverse sampling with self-refinement, (ii) dual-criterion denoising to filter unreliable signals, and (iii) lexicographic preference optimization via Uncertainty-Guided SimPO, enabling Pareto-informed, priority-respecting fine-tuning. Empirical results across multiple LLMs and high-stakes benchmarks show that SPA improves harmlessness and helpfulness while preserving general capabilities, outperforming baselines and demonstrating scalable, interpretable alignment for critical applications. The work also connects lexicographic ordering theoretically to the utility function used in Bradley–Terry-style models, which supports the method's principled prioritization of safety during optimization.

Abstract

In high-stakes scenarios, such as self-harm, legal, or medical queries, LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict "trustworthy-before-helpful" ordering: optimization of helpfulness is conditioned on first meeting trustworthiness thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA), a fully unsupervised framework in which the model generates diverse responses, evaluates and refines them itself, and applies dual-criterion denoising to remove inconsistency and control variance. From the denoised data, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA improves helpfulness without compromising safety, outperforming strong baselines while preserving general capabilities. Our results demonstrate that SPA provides a scalable and interpretable alignment strategy for critical LLM applications.
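The lexicographic pair construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Scored` record, the `build_pairs` helper, the score scale, and the example responses are all hypothetical, and the sketch leaves out sampling, denoising, and the uncertainty-weighted loss. It only shows the "trustworthy-before-helpful" comparison, with the helpfulness score used to break ties on the trustworthiness score.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Scored:
    """A response with two self-evaluation scores (hypothetical structure)."""
    text: str
    g_a: float  # primary score (trustworthiness), assumed higher is better
    g_b: float  # secondary score (helpfulness), assumed higher is better


def lex_key(r: Scored) -> tuple:
    # Python compares tuples lexicographically: g_a decides first,
    # g_b only matters when the g_a values are equal.
    return (r.g_a, r.g_b)


def build_pairs(responses):
    """Return (chosen, rejected) pairs ordered by the lexicographic key."""
    pairs = []
    for r1, r2 in combinations(responses, 2):
        if lex_key(r1) == lex_key(r2):
            continue  # no preference signal when both scores tie
        if lex_key(r1) > lex_key(r2):
            pairs.append((r1, r2))
        else:
            pairs.append((r2, r1))
    return pairs


# Illustrative scores only; these are not taken from the paper.
responses = [
    Scored("refuses with no guidance", g_a=9.0, g_b=2.0),
    Scored("safe and detailed guidance", g_a=9.0, g_b=8.0),
    Scored("detailed but risky advice", g_a=4.0, g_b=9.0),
]
pairs = build_pairs(responses)
for chosen, rejected in pairs:
    print(f"{chosen.text!r} preferred over {rejected.text!r}")
```

In this toy example the risky response is never chosen, even though it has the highest helpfulness score, because it loses on the primary score to both alternatives.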

Paper Structure

This paper contains 20 sections, 3 theorems, 27 equations, 10 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

If $\lambda > \frac{2M}{\min\{G_a(y_1) - G_a(y_2) \mid G_a(y_1) > G_a(y_2)\}}$, then for all $y_1, y_2 \in \mathcal{Y}$,
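The condition on $\lambda$ can be checked numerically under some loudly stated assumptions, none of which are confirmed by the text shown here: that $M$ bounds the secondary score ($|G_b(y)| \le M$), that the scalar utility has the form $\lambda G_a(y) + G_b(y)$, and that the intended conclusion is agreement between this utility and the lexicographic order. The helper names and the scores below are hypothetical.

```python
from itertools import permutations


def lambda_bound(scores, M):
    """2M divided by the smallest strictly positive gap in G_a."""
    gaps = [a1 - a2 for (a1, _), (a2, _) in permutations(scores, 2) if a1 > a2]
    return 2 * M / min(gaps)


def agrees(scores, lam):
    """True if the assumed utility lam*G_a + G_b ranks every ordered pair
    the same way as the lexicographic order on (G_a, G_b)."""
    for (a1, b1), (a2, b2) in permutations(scores, 2):
        lex = (a1, b1) > (a2, b2)
        scalar = lam * a1 + b1 > lam * a2 + b2
        if lex != scalar:
            return False
    return True


# Illustrative (G_a, G_b) scores with |G_b| <= M = 1; not from the paper.
scores = [(0.9, 0.2), (0.9, 0.8), (0.4, 1.0), (0.5, -1.0)]
M = 1.0
bound = lambda_bound(scores, M)

print("bound on lambda:", bound)
print("lambda above bound agrees:", agrees(scores, bound + 1.0))
print("lambda = 1 agrees:", agrees(scores, 1.0))
```

The intuition under these assumptions is that two responses can differ by at most $2M$ in $G_b$, so once $\lambda$ times the smallest positive $G_a$ gap exceeds $2M$, no difference in the secondary score can overturn a strict difference in the primary score.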

Figures (10)

  • Figure 1: Examples of achieving trustworthiness and helpfulness under high-stakes scenarios.
  • Figure 2: Overview of SPA, consisting of three components: diverse sampling with self-refinement, dual-criterion denoising, and priority alignment.
  • Figure 3: Effect of sample score variance (from low to high) on weak-strong model alignment (RV coefficient).
  • Figure 4: Results of pairwise comparison on different datasets. We use GPT-4o as the judge model.
  • Figure 5: Effect of multiple SPA iterations on WildGuard using LLaMA-3.1-8B-Instruct. "Iter 2 (diff.)" uses a new dataset in the second iteration, while "Iter 2 (same)" reuses the original data.
  • ...and 5 more figures

Theorems & Definitions (6)

  • Lemma 1
  • Proof of Lemma 1
  • Theorem 1: Optimal Strategy under the Bradley-Terry Model
  • Proof of Theorem 1
  • Theorem 2: Equivalence of Supervised Loss Minimizer and Bradley-Terry Policy
  • Proof of Theorem 2