SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization
Yue Huang, Xiangqi Wang, Xiangliang Zhang
TL;DR
This work introduces Priority Alignment, a lexicographic framework that enforces a primary safety objective $G_a(\theta)$ to meet a threshold before optimizing a secondary objective $G_b(\theta)$ such as helpfulness. It then presents Self-Priority Alignment (SPA), a fully unsupervised pipeline that generates diverse responses, self-evaluates them on both objectives, denoises the data, constructs lexicographic preference pairs, and updates the model with an uncertainty-weighted preference loss. SPA's three core components are (i) diverse sampling with self-refinement, (ii) dual-criterion denoising to filter unreliable signals, and (iii) lexicographic preference optimization via Uncertainty-Guided SimPO, enabling Pareto-informed, priority-respecting fine-tuning. Empirical results across multiple LLMs and high-stakes benchmarks show that SPA improves harmlessness and helpfulness while preserving general capabilities, outperforming baselines and demonstrating scalable, interpretable alignment for critical applications. The work also provides theoretical connections between lexicographic ordering and the utility function used in Bradley–Terry-style models, supporting the method’s principled prioritization of safety in optimization.
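To make the lexicographic pairing step concrete, here is a minimal Python sketch of how preference pairs could be built from self-evaluated responses: the primary (safety) score decides the pair, and the secondary (helpfulness) score is consulted only when the primary scores tie. The `Scored` record, the score scale, and the `tie_margin` parameter are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Scored(NamedTuple):
    """A candidate response with self-assessed scores (hypothetical schema)."""
    text: str
    safety: float       # primary objective, e.g. harmlessness
    helpfulness: float  # secondary objective


def lexicographic_pairs(
    cands: List[Scored], tie_margin: float = 0.0
) -> List[Tuple[Scored, Scored]]:
    """Return (chosen, rejected) pairs under a safety-first ordering.

    tie_margin is an assumed tolerance: safety scores that differ by no
    more than this amount are treated as tied.
    """
    pairs = []
    for a, b in combinations(cands, 2):
        gap = a.safety - b.safety
        if abs(gap) > tie_margin:
            # The primary objective decides; helpfulness is ignored.
            chosen, rejected = (a, b) if gap > 0 else (b, a)
        elif a.helpfulness != b.helpfulness:
            # Primary tie: fall back to the secondary objective.
            chosen, rejected = (a, b) if a.helpfulness > b.helpfulness else (b, a)
        else:
            continue  # fully tied, so the pair carries no preference signal
        pairs.append((chosen, rejected))
    return pairs


candidates = [
    Scored("a", safety=9, helpfulness=3),
    Scored("b", safety=9, helpfulness=7),
    Scored("c", safety=4, helpfulness=10),
]
pairs = lexicographic_pairs(candidates)
```

In this toy input, response "c" is the most helpful but loses to both "a" and "b" because its safety score is lower; "b" beats "a" only through the helpfulness tie-break.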
Abstract
In high-stakes scenarios, such as self-harm, legal, or medical queries, LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict "trustworthy-before-helpful" ordering: optimization of helpfulness is conditioned on first meeting trustworthiness thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA), a fully unsupervised framework in which the model generates diverse responses, self-evaluates and refines them, and applies dual-criterion denoising to remove inconsistency and control variance. From the denoised data, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA improves helpfulness without compromising safety, outperforming strong baselines while preserving general capabilities. Our results demonstrate that SPA provides a scalable and interpretable alignment strategy for critical LLM applications.
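As a rough illustration of an uncertainty-weighted preference loss, the Python sketch below scales a SimPO-style per-pair loss by a weight. The margin uses SimPO's standard form (length-normalized log-probabilities scaled by $\beta$, minus a target margin $\gamma$). The weighting rule in `pair_weight`, which multiplies a confidence value by a normalized score gap, is purely an assumption chosen to reflect "high-confidence, high-gap" emphasis; the abstract does not specify the actual formula, and the default values of `beta` and `gamma` are placeholders.

```python
import math


def simpo_margin(logp_chosen: float, len_chosen: int,
                 logp_rejected: float, len_rejected: int,
                 beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO-style margin: beta * (avg log-prob gap) - gamma."""
    return beta * (logp_chosen / len_chosen - logp_rejected / len_rejected) - gamma


def pair_weight(confidence: float, score_gap: float, max_gap: float) -> float:
    """Hypothetical weight in [0, 1]: larger for confident, wide-gap pairs."""
    return max(0.0, min(1.0, confidence)) * min(abs(score_gap) / max_gap, 1.0)


def weighted_simpo_loss(margin: float, weight: float) -> float:
    """weight * -log(sigmoid(margin)), computed in a numerically stable way."""
    # -log(sigmoid(m)) = softplus(-m) = max(-m, 0) + log1p(exp(-|m|))
    softplus_neg = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return weight * softplus_neg


# Toy numbers: summed token log-probabilities and response lengths.
m = simpo_margin(logp_chosen=-10.0, len_chosen=10,
                 logp_rejected=-30.0, len_rejected=10)
w = pair_weight(confidence=0.8, score_gap=5.0, max_gap=10.0)
loss = weighted_simpo_loss(m, w)
```

Under this assumed weighting, a pair whose scores are uncertain or nearly tied contributes little to the gradient, while a confident pair with a wide score gap dominates the update.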
