QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou
TL;DR
QA-LIGN replaces opaque scalar rewards with interpretable, principle-based evaluation programs that decompose alignment into Harmlessness, Honesty, and Helpfulness. Using a draft–reflect–revise pipeline trained with GRPO, the method produces transparent feedback and multi-axis rewards that are aggregated into a single training signal. Empirically, QA-LIGN achieves Pareto-optimal safety–helpfulness tradeoffs: it substantially reduces attack success rates while keeping false-refusal rates low and preserving reasoning capabilities, and it outperforms both DPO and GRPO baselines given comparable training. The work demonstrates that interpretable, modular reward design can improve alignment effectiveness without sacrificing performance, offering a practical path toward safer and more controllable LLMs.
Abstract
Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to an uncensored Llama-3.1-8B-Instruct model, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto-optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting that transparency enhances LLM safety.
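
As a rough illustration of how per-principle scores might be collapsed into one training signal and then turned into GRPO-style group-relative advantages, consider the minimal Python sketch below. The weighted mean, the [0, 1] score range, and the names `aggregate` and `group_relative_advantages` are assumptions made for illustration only; they are not the aggregation rule or implementation used by QA-LIGN.

```python
from statistics import mean, pstdev

# The three principles named in the abstract.
PRINCIPLES = ("harmlessness", "honesty", "helpfulness")


def aggregate(scores, weights=None):
    """Collapse per-principle scores into a single scalar reward.

    Assumption: a weighted mean over scores in [0, 1]. This is a
    stand-in for whatever aggregation the method actually uses.
    """
    if weights is None:
        weights = {p: 1.0 for p in PRINCIPLES}
    total_weight = sum(weights[p] for p in PRINCIPLES)
    return sum(weights[p] * scores[p] for p in PRINCIPLES) / total_weight


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, GRPO-style.

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores for two responses to the same prompt.
group_scores = [
    {"harmlessness": 1.0, "honesty": 1.0, "helpfulness": 1.0},
    {"harmlessness": 0.0, "honesty": 0.0, "helpfulness": 0.0},
]
group_rewards = [aggregate(s) for s in group_scores]
group_advantages = group_relative_advantages(group_rewards)
```

Because the per-principle scores remain available before aggregation, one can inspect which axis pulled a response's reward down, which is the kind of transparency a single opaque scalar does not offer.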
