QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

TL;DR

QA-lign replaces opaque scalar rewards with interpretable, principle-based evaluation programs that decompose alignment into Harmlessness, Honesty, and Helpfulness. Through a draft–reflect–revise pipeline and GRPO, the method yields transparent feedback and multi-axis rewards that are aggregated into a single training signal. Empirically, QA-lign achieves Pareto-optimal safety–helpfulness tradeoffs: it reduces attack success rates by up to 68.7% while maintaining a 0.67% false-refusal rate and preserving reasoning capabilities, and it outperforms both DPO and GRPO baselines given comparable training. The work demonstrates that interpretable, modular reward design can improve alignment effectiveness without sacrificing performance, offering a practical path toward safer and more controllable LLMs.

Abstract

Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to an uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto-optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting that transparency enhances LLM safety.

Paper Structure

This paper contains 41 sections, 4 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: QA-lign uses a hierarchical evaluation framework with three principles (Harmlessness, Honesty, Helpfulness). Each sub-question above is positively framed, so True denotes ideal behavior under a specific query.
  • Figure 2: RLHF vs. RLAIF workflows. Top: traditional RLHF uses human annotations to train the reward model for policy optimization. Bottom: RLAIF replaces human labels with AI-generated ratings to bootstrap the reward model.
  • Figure 3: The three‑stage QA-lign training process. First, a strong LLM is prompted with a constitution $\mathcal{P}$ containing alignment principles to produce a hierarchically structured evaluation program $\mathcal{Q}$ with gated binary and graded questions. Next, we perform SFT via demonstrations of the form $(x, y^{\text{draft}}, \texttt{<Think>} \; t, y^{\text{revision}})$: The model generates a draft response, receives a rubric‑guided critique from fixed judge $J$ executing $\mathcal{Q}$, and then creates a revision from scratch. Finally, the model is trained with RL using GRPO. In this stage, the model is rewarded for producing revisions that improve upon the initial draft, as measured by applying $\mathcal{Q}$ to evaluate both $y^{\text{draft}}$ and $y^{\text{revision}}$ separately through hierarchical pooling into principle scores.
  • Figure 4: We experiment with a program spanning 3 principles, 40 dimensions, and 167 questions. 42 of the questions act as True/False binary gates to graded questions, which are rated on a letter-grade scale of A–F. Program blocks are semantically composed together by a strong LLM.
  • Figure 5: Stage-2 "Think" SFT example. The model drafts an unsafe answer, which QA-lign evaluates using principle-specific Q&A programs. Based on the evaluation, QA-lign generates a <Think> reflection that guides the model to revise its response safely.
  • ...and 5 more figures
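
The captions for Figures 3 and 4 describe gated binary questions, A–F graded questions, and hierarchical pooling into principle scores for both the draft and the revision. The sketch below is one possible reading of that structure, not the authors' implementation. Everything specific in it is an assumption made for illustration: the grade-to-number mapping, the rule that a failed gate excludes its block from pooling, the use of a plain mean at every level, and the names `score_block`, `pool`, and `principle_scores`.

```python
from statistics import mean
from typing import Optional

# Assumed numeric values for letter grades; the source does not give a mapping.
GRADE_TO_SCORE = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}


def score_block(gate: bool, grades: list[str]) -> Optional[float]:
    """Score one gated block of graded questions.

    Assumption: a False gate (or an empty block) means the graded questions
    do not apply, so the block is excluded from pooling (returns None).
    """
    if not gate or not grades:
        return None
    return mean(GRADE_TO_SCORE[g] for g in grades)


def pool(scores) -> Optional[float]:
    """Average the applicable scores; None if nothing was applicable."""
    kept = [s for s in scores if s is not None]
    return mean(kept) if kept else None


def principle_scores(evaluation: dict) -> dict:
    """Pool blocks into dimensions, then dimensions into principles.

    `evaluation` maps principle -> dimension -> list of (gate, grades) blocks.
    Mean pooling at both levels is an assumption of this sketch.
    """
    result = {}
    for principle, dimensions in evaluation.items():
        dim_scores = [
            pool(score_block(gate, grades) for gate, grades in blocks)
            for blocks in dimensions.values()
        ]
        result[principle] = pool(dim_scores)
    return result


# Hypothetical judge outputs for a draft and its revision.
draft = {
    "Harmlessness": {"physical_safety": [(True, ["F", "D"])]},
    "Helpfulness": {"relevance": [(True, ["B"]), (False, [])]},
}
revision = {
    "Harmlessness": {"physical_safety": [(True, ["A", "B"])]},
    "Helpfulness": {"relevance": [(True, ["B"]), (False, [])]},
}

draft_scores = principle_scores(draft)
revision_scores = principle_scores(revision)

# Per-principle improvement of the revision over the draft.
improvement = {p: revision_scores[p] - draft_scores[p] for p in draft_scores}
```

Keeping the per-principle scores separate until the last step is what makes the signal inspectable: in this toy case the improvement comes entirely from Harmlessness, while Helpfulness is unchanged. How the per-principle scores are combined into the single reward used by GRPO is not specified in the text above, so the sketch stops before that step.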