ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models

Somnath Banerjee, Sayan Layek, Sayantan Adak, Mykola Pechenizkiy, Animesh Mukherjee, Rima Hazra

TL;DR

ProSocialAlign introduces a test-time, parameter-efficient framework that enforces lexicographic safety by first removing harm with a directional harm vector and then steering outputs toward five prosocial attributes via a joint autoregressive reward model conditioned on user preferences. The approach uses a harm-direction subtraction (DiReg) and a PBLoRA-based Pv-Arm with gradient-conflict projection, enabling multi-attribute guidance without retraining the base language model. Empirical results across diverse safety benchmarks show state-of-the-art reductions in unsafe leakage and improved alignment to human values, with strong gains in MIP, winrates, and Pareto-front coverage. This work offers a modular, scalable pathway for context-sensitive, safe, and human-aligned generation at inference time.

Abstract

Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
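The two mechanisms named in the abstract can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. The scaling factor `alpha`, the flat parameter vectors, and all function names are hypothetical. The conflict rule shown is a PCGrad-style projection, swapped in here as one common way to resolve gradient conflicts; the exact rule used by ProSocialAlign may differ.

```python
import numpy as np


def subtract_harm_direction(theta, harm_vector, alpha=1.0):
    """Directional regulation sketch: move parameters away from a harm direction.

    theta and harm_vector are flat arrays of the same shape.
    alpha is a hypothetical strength coefficient (an assumption).
    """
    return theta - alpha * harm_vector


def project_conflicting(g_i, g_j):
    """If g_i conflicts with g_j (negative dot product), remove from g_i
    its component along g_j; otherwise return g_i unchanged."""
    dot = float(np.dot(g_i, g_j))
    if dot < 0.0:
        # dot < 0 implies g_j is non-zero, so the division is safe.
        return g_i - (dot / float(np.dot(g_j, g_j))) * g_j
    return g_i


def combine_attribute_gradients(grads):
    """Project each attribute gradient against every other one, then sum."""
    projected = []
    for i, g in enumerate(grads):
        g_proj = g.copy()
        for j, other in enumerate(grads):
            if i != j:
                g_proj = project_conflicting(g_proj, other)
        projected.append(g_proj)
    return np.sum(projected, axis=0)


if __name__ == "__main__":
    theta = np.array([1.0, 2.0, 3.0])
    harm = np.array([0.0, 1.0, 0.0])
    print(subtract_harm_direction(theta, harm, alpha=0.5))

    g_empathy = np.array([1.0, 0.0])   # hypothetical attribute gradients
    g_safety = np.array([-1.0, 1.0])
    print(combine_attribute_gradients([g_empathy, g_safety]))
```

After projection, each attribute gradient has no negative component along the gradient it conflicted with, so a joint update for one attribute does not directly undo progress on another.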

Paper Structure

This paper contains 29 sections, 16 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Training of Pv-Arm. The base reward model parameters ($\theta_r$) are frozen; only the PBLoRA parameters $\delta=\{A_1,A_2,B_1,B_2,W,\zeta\}$ in $\theta_r' = \{\theta_r \cup \delta\}$ are learnt.
  • Figure 1: Winrate and attribute-wise scores.
  • Figure 2: Empirical Pareto fronts on pairs of prosocial attributes. ProSocialAlign forms the outer frontier across most trade-offs, reflecting alignment to different preference vectors rather than scalarized objectives.
  • Figure 3: Attribute scores for other datasets.
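
An empirical Pareto front such as the one described for Figure 2 can be read as the set of non-dominated score pairs. The sketch below shows one way to extract that set. It assumes higher scores are better on both attributes, and the function name and sample scores are hypothetical, not taken from the paper.

```python
import numpy as np


def pareto_front(points):
    """Return indices of non-dominated rows, assuming higher is better on every axis.

    A row p is dominated if some other row is >= p on all axes
    and strictly > p on at least one axis.
    """
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        at_least_as_good = np.all(pts >= p, axis=1)
        strictly_better = np.any(pts > p, axis=1)
        if not np.any(at_least_as_good & strictly_better):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Hypothetical (attribute A, attribute B) scores for four responses.
    scores = [[0.9, 0.2], [0.6, 0.6], [0.2, 0.9], [0.5, 0.5]]
    print(pareto_front(scores))
```

In this sample the last point is dominated by the second one, so it falls inside the frontier rather than on it.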