Aligning Deep Implicit Preferences by Learning to Reason Defensively

Peiming Li, Zhiyuan Hu, Yang Tang, Shiyu Li, Xi Chen

TL;DR

The paper addresses the problem of aligning LLMs with users' deep implicit preferences in ambiguous real-world contexts. It introduces Critique-Driven Reasoning Alignment (CDRA), a process-centric framework with three components: the DeepPref dataset, the Pers-GenPRM reward model, and Critique-Driven Policy Alignment (CDPA). By training models to reason through latent user intents with explicit critiques and step-wise rewards, CDRA achieves state-of-the-art performance in deep preference understanding and defensive reasoning while maintaining adherence to explicit user instructions. The approach yields interpretable alignment signals and reports improvements on both dataset benchmarks and real-world reasoning tasks; reproducibility resources and ethical safeguards are also discussed. The work moves personalized alignment from surface-level mimicry toward reasoning that can better handle ambiguity and risk in user interactions.

Abstract

Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, current methods face a dual challenge: they fail to infer users' deep implicit preferences (including unstated goals, semantic context and risk tolerances), and they lack the defensive reasoning required to navigate real-world ambiguity. This cognitive gap leads to responses that are superficial, brittle and short-sighted. To address this, we propose Critique-Driven Reasoning Alignment (CDRA), which reframes alignment from a scalar reward-matching task into a structured reasoning process. First, to bridge the preference inference gap, we introduce the DeepPref benchmark. This dataset, comprising 3000 preference-query pairs across 20 topics, is curated by simulating a multi-faceted cognitive council that produces critique-annotated reasoning chains to deconstruct query semantics and reveal latent risks. Second, to instill defensive reasoning, we introduce the Personalized Generative Process Reward Model (Pers-GenPRM), which frames reward modeling as a personalized reasoning task. It generates a critique chain to evaluate a response's alignment with user preferences before outputting a final score based on this rationale. Finally, this interpretable, structured reward signal guides the policy model through Critique-Driven Policy Alignment, a process-level online reinforcement learning algorithm that integrates both numerical and natural language feedback. Experiments demonstrate that CDRA excels at discovering and aligning with users' true preferences while executing robust reasoning. Our code and dataset are available at https://github.com/Zephyrian-Hugh/Deep-pref.
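
As a rough illustration of the critique-then-score idea described in the abstract, the sketch below represents a critique chain as a list of per-step records and derives both a numerical reward and a natural-language feedback string from it. Everything here is hypothetical: the `StepCritique` class, the helper names, the [0, 1] score range, the mean aggregation, and the 0.5 threshold are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepCritique:
    step: str       # one reasoning step of the policy's response
    critique: str   # natural-language critique of that step
    score: float    # step-wise reward, assumed to lie in [0, 1]


def aggregate_reward(chain: List[StepCritique]) -> float:
    """Derive a final score from the chain (mean is an assumption)."""
    if not chain:
        raise ValueError("critique chain must not be empty")
    return sum(c.score for c in chain) / len(chain)


def language_feedback(chain: List[StepCritique], threshold: float = 0.5) -> str:
    """Collect the critiques of low-scoring steps as textual feedback."""
    weak = [c.critique for c in chain if c.score < threshold]
    return "\n".join(weak)


# Hand-written toy chain; a real chain would be generated by a reward model.
chain = [
    StepCritique("Recommend a peanut sauce.",
                 "Overlooks a likely allergy risk.", 0.0),
    StepCritique("Offer a nut-free alternative.",
                 "Proactively mitigates the risk.", 1.0),
]
final_score = aggregate_reward(chain)
feedback = language_feedback(chain)
```

Keeping the critiques next to the scores is what would let a policy update consume both kinds of signal: the number for optimization and the text for revising weak steps.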

Paper Structure

This paper contains 34 sections, 8 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: (a) Problem Formulation: Optimizing for outcomes rather than the reasoning process creates the dual preference and process gaps. (b) Comparison of Alignment Paradigms: Standard, outcome-based approaches (left) exemplify the problem of superficial preference matching. In contrast, our CDRA (right) shifts the paradigm to be process-driven and explicitly bridges both gaps.
  • Figure 2: Overview of the CDRA Framework. The process consists of three main stages: (1) DeepPref Dataset Construction; (2) Personalized Reward Modeling; and (3) Critique-Driven Policy Alignment. Stages (2) and (3) are illustrated in detail in Figure 3.
  • Figure 3: Personalized Reward Modeling (Section 3.1.2): Pers-GenPRM generates a reflective chain of critiques based on whether each step of a response infers the user's deep implicit preferences and proactively mitigates potential risks. It then derives step-wise reward scores from these critiques. Critique-Driven Policy Alignment (Section 3.1.3): The policy model is first aligned using Rejection-sampling Fine-Tuning. Subsequently, it incorporates the process-level supervision rewards from Pers-GenPRM into its reward signal for further alignment.
  • Figure 4: Comprehensive performance comparison. Our CDRA (shown in orange) achieves the largest coverage area on the radar chart, signifying its dominant and well-rounded performance across all evaluation dimensions. It establishes a new state-of-the-art in deep preference understanding and defensive reasoning, while also maintaining top-tier accuracy in explicit preference following. For all axes, a higher value (further from the center) indicates better performance. Error-based metrics are inverted for consistent visualization.
  • Figure 5: A qualitative comparison showing CDRA reasoning about latent intent.
  • ...and 10 more figures
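
The Figure 3 caption mentions Rejection-sampling Fine-Tuning as the first alignment step. A minimal sketch of the generic technique is shown below: score a pool of sampled responses and keep only those that clear a reward threshold as fine-tuning data. The function names, the threshold, and the keyword-based `toy_reward` are hypothetical placeholders; the paper's actual sampling and filtering criteria are not given in this summary.

```python
from typing import Callable, List, Tuple


def rejection_sample(
    candidates: List[str],
    reward_fn: Callable[[str], float],
    min_reward: float,
) -> List[Tuple[str, float]]:
    """Keep candidates whose reward meets the threshold, best first."""
    scored = [(c, reward_fn(c)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= min_reward]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def toy_reward(response: str) -> float:
    # Hypothetical stand-in for a learned reward model.
    return 1.0 if "risk" in response else 0.0


# Pretend these were sampled from the policy for a single query.
candidates = [
    "Option A is the cheapest.",
    "Option B is cheap, but note the risk of hidden fees.",
]
sft_data = rejection_sample(candidates, toy_reward, min_reward=0.5)
```

The retained pairs would then serve as supervised fine-tuning targets before any reinforcement learning stage.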