
Dual Optimal: Make Your LLM Peer-like with Dignity

Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang

Abstract

Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset, which features a compositional partial-order structure over multiple persona preferences. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders such as judge biases. Extensive empirical studies demonstrate that our approach successfully builds an LLM agent that is both dignified and peer-like.
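The "tolerant constrained Lagrangian DPO" idea from the abstract can be illustrated with a minimal numerical sketch. Everything below is our own illustrative assumption, not the paper's implementation: the function names, the per-dimension tolerance values, and the dual-ascent step size are all hypothetical; only the general pattern (standard DPO loss per persona dimension, non-negative multipliers raised when a dimension's loss exceeds its tolerance) reflects the abstract's description.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: -log sigmoid of the beta-scaled log-ratio
    margin between the policy and the frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

def dual_ascent(lambdas, losses, tolerances, eta=0.05):
    """Raise the multiplier of any persona dimension whose loss exceeds
    its tolerance; shrink it otherwise. Multipliers stay non-negative, so
    dimensions that remain within tolerance keep weight zero."""
    return np.maximum(0.0, lambdas + eta * (losses - tolerances))

# Toy example: three persona dimensions, one of which violates its tolerance.
losses = np.array([0.9, 0.4, 0.2])   # per-dimension DPO losses (assumed values)
tol    = np.full(3, 0.5)             # "tolerant" slack on each constraint
lam    = np.zeros(3)
for _ in range(100):
    lam = dual_ascent(lam, losses, tol)

# Penalized objective: mean loss plus multiplier-weighted constraint violations.
total = losses.mean() + np.dot(lam, losses - tol)
```

Under this update only the violating dimension accumulates a positive multiplier, so the combined objective automatically shifts weight toward the persona dimension that is collapsing, which is the balancing behavior the abstract attributes to the algorithm.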

Paper Structure

This paper contains 50 sections, 2 theorems, 17 equations, 19 figures, 11 tables, 1 algorithm.

Key Result

Proposition F.1

The squared gradient norm of Simple DPO under uniform averaging.

Figures (19)

  • Figure 1: Failure vs. desired behavior in one example scenario (more in \ref{fig:model_cards}).
  • Figure 2: Construction pipeline of PersonaKnob: from masked sampling to final human-in-the-loop review.
  • Figure 3: An illustrative instance of PersonaKnob, consisting of one reference (golden) response and three negative (rejected) responses generated under the active traits {A,E,C}. Note that the displayed responses are abstracted from the full original outputs, as they are too long to include.
  • Figure 4: Distribution statistics of PersonaKnob.
  • Figure 5: The three-stage evaluation based on MFRM. The process transitions from multiple-judge rubric annotation to multi-facet calibration, ultimately synthesizing metrics equipped with Fisher Uncertainty, resulting in Peer ($\hat{P}$) and Dignity ($\hat{D}$) scores.
  • ...and 14 more figures
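The MFRM-based calibration shown in Figure 5 can be sketched in miniature. This is a generic many-facet Rasch model, not the paper's protocol: the parameter values and function names below are hypothetical, and only the structural point (the log-odds of a positive rating decompose into model ability minus item difficulty minus judge severity, so judge bias can be estimated and removed) is standard psychometrics.

```python
import numpy as np

def mfrm_logit(ability, difficulty, severity):
    """Many-facet Rasch model: log-odds that a judge awards a positive
    rating, decomposed into latent model ability, item difficulty, and
    judge severity (the confound the calibration separates out)."""
    return ability - difficulty - severity

def rating_prob(ability, difficulty, severity):
    """Probability of a positive rating under the MFRM logit."""
    z = mfrm_logit(ability, difficulty, severity)
    return 1.0 / (1.0 + np.exp(-z))

# The same model (ability 0.8) on the same item (difficulty 0.2) gets a
# higher raw score from a lenient judge than from a strict one.
p_strict  = rating_prob(0.8, 0.2, +0.5)
p_lenient = rating_prob(0.8, 0.2, -0.5)
```

Because raw judge scores conflate all three facets, comparing `p_strict` with `p_lenient` shows why uncalibrated LLM-as-judge scores are biased, and why the paper reports calibrated $\hat{P}$ and $\hat{D}$ estimates instead.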

Theorems & Definitions (4)

  • Proposition F.1: Gradient norm under uniform averaging
  • Proof of Proposition F.1
  • Proposition F.2: Gradient norm under Lagrangian reweighting
  • Proof of Proposition F.2