P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Ze Xu, Fei Huang, Kai Zhang, Yongbin Li

TL;DR

P-GenRM is proposed as the first Personalized Generative Reward Model with test-time user-based scaling. It transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across scenarios, and it introduces a dual-granularity scaling mechanism.

Abstract

Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) they oversimplify diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) they struggle to generalize to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
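
The dual-granularity idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain averaging, and the `prototype_weight` mixing parameter are all assumptions made for illustration, since the abstract does not specify how scores are aggregated. The sketch assumes each sampled scoring scheme has already produced a numeric score for a candidate response.

```python
from statistics import mean


def individual_level_score(own_scheme_scores):
    """Aggregate the scores one response received under several scoring
    schemes sampled for the same user (assumed here: a plain average)."""
    return mean(own_scheme_scores)


def dual_granularity_score(own_scheme_scores, prototype_scores, prototype_weight=0.3):
    """Blend the user's own aggregated score with scores derived from
    similar users in the same prototype.

    prototype_weight is a hypothetical mixing coefficient, not a value
    taken from the paper.
    """
    own = individual_level_score(own_scheme_scores)
    if not prototype_scores:
        # No similar users available: fall back to the individual level.
        return own
    proto = mean(prototype_scores)
    return (1 - prototype_weight) * own + prototype_weight * proto


def preferred_response(scores_a, scores_b):
    """Return 'A' or 'B' depending on which response scores higher.
    Each argument is a (own_scheme_scores, prototype_scores) pair."""
    a = dual_granularity_score(*scores_a)
    b = dual_granularity_score(*scores_b)
    return "A" if a >= b else "B"
```

Averaging over several sampled schemes is one simple way to damp noise in any single inferred preference, and the prototype term gives a user with little feedback something to lean on; the actual aggregation rule in P-GenRM may differ.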

Paper Structure (33 sections, 9 equations, 15 figures, 15 tables)

Figures (15)

  • Figure 1: Workflow of P-GenRM. P-GenRM infers a scenario-specific user persona and preference analysis from hybrid preference signals, generates dynamic scoring rubrics, and assesses candidate responses accordingly. At test time, P-GenRM can aggregate multiple individual-level scoring schemes and incorporate similar users’ preferences to improve scoring accuracy and generalization.
  • Figure 2: (a) The three-stage training framework of P-GenRM. (b) An illustration of the personalized evaluation chain, showing how preference modeling and derived scoring schemes lead to interpretable, criterion-weighted judgments for responses.
  • Figure 3: Determination of prototype numbers and their effect on scaling performance. Left: retained variance ratio as a function of the number of singular vectors on Chatbot Arena and PRISM. Right: performance of P-GenRM with different prototype numbers.
  • Figure 4: Visualization of user–prototype distributions and representative preference patterns. Blue highlights show shared intra-group preferences; red highlights show individual diversity. Distinct clusters capture inter-group heterogeneity, validating prototype-based modeling.
  • Figure 5: The number of samples assigned to each prototype and the corresponding performance of P-GenRM across them.
  • ...and 10 more figures
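
The "retained variance ratio as a function of the number of singular vectors" in the Figure 3 caption can be sketched as follows. This assumes the common PCA-style convention that the variance retained by the top k singular vectors is the sum of their squared singular values divided by the total; the paper may define it differently. The function names and the 0.9 threshold are hypothetical, and the singular values are taken as given rather than computed from data.

```python
def retained_variance_ratio(singular_values):
    """Return the cumulative retained variance ratio for k = 1..n,
    assuming variance is proportional to squared singular values."""
    ordered = sorted(singular_values, reverse=True)
    squared = [s * s for s in ordered]
    total = sum(squared)
    if total == 0:
        raise ValueError("all singular values are zero")
    ratios = []
    running = 0.0
    for sq in squared:
        running += sq
        ratios.append(running / total)
    return ratios


def smallest_k_for_threshold(singular_values, threshold=0.9):
    """Smallest number of singular vectors whose retained variance
    ratio reaches the (hypothetical) threshold."""
    for k, ratio in enumerate(retained_variance_ratio(singular_values), start=1):
        if ratio >= threshold:
            return k
    return len(singular_values)
```

A curve like this is often read for an elbow or a fixed threshold when choosing how many components, or here prototypes, to keep.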