LoRe: Personalizing LLMs via Low-Rank Reward Modeling

Avinandan Bose, Zhihan Xiong, Yuejie Chi, Simon Shaolei Du, Lin Xiao, Maryam Fazel

TL;DR

LoRe addresses personalized LLM alignment when user preferences vary widely by replacing a single monolithic reward model with a low-rank reward basis. It learns a shared $B$-dimensional basis $\mathbf{R}_\phi$ and per-user weights $\mathbf{w}_i\in\Delta^{B-1}$, so an unseen user can be handled by fitting only a new weight vector from a few examples. The basis and weights are trained jointly with a collaborative-ranking objective under a Bradley-Terry (BT) style likelihood, and the same formulation extends naturally to steerable multi-objective alignment for personalized generation. Empirical results on diverse datasets demonstrate superior unseen-user generalization and parameter efficiency compared to baselines like BT, VPL, and PAL, highlighting LoRe's scalability and practicality for real-world deployment.
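To make the summary above concrete, the following is a minimal sketch of how a personalized reward and a BT-style preference probability could be computed from a reward basis. It assumes the personalized reward is the inner product of the user's weights with the $B$ latent rewards, which is how the summary describes the combination. The function names and the numeric basis outputs are hypothetical and are not taken from the paper's code.

```python
import math

def personalized_reward(weights, basis_rewards):
    """Combine B latent rewards with one user's weight vector (dot product)."""
    if len(weights) != len(basis_rewards):
        raise ValueError("weights and basis_rewards must both have length B")
    return sum(w * r for w, r in zip(weights, basis_rewards))

def preference_probability(weights, basis_chosen, basis_rejected):
    """BT-style probability that this user prefers `chosen` over `rejected`."""
    margin = (personalized_reward(weights, basis_chosen)
              - personalized_reward(weights, basis_rejected))
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical outputs of a B = 3 reward basis for two candidate responses.
basis_a = [2.0, 0.0, 1.0]
basis_b = [0.0, 2.0, 1.0]

# Two illustrative users; each weight vector lies on the simplex
# (non-negative entries that sum to 1).
user_1 = [0.8, 0.1, 0.1]
user_2 = [0.1, 0.8, 0.1]

p1 = preference_probability(user_1, basis_a, basis_b)
p2 = preference_probability(user_2, basis_a, basis_b)
print(f"user_1 prefers response a with probability {p1:.3f}")
print(f"user_2 prefers response a with probability {p2:.3f}")
```

The same pair of responses is ranked differently by the two users even though the basis outputs are shared, which is the behavior a single pooled reward model cannot express.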

Abstract

Personalizing large language models (LLMs) to accommodate diverse user preferences is essential for enhancing alignment and user satisfaction. Traditional reinforcement learning from human feedback (RLHF) approaches often rely on monolithic value representations, limiting their ability to adapt to individual preferences. We introduce a novel framework that leverages low-rank preference modeling to efficiently learn and generalize user-specific reward functions. By representing reward functions in a low-dimensional subspace and modeling individual preferences as weighted combinations of shared basis functions, our approach avoids rigid user categorization while enabling scalability and few-shot adaptation. We validate our method on multiple preference datasets, demonstrating superior generalization to unseen users and improved accuracy in preference prediction tasks.

Paper Structure

This paper contains 13 sections, 21 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Typically, preference data from diverse users is pooled together to train a single reward model for everyone. LoRe introduces a more flexible approach by collaboratively learning a shared reward basis from user data. Instead of producing a single reward, this basis generates $B$ latent rewards, which can be combined using a $B$-dimensional weight vector unique to each user to produce personalized rewards. This allows for seamless personalization with minimal effort. For new users, only the user weights need to be learned from few-shot examples, while keeping the reward basis fixed, enabling an efficient and lightweight personalized reward model.
  • Figure 2: We vary the number of few-shot samples and repeat each experiment 20 times, randomly subsampling different examples in each run. The plot reports the average performance (unseen accuracy) along with standard deviations. Notably, VPL, which infers the latent code from few-shot examples without the ability to relearn it, shows limited improvement as the number of examples increases. While PAL exhibits some gains, our algorithm's performance improves significantly faster in comparison.
  • Figure 3: Trainable parameter count vs. number of seen users (log scale). LoRe scales significantly more efficiently than PAL and VPL as the number of users increases. Unlike VPL and PAL, which rely on large MLPs and high-dimensional prototype representations, LoRe uses a lightweight linear projection with shared basis vectors, resulting in dramatically fewer parameters while retaining personalization capabilities.
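The few-shot adaptation described in the Figure 1 caption (learn only the user weights, keep the reward basis fixed) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the weights are kept on the simplex through a softmax over unconstrained logits, and they are fit by plain gradient descent on the BT negative log-likelihood. Both choices are assumptions made here for simplicity. The function names, hyperparameters, and example data are hypothetical.

```python
import math

def softmax(logits):
    """Map unconstrained logits to a weight vector on the simplex."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fit_user_weights(reward_diffs, steps=500, lr=0.5):
    """Fit simplex weights for one new user while the reward basis stays frozen.

    reward_diffs holds one B-dimensional vector per preference pair:
    basis(chosen) - basis(rejected), precomputed from the fixed basis.
    """
    num_basis = len(reward_diffs[0])
    logits = [0.0] * num_basis  # start from uniform weights
    for _ in range(steps):
        weights = softmax(logits)
        grad = [0.0] * num_basis
        for diff in reward_diffs:
            margin = sum(w * d for w, d in zip(weights, diff))
            # Derivative of -log(sigmoid(margin)) w.r.t. margin: sigmoid(margin) - 1.
            coeff = 1.0 / (1.0 + math.exp(-margin)) - 1.0
            for j in range(num_basis):
                # Chain rule through softmax: d(margin)/d(logit_j) = w_j * (diff_j - margin).
                grad[j] += coeff * weights[j] * (diff[j] - margin)
        logits = [z - lr * g / len(reward_diffs) for z, g in zip(logits, grad)]
    return softmax(logits)

# Hypothetical few-shot data for B = 3: in every pair, the chosen response
# scores highest relative to the rejected one on the second latent reward.
few_shot_diffs = [
    [-1.0, 2.0, 0.0],
    [0.5, 1.5, -0.5],
    [-2.0, 1.0, 0.3],
    [0.0, 2.5, -1.0],
]
weights = fit_user_weights(few_shot_diffs)
print([round(w, 3) for w in weights])
```

Only the $B$ weights are trainable for the new user in this sketch, so adaptation involves no gradient computation through the basis itself.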