A Shared Low-Rank Adaptation Approach to Personalized RLHF

Renpu Liu, Peng Wang, Donghao Li, Cong Shen, Jing Yang

TL;DR

The paper tackles personalization in RLHF by modeling multiple users' reward functions through a low-rank adaptation framework. It introduces P-ShareLoRA, which learns a shared low-rank basis $\boldsymbol{B}$ and user-specific adapters $\boldsymbol{W}_i$, ensuring the aggregated adaptation $\boldsymbol{\triangle}\boldsymbol{\theta}$ has rank at most $k$ while constraining individual norms. The authors prove theoretical guarantees, including a bound on the subspace distance between the learned and optimal representations and a reduced bracketing-number complexity relative to full fine-tuning, yielding improved sample complexity especially when $d_1 frac{k}$. They also provide bounds on per-user and averaged value-function gaps under a pessimistic/robust optimization scheme and demonstrate empirical gains on Reddit TL;DR with GPT-J 6B and Llama-3 8B across multiple labelers. The work shows that leveraging shared structure in personalized RLHF can achieve near-optimal performance with fewer samples and computational resources, improving alignment quality for diverse user preferences in real-world settings.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for aligning artificial intelligence systems with human values, achieving remarkable success in fine-tuning large language models. However, existing RLHF frameworks often assume that human preferences are relatively homogeneous and can be captured by a single, unified reward model. This assumption overlooks the inherent diversity and heterogeneity across individuals, limiting the adaptability of RLHF to personalized scenarios and risking misalignments that can diminish user satisfaction and trust in AI systems. In this paper, we address these challenges by introducing Low-Rank Adaptation (LoRA) into the personalized RLHF framework. We apply LoRA in the aggregated parameter space of all personalized reward functions, thereby enabling efficient learning of personalized reward models from potentially limited local datasets. Our approach exploits potential shared structures among the local ground-truth reward models while allowing for individual adaptation, without relying on restrictive assumptions about shared representations as in prior works. We further establish sample complexity guarantees for our method. Theoretical analysis demonstrates the effectiveness of the proposed approach in capturing both shared and individual-specific structures within heterogeneous human preferences, addressing the dual challenge of personalization requirements and practical data constraints. Experimental results on real-world datasets corroborate the efficiency of our algorithm in the personalized RLHF setting.
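To make the idea of applying LoRA in the aggregated parameter space concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each user's adaptation factors as a shared basis times a user-specific adapter and that the per-user adaptations are stacked column-wise. The dimension `d2`, the helper names, and the random initialization are hypothetical choices made only for illustration.

```python
import numpy as np

def build_shared_lora(d1, d2, k, num_users, seed=0):
    """Draw a shared basis B (d1 x k) and one adapter W_i (k x d2) per user.

    Random Gaussian entries are an illustrative assumption, not a training rule.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((d1, k))
    W = [rng.standard_normal((k, d2)) for _ in range(num_users)]
    return B, W

def aggregated_adaptation(B, W):
    """Stack per-user adaptations B @ W_i column-wise (assumed layout).

    Since [B W_1, ..., B W_N] = B [W_1, ..., W_N], the rank is at most k.
    """
    return np.concatenate([B @ W_i for W_i in W], axis=1)

B, W = build_shared_lora(d1=32, d2=8, k=3, num_users=5)
delta_theta = aggregated_adaptation(B, W)

agg_rank = np.linalg.matrix_rank(delta_theta)
shared_params = B.size + sum(W_i.size for W_i in W)  # parameters actually stored
full_params = delta_theta.size                       # unconstrained per-user updates
print(agg_rank, shared_params, full_params)
```

In this toy setting the factored form stores far fewer parameters than unconstrained per-user updates, which gives some intuition for why sharing the basis can help when each user's local dataset is small.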

Paper Structure

This paper contains 27 sections, 23 theorems, 134 equations, 3 figures, 4 algorithms.

Key Result

Theorem 2.1

(Closeness between $\widehat{\mathbf{B}}$ and $\mathbf{B}^\diamond$). Suppose the reward assumption holds. For any $\delta \in (0,1]$, with probability at least $1 - \delta$, it holds that where $c_1 > 0$ is a constant.
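As an illustration of what closeness between two learned bases can mean, here is a small sketch of a principal angle distance between column spaces. It uses the common definition $\|(\mathbf{I} - \mathbf{U}\mathbf{U}^\top)\mathbf{V}\|_2$ for orthonormal bases $\mathbf{U}$ and $\mathbf{V}$; whether this matches the paper's Definition A.1 exactly is an assumption. The function names are hypothetical, and both inputs are assumed to have full column rank.

```python
import numpy as np

def orthonormal_basis(M):
    """Orthonormal basis for the column space of M (assumes full column rank)."""
    Q, _ = np.linalg.qr(M)  # reduced QR: Q has the same shape as M when rows >= cols
    return Q

def principal_angle_distance(B_hat, B_ref):
    """Spectral norm of the part of span(B_ref) lying outside span(B_hat)."""
    U = orthonormal_basis(B_hat)
    V = orthonormal_basis(B_ref)
    residual = V - U @ (U.T @ V)  # (I - U U^T) V without forming the d x d projector
    return np.linalg.norm(residual, ord=2)

# Toy check in R^4 with 2-dimensional subspaces.
B_ref = np.eye(4)[:, :2]                  # span{e1, e2}
R = np.array([[2.0, 1.0], [0.0, 3.0]])    # invertible change of basis
B_same = B_ref @ R                        # same span, different coordinates
B_orth = np.eye(4)[:, 2:]                 # span{e3, e4}

dist_same = principal_angle_distance(B_same, B_ref)
dist_orth = principal_angle_distance(B_orth, B_ref)
print(dist_same, dist_orth)
```

The distance depends only on the subspaces, not on the particular basis matrices, which is why a reparameterized basis such as `B_same` is treated as identical to `B_ref`.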

Figures (3)

  • Figure 1: Prediction Accuracy of Different Algorithms.
  • Figure 2: Accuracies of Different Methods Across Labelers (Llama3 8B)
  • Figure 3: Compare Accuracy between Share A and Share B

Theorems & Definitions (31)

  • Definition 2.1: Diversity Metrics
  • Remark 1
  • Definition 2.2: Bracketing Number for Reward Vectors [park2024rlhf]
  • Theorem 2.1
  • Remark 2
  • Proposition 1
  • Theorem 2.2
  • Corollary 2.1
  • Remark 3: Sample Complexity
  • Definition A.1: Principal Angle Distance [jain2013low]
  • ...and 21 more