Personalized LLM Decoding via Contrasting Personal Preference

Hyungjune Bu, Chanjoo Jung, Minjae Kang, Jaehyung Kim

TL;DR

This paper proposes CoPe (Contrasting Personal Preference), a novel decoding-time approach to LLM personalization. CoPe is applied after parameter-efficient fine-tuning on user-specific data and uses reward-guided decoding to maximize each user's implicit reward signal.

Abstract

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.
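The core idea of contrasting a personalized model with a generic one at decoding time can be sketched as follows. This is a minimal illustration of one common contrastive scoring rule, not necessarily the exact formula used in the paper: the functions `contrastive_scores` and `pick_token`, the scoring form `log p_user + alpha * (log p_user - log p_base)`, and the toy logits are all assumptions made for illustration. The term `log p_user - log p_base` plays the role of an implicit per-token reward, and `alpha` plays the role of a contrastive strength.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(user_logits, base_logits, alpha=1.0):
    """Score each candidate token (hypothetical scoring rule).

    score = log p_user + alpha * (log p_user - log p_base)
    The bracketed difference acts as an implicit reward: it is large for
    tokens the personalized model prefers more than the generic model does.
    """
    lp_user = log_softmax(user_logits)
    lp_base = log_softmax(base_logits)
    return [u + alpha * (u - b) for u, b in zip(lp_user, lp_base)]


def pick_token(user_logits, base_logits, alpha=1.0):
    """Greedy choice: return the index of the highest-scoring token."""
    scores = contrastive_scores(user_logits, base_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens (made-up logits).
user_logits = [2.0, 1.9, 0.0]   # personalized model
base_logits = [3.0, 0.0, 0.0]   # generic model strongly prefers token 0

plain_choice = pick_token(user_logits, base_logits, alpha=0.0)
contrast_choice = pick_token(user_logits, base_logits, alpha=1.0)
```

With `alpha=0.0` the rule reduces to ordinary greedy decoding from the personalized model. With `alpha=1.0` the choice shifts toward the token that the personalized model favors far more than the generic model does. Practical contrastive decoding schemes usually also restrict the candidates to tokens the personalized model considers plausible; that step is omitted here for brevity.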

Paper Structure

This paper contains 35 sections, 12 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Implicit reward maximization via contrastive preference. Under an implicit reward model that leverages the interaction between a personalized and a non-personalized generic model, generated texts better align with user preferences. The highlighted text marks words that overlap with the gold answer.
  • Figure 2: Illustration of CoPe (Contrasting Preference for Personalized LLM Decoding). The training pipeline (left) builds an expert user model via Direct Preference Optimization (DPO) with synthetic negatives. The reward-guided decoding method (right) contrasts this user model with a base model at the token level, maximizing implicit user reward during both training and decoding for improved personalization.
  • Figure 3: Different hyperparameters. (a) Performance variation by base model choice. (b) Effect of contrastive strength $\alpha$. (c) Effect of KL regularization $\beta$ in DPO. ROUGE-1 and ROUGE-L scores are reported.
  • Figure 4: A qualitative example of CoPe on the News Headline Generation task (LaMP 4). The output of CoPe contains more words that align with the user gold response compared to TAM and OPPU. Words overlapping with the user’s answer are highlighted, and tokens that CoPe uniquely emphasizes for personalization, which are not captured by other baselines, are boxed. More qualitative examples from other tasks are provided in Appendix \ref{app:more_qual}.
  • Figure 5: Example illustrating perplexity differences between gold and generated text. A user's gold text often contains spelling variations, colloquial expressions, and conversational truncations, which are harder for a reference LM to predict, resulting in higher perplexity. In contrast, model-generated text tends to be more regular and thus achieves lower perplexity.
  • ...and 4 more figures
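
The Figure 2 and Figure 3 captions mention Direct Preference Optimization (DPO) with synthetic negatives and a KL regularization weight $\beta$. As a rough sketch of what that objective looks like for a single preference pair, the standard DPO loss is shown below. The function name `dpo_loss`, its argument names, and the toy log-probabilities are hypothetical; treating the user's own text as the preferred response and a synthetic negative as the rejected one follows the caption, but the details of how the paper builds those pairs are not shown here.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a summed sequence log-probability. The policy is the
    model being trained; the reference is the frozen starting model.
    """
    # Implicit rewards: beta * log(pi_policy / pi_reference) for each response.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy numbers: the policy has not moved from the reference, so the margin is 0.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)

# The policy now assigns more probability to the chosen (user) text
# and less to the rejected (synthetic negative) text.
loss_after = dpo_loss(-8.0, -14.0, -10.0, -12.0, beta=0.1)
```

When the policy equals the reference, the loss is `log 2`. It decreases as the policy raises the likelihood of the preferred text relative to the rejected one. A larger `beta` makes the loss more sensitive to the same log-ratio gap.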