FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn

TL;DR

This work tackles the challenge of personalizing LLM outputs by modeling a distribution of user-specific rewards rather than a single population-wide reward. It introduces Few-Shot Preference Optimization (FSPO), a black-box meta-learning framework that rapidly adapts to individual users from few-shot preferences, optionally aided by a user-description Chain-of-Thought to enhance conditioning. To scale personalization, FSPO relies on synthetic data (over $10^6$ diverse, structured preferences) that transfers to real users through domain randomization (Sim2Real). Across three domains and a controlled human study, FSPO achieves strong personalization, including an average Alpaca Eval win rate of $87\%$ on synthetic users and a $72\%$ win rate with real participants, suggesting that synthetic data design is a viable path to inclusive, personalized LLMs.

Abstract

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.
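
The few-shot conditioning described in the abstract can be illustrated with a minimal sketch of how a user's labeled preferences might be packed into a single context ahead of a new query. The `Preference` class, the `build_few_shot_prompt` function, and the prompt template are all hypothetical names and formats chosen for illustration; the paper's actual prompt format is not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Preference:
    """One labeled preference from a user (hypothetical container)."""
    query: str
    chosen: str
    rejected: str


def build_few_shot_prompt(history: List[Preference], query: str) -> str:
    """Concatenate a user's past preferences with the current query.

    The template below is an assumption for illustration only; it is
    not the prompt format used by the FSPO authors.
    """
    blocks = []
    for i, pref in enumerate(history, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Question: {pref.query}\n"
            f"Preferred response: {pref.chosen}\n"
            f"Dispreferred response: {pref.rejected}"
        )
    # The current query goes last so the model completes a response
    # conditioned on everything the user preferred before.
    blocks.append(f"Current question: {query}\nPersonalized response:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    history = [
        Preference("Explain gravity.", "A short analogy.", "A formal derivation."),
        Preference("Explain entropy.", "An everyday example.", "A dense proof."),
    ]
    print(build_few_shot_prompt(history, "Explain inflation."))
```

A model trained on many such contexts, one per user, would see different preference histories for the same query, which is what lets a single set of weights produce user-specific responses.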

Paper Structure

This paper contains 19 sections, 6 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of FSPO. $N$ previously collected preferences are fed into the LLM along with the current query, allowing the LLM to personalize its response to the query using the past preferences.
  • Figure 2: User Description Chain-of-Thought (COT). Prediction is a two-stage process: first predicting a (synthetic) user description from the few-shot preferences and next predicting the response.
  • Figure 3: Overview of Domain Randomization Techniques. View-Conditioning (left) decomposes a given question into multiple viewpoints, allowing for diverse response generation. Iterative Persona Generation (right) improves structure by refining a persona whenever it is too underspecified to make a preference prediction.
  • Figure 4: Flowchart of Roleplay dataset generation: Starting from a set of traits, a seed persona is constructed along with a set of specific questions about that trait. Responses are then generated with View-Conditioning. The seed personas are iteratively refined until they are no longer underspecified. Finally, the refined persona is used to score consistent preferences.
  • Figure 5: Disagreement Matrix across 5 users in Roleplay. Here we plot the disagreement of preferences for 5 users. There is a mix of users with high and low disagreement.
  • ...and 2 more figures
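
As a rough illustration of the kind of quantity Figure 5 plots, the sketch below computes a pairwise disagreement matrix from binary preference labels. It assumes that disagreement between two users is the fraction of shared response pairs on which their labels differ; this definition, the `disagreement_matrix` function, and the toy labels are assumptions for illustration and may not match how the paper computes its matrix.

```python
from typing import List


def disagreement_matrix(labels: List[List[int]]) -> List[List[float]]:
    """Pairwise disagreement between users.

    labels[u][k] is user u's binary choice (0 or 1) on the k-th response
    pair. Entry [i][j] of the result is the fraction of pairs on which
    users i and j chose differently (an assumed definition).
    """
    if not labels or not labels[0]:
        raise ValueError("need at least one user and one response pair")
    n_pairs = len(labels[0])
    if any(len(row) != n_pairs for row in labels):
        raise ValueError("all users must label the same response pairs")

    n_users = len(labels)
    matrix = [[0.0] * n_users for _ in range(n_users)]
    for i in range(n_users):
        for j in range(n_users):
            differing = sum(a != b for a, b in zip(labels[i], labels[j]))
            matrix[i][j] = differing / n_pairs
    return matrix


if __name__ == "__main__":
    # Toy labels: three users, four response pairs (made up for the demo).
    toy_labels = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
    ]
    for row in disagreement_matrix(toy_labels):
        print(row)
```

Under this definition, a mix of low and high off-diagonal entries corresponds to the caption's observation that some user pairs largely agree while others largely disagree.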