Preference-Guided Reflective Sampling for Aligning Language Models

Hai Ye, Hwee Tou Ng

TL;DR

This work introduces Preference-Guided Reflective Sampling (PRS), a tree-based data-generation framework for aligning large language models to human preferences within offline RLHF. PRS integrates initial sampling with language-based feedback and iterative refinements, guided by natural-language preferences, to produce higher-reward data than traditional random sampling. Through offline RL training on instruction following and keyword-focused summarization, PRS achieves superior best-of-$N$ performance and demonstrates robust preference adaptation and toxicity reduction. The approach offers improved sampling efficiency and scalability for policy alignment, with potential for integration into broader RLHF pipelines and future work on reasoning tasks and safety considerations.

Abstract

Iterative data generation and model re-training can effectively align large language models (LLMs) to human preferences. The process of data sampling is crucial, as it significantly influences the success of policy improvement. Repeated random sampling is a widely used method that independently queries the model multiple times to generate outputs. In this work, we propose a more effective sampling method, named Preference-Guided Reflective Sampling (PRS). Unlike random sampling, PRS employs a tree-based generation framework to enable more efficient sampling. It leverages adaptive self-refinement techniques to better explore the sampling space. By specifying user preferences in natural language, PRS can further optimize response generation according to these preferences. As a result, PRS can align models to diverse user preferences. Our experiments demonstrate that PRS generates higher-quality responses with significantly higher rewards. On AlpacaEval and Arena-Hard, PRS substantially outperforms repeated random sampling in best-of-$N$ sampling. Moreover, PRS shows strong performance when applied in iterative offline RL training.

Paper Structure (24 sections, 4 equations, 20 figures, 9 tables, 1 algorithm)

Figures (20)

  • Figure 1: Performance comparison of PRS (ours) and repeated random sampling (Rand) on AlpacaEval v2.0 and Arena-Hard v0.1 using best-of-32 sampling. Each prompt samples 32 responses using Rand or PRS and the response with the highest reward is kept for evaluation.
  • Figure 2: Comparison of repeated random sampling and our method PRS. PRS adopts a tree-based generation framework that learns to adapt and adjust its outputs by reflecting on its already generated data. It can incorporate a specific user preference and optimize responses to align with it; adjusting the preference yields tailored responses. Random sampling, by contrast, generates samples independently and can use the best-of-$N$ (BoN) method to find the best sample. Both methods share the same sampling budget, i.e., they sample the same number of responses for each prompt.
  • Figure 3: PRS: (a) Example: A user requests a brief response with supporting references. The initial response lacks references. After feedback, the revised response includes appropriate references. (b) A preference $\bm{z}$ is added to the input $\bm{x}$. The process begins by sampling $N_0$ initial responses $\mathcal{Y}_0$, from which the optimal response $\bm{y}^*_0$ is selected using a reward model $R$. Then feedback $\bm{f}$ is generated, leading to the sampling of $N_1$ refinements $\mathcal{Y}_1$ to enhance $\bm{y}^*_0$. Finally, $\mathcal{Y}_0$ and $\mathcal{Y}_1$ are merged. Optionally, new refinements may be sampled based on the current best response.
  • Figure 4: Comparing sampling methods. Left: We study the common preference and use the description of Table \ref{tab:preference} to generate detailed and in-depth responses. With 100 random prompts from Alpaca-GPT4, each method samples $N$ responses per prompt (i.e., 8, 16, 32, 64, or 128). The top three highest rewards are averaged for each prompt, leading to an overall average score for the entire evaluation set. The full results of 9 policy models are shown in Fig. \ref{fig:full-compare-sampling}. Middle: The distribution of rewards with $N=128$, where PRS is PRS $(N/2, N/2)$. Right: Summarization results on 100 random documents from CNN / Daily Mail. The policy model is Llama-2-13b+SFT.
  • Figure 5: Offline RL training: Win rates for PRS, PRand, and Rand + p vs. Base + p, evaluated using GPT-4 on a 200-sample AlpacaEval. "+ p" adds common preference in the input.
  • ...and 15 more figures
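
The sampling procedure described in the Figure 3 caption, alongside the best-of-$N$ baseline from Figure 2, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`best_of_n`, `prs_one_round`), the way the preference is concatenated to the prompt, and the toy policy, feedback, refinement, and reward functions are all hypothetical stand-ins for calls to a real language model and reward model.

```python
import random
from typing import Callable, List, Tuple

Sampler = Callable[[str], str]          # prompt -> response (hypothetical interface)
Reward = Callable[[str, str], float]    # (prompt, response) -> scalar reward


def best_of_n(prompt: str, sample: Sampler, reward: Reward, n: int) -> str:
    """Repeated random sampling: draw n independent responses, keep the best."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))


def prs_one_round(
    x: str,
    z: str,
    sample: Sampler,
    give_feedback: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    reward: Reward,
    n0: int,
    n1: int,
) -> Tuple[List[str], str]:
    """One round of the tree-based procedure with a budget of n0 + n1 samples."""
    # Assumption: the preference z is simply appended to the input x.
    prompt = f"{x}\n\nPreference: {z}"
    # Sample N0 initial responses and select the best one with the reward model.
    initial = [sample(prompt) for _ in range(n0)]
    best0 = max(initial, key=lambda y: reward(prompt, y))
    # Generate language feedback on the best response, then sample N1 refinements.
    feedback = give_feedback(prompt, best0)
    refinements = [refine(prompt, best0, feedback) for _ in range(n1)]
    # Merge both sets; the overall best can seed a further round if desired.
    merged = initial + refinements
    return merged, max(merged, key=lambda y: reward(prompt, y))


# Toy stand-ins so the sketch runs without any model (purely illustrative).
rng = random.Random(0)

def toy_sample(prompt: str) -> str:
    return "Short answer." + " More detail." * rng.randint(0, 2)

def toy_feedback(prompt: str, response: str) -> str:
    return "Add a supporting reference."

def toy_refine(prompt: str, response: str, feedback: str) -> str:
    return response + " [ref]" * rng.randint(0, 1)

def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))  # longer is "better" in this toy setting only

merged, best = prs_one_round(
    "Explain RLHF briefly.", "Be brief and cite references.",
    toy_sample, toy_feedback, toy_refine, toy_reward, n0=4, n1=4,
)
```

With `n0 + n1` equal to the `n` passed to `best_of_n`, both procedures use the same sampling budget per prompt, which mirrors the comparison setup stated in the Figure 2 caption. Because the merged set contains the initial samples, the best merged response can never score lower than the best initial one under the same reward function.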