A Systematic Examination of Preference Learning through the Lens of Instruction-Following

Joongwon Kim, Anirudh Goyal, Aston Zhang, Bo Xiong, Rui Hou, Melanie Kambadur, Dhruv Mahajan, Hannaneh Hajishirzi, Liang Tan

TL;DR

The paper studies how attributes of automatically generated preference datasets affect instruction-following alignment in large language models. It builds a synthetic data pipeline that produces 48,000 prompts from combinations of 23 verifiable constraints and compares rejection sampling (RS) with Monte Carlo Tree Search (MCTS) for creating (chosen, rejected) pairs. The experiments yield three findings: shared prefixes in MCTS-generated pairs give marginal but consistent gains and greater training stability; high-contrast pairs generally outperform low-contrast pairs, although mixing the two often performs best; and training on prompts of moderate difficulty generalizes better than training on overly challenging prompts. The work provides a scalable framework and practical guidance for curating preference data to improve LLM alignment and instruction-following capabilities.

Abstract

Preference learning is a widely adopted post-training technique that aligns large language models (LLMs) to human preferences and improves specific downstream task capabilities. In this work, we systematically investigate how specific attributes of preference datasets affect the alignment and downstream performance of LLMs in instruction-following tasks. We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts with combinations of 23 verifiable constraints that enable fine-grained and automated quality assessments of model responses. With our synthetic prompts, we use two preference dataset curation methods, rejection sampling (RS) and Monte Carlo Tree Search (MCTS), to obtain pairs of (chosen, rejected) responses. Then, we perform experiments investigating the effects of (1) the presence of shared prefixes between the chosen and rejected responses, (2) the contrast and quality of the (chosen, rejected) responses, and (3) the complexity of the training prompts. Our experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements and greater stability across challenging training configurations. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance by balancing diversity and learning efficiency. Additionally, training on prompts of moderate difficulty leads to better generalization across tasks, even for more complex evaluation scenarios, compared to overly challenging prompts. Our findings provide actionable insights into optimizing preference data curation for instruction-following tasks, offering a scalable and effective framework for enhancing LLM training and alignment.
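
The rejection sampling step described in the abstract can be illustrated with a minimal sketch. It assumes that the verifier score is the number of constraints a response satisfies and that a pair is kept when the chosen and rejected responses hit target scores (c, r); both are assumptions made for illustration, not details confirmed by this summary. The four toy constraints, the sample outputs, and all function names are hypothetical.

```python
from itertools import product
from typing import Callable, List, Tuple

Constraint = Callable[[str], bool]

# Hypothetical toy constraints; the paper's 23 verifiable constraints are not listed here.
TOY_CONSTRAINTS: List[Constraint] = [
    lambda text: text == text.lower(),         # all lowercase
    lambda text: len(text.split()) <= 8,       # at most 8 words
    lambda text: "tea" in text.lower(),        # mentions the keyword "tea"
    lambda text: text.rstrip().endswith("."),  # ends with a period
]


def verifier_score(response: str, constraints: List[Constraint]) -> int:
    """Assumed verifier: count how many constraints the response satisfies."""
    return sum(1 for check in constraints if check(response))


def rs_preference_pairs(
    responses: List[str],
    constraints: List[Constraint],
    chosen_score: int,
    rejected_score: int,
) -> List[Tuple[str, str]]:
    """Pair every response scoring `chosen_score` with every one scoring `rejected_score`."""
    scored = [(r, verifier_score(r, constraints)) for r in responses]
    chosen = [r for r, s in scored if s == chosen_score]
    rejected = [r for r, s in scored if s == rejected_score]
    return list(product(chosen, rejected))


# Stand-in for N outputs sampled independently from the policy for one prompt.
SAMPLED_OUTPUTS = [
    "green tea is calming.",
    "Coffee Is Great And I Drink It Every Single Morning Without Fail",
    "I like tea",
    "water is fine",
    "Juice",
]

if __name__ == "__main__":
    for chosen, rejected in rs_preference_pairs(SAMPLED_OUTPUTS, TOY_CONSTRAINTS, 4, 1):
        print(f"chosen={chosen!r} rejected={rejected!r}")
```

Under this reading, a larger gap between the two target scores gives a higher-contrast pair, and a smaller gap gives a lower-contrast one.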

Paper Structure

This paper contains 23 sections, 8 equations, 6 figures, 20 tables.

Figures (6)

  • Figure 1: Automatically curating preference pairs via rejection sampling (RS, left) and Monte Carlo Tree Search (MCTS, right). RS: We independently sample $N$ different outputs from the policy, score each output with a verifier and take (high, low) scoring responses as the (chosen, rejected) pairs. MCTS: We perform tree search with the policy while generating multiple actions per search iteration. Then, we use the rollouts from sibling nodes with (high, low) reward scores as the (chosen, rejected) pairs to obtain preference pairs with common prefixes up to the parent nodes.
  • Figure 2: Overview of our pipeline for generating synthetic prompts with verifiable constraints. We first take a set of seed prompts from an existing dataset where the prompts contain constraints, and remove all constraints with an LLM (llama-3.1-70b-instruct) to obtain base prompts corresponding to the original dataset. Next, we randomly sample a small subset of the base prompts and use them as few-shot examples to generate new prompts without any constraints. We remove duplicates among the newly-generated prompts and the existing base prompts using a sentence transformer. Then, we randomly sample a combination of $k \in \{4,5,6\}$ of our verifiable constraints that are non-conflicting and use an LLM to generate the input parameters required for the set of selected constraints. Finally, we use the resulting input kwargs and the new base prompts to generate the final prompts that integrate the constraints in natural language.
  • Figure 3: Number of preference pairs for different correctness filtering criteria at $k=5$. The light blue color indicates preference pairs obtained via rejection sampling (RS), and the dark blue color indicates preference pairs obtained via Monte Carlo Tree Search (MCTS). The left subfigure shows the number of unique prompts with (chosen, rejected) responses associated with each filtering criterion, and the right subfigure shows the total number of preference pairs with (chosen, rejected) responses associated with each filtering criterion.
  • Figure 4: Evaluation results demonstrating the effects of mixing preference pairs with different margins between the (chosen, rejected) responses. The two rows correspond to our training setup with different values of $k$ (number of verifiable constraints in each prompt), and the four columns correspond to our evaluation sets. The x-axis indicates the correctness of the (chosen, rejected) responses with lower-margin pairs mixed in while keeping the same training size. The y-axis represents the accuracies. Results for more experiments are provided in Tables \ref{tab:response_quality_results_mix_k4} and \ref{tab:response_quality_results_mix_k5} in the appendix.
  • Figure 5: Evaluation results for increasing the number of outputs generated for rejection sampling (RS) from $N=4$ to $N=64$, given a set of training prompts with $k=5$ and (c, r) = (4, 1) or (4, 2). We observe a steady improvement in performance as more outputs are generated per prompt until $N=32$, where it begins to saturate or even deteriorate.
  • ...and 1 more figure
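
The constraint-sampling step in the Figure 2 caption (drawing $k \in \{4,5,6\}$ non-conflicting verifiable constraints per prompt) can be sketched as below. The constraint names, the conflict table, and the retry-until-compatible strategy are all hypothetical; this summary does not list the 23 constraints or say how conflicts are detected.

```python
import random
from typing import FrozenSet, List, Set

# Hypothetical constraint names, standing in for the paper's 23 verifiable constraints.
CONSTRAINTS: List[str] = [
    "all_lowercase", "all_uppercase", "max_words", "min_words",
    "include_keyword", "end_with_phrase", "num_paragraphs", "json_format",
    "num_bullets", "title_in_brackets",
]

# Hypothetical pairs of constraints that cannot both be satisfied by one response.
CONFLICTS: Set[FrozenSet[str]] = {
    frozenset({"all_lowercase", "all_uppercase"}),
    frozenset({"json_format", "num_paragraphs"}),
}


def is_compatible(combo: List[str]) -> bool:
    """True when no conflicting pair is fully contained in the combination."""
    selected = set(combo)
    return not any(pair <= selected for pair in CONFLICTS)


def sample_constraints(k: int, rng: random.Random, max_tries: int = 1000) -> List[str]:
    """Draw k distinct constraints, retrying until the combination has no conflict."""
    for _ in range(max_tries):
        combo = rng.sample(CONSTRAINTS, k)
        if is_compatible(combo):
            return combo
    raise RuntimeError(f"no compatible combination of size {k} found")


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (4, 5, 6):
        print(k, sample_constraints(k, rng))
```

Resampling until a compatible set is found keeps the sketch short; a real pipeline might instead build the combination incrementally and skip any candidate that conflicts with constraints already selected.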