On the Role of Difficult Prompts in Self-Play Preference Optimization

Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing

TL;DR

The paper examines the overlooked role of prompts in self-play preference optimization. It introduces the mean reward over $N$ sampled responses as a practical proxy for prompt difficulty and shows that hard prompts (low mean reward) typically underperform easier ones under DPO. Incorporating hard prompts into training offers little to no improvement and can slightly degrade performance, although greater model capacity can close the gap between hard and easy prompts. Several mitigation attempts are tested, including curriculum learning and improved preference pairs, but only pruning the most difficult prompts yields reliable gains and compute savings. The findings emphasize that prompt difficulty interacts with model capacity and suggest adaptive prompt filtering as a practical route to better efficiency and alignment outcomes in self-play pipelines.

Abstract

Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically uses a language model to generate on-policy responses for prompts and a reward model (RM) to guide the selection of chosen and rejected responses, on which the model can then be further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite being a core component in this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We first use the mean reward of $N$ sampled responses of a prompt as a proxy for its difficulty. We find that difficult prompts exhibit substantially inferior self-play optimization performance in comparison to easy prompts for language models. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. We also observe that the performance gap between difficult and easy prompts closes as the model capacity increases, suggesting that difficulty interacts with the model capacity. Building on these findings, we explore strategies to mitigate the negative effect of difficult prompts on final performance. We demonstrate that selectively removing an appropriate portion of challenging prompts enhances overall self-play performance, while also reporting failed attempts and lessons learned.
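The difficulty proxy and the pruning step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout (prompt mapped to a list of reward-model scores for its sampled responses), and the `drop_fraction` parameter are all assumptions made for this example.

```python
from statistics import mean


def prompt_difficulty(rewards_per_prompt):
    """Map each prompt to the mean reward of its sampled responses.

    A lower mean reward is read as a more difficult prompt.
    The input layout {prompt: [reward, ...]} is an assumption of this sketch.
    """
    return {prompt: mean(rewards) for prompt, rewards in rewards_per_prompt.items()}


def prune_hardest(rewards_per_prompt, drop_fraction):
    """Return the prompts kept after dropping the hardest `drop_fraction`.

    Prompts are ranked by mean reward (lowest first) and the lowest-scoring
    share is removed; the remainder is returned from hardest to easiest.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    scores = prompt_difficulty(rewards_per_prompt)
    ranked = sorted(scores, key=scores.get)  # hardest (lowest mean reward) first
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]


if __name__ == "__main__":
    # Toy reward-model scores for four hypothetical prompts.
    toy = {
        "a": [0.1, 0.3],
        "b": [0.9, 0.7],
        "c": [0.5, 0.5],
        "d": [0.2, 0.0],
    }
    print(prune_hardest(toy, drop_fraction=0.25))
```

In practice the rewards would come from scoring $N$ on-policy samples per prompt with the reward model, and the share to drop would be tuned per model rather than fixed in advance.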

Paper Structure

This paper contains 37 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: We follow the preference pair construction approach introduced by meng2024simpo and li2025simplemix. The samples with the maximal and minimal reward form a preference pair for training.
  • Figure 2: We show the distribution of the mean reward of $N$ sampled responses per prompt on Llama-3.1-Tulu-3-8B-SFT and Mistral-7B-Instruct-v0.2 for prompts from UltraFeedback (pmlr-v235-cui24f). We find that $10$ samples per prompt are sufficient to obtain a stable estimate.
  • Figure 3: We present AlpacaEval 2 results for training with the most difficult quartile of prompts dropped and for training on the full set. Incorporating the hardest quartile of prompts into training may hurt the final performance of the models.
  • Figure 4: We present the performance of the model on AlpacaEval 2 as we change $k$ from $10$ to $50$ percent on Tulu. Performance first improves and then degrades as we prune more and more hard prompts. Performance reaches its peak when we remove about $30$ percent of the most difficult prompts.
  • Figure 5: We present the performance of the model on AlpacaEval 2 as we change $k$ from $10$ to $50$ percent on Mistral. Performance first improves and then degrades as we prune more and more hard prompts. Performance reaches its peak when we remove about $50$ percent of the most difficult prompts.
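
The pair construction summarized in the Figure 1 caption (highest-reward sample as chosen, lowest-reward sample as rejected) can be sketched as below. This is an illustrative sketch only: the function name and the `prompt`/`chosen`/`rejected` record layout are assumptions for this example, not something the paper specifies.

```python
def build_preference_pair(prompt, responses, rewards):
    """Build one preference pair from sampled responses and their rewards.

    The response with the maximal reward becomes `chosen` and the response
    with the minimal reward becomes `rejected`. The returned dictionary
    layout is an assumption of this sketch.
    """
    if len(responses) != len(rewards):
        raise ValueError("responses and rewards must have the same length")
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    indices = range(len(rewards))
    best = max(indices, key=rewards.__getitem__)   # index of maximal reward
    worst = min(indices, key=rewards.__getitem__)  # index of minimal reward
    return {
        "prompt": prompt,
        "chosen": responses[best],
        "rejected": responses[worst],
    }


if __name__ == "__main__":
    # Toy example with three hypothetical responses and reward scores.
    pair = build_preference_pair(
        "Explain DPO in one sentence.",
        ["draft A", "draft B", "draft C"],
        [0.2, 0.9, -0.1],
    )
    print(pair["chosen"], "|", pair["rejected"])
```

When all sampled responses receive the same reward, this sketch returns the same response as both chosen and rejected; a real pipeline would likely skip such prompts.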