Shallow Preference Signals: Large Language Model Aligns Even Better with Truncated Data?

Xuan Qi, Jiahao Qiu, Xinzhe Juan, Yue Wu, Mengdi Wang

TL;DR

This paper reveals that the signal distinguishing preferred from non-preferred LLM outputs is predominantly concentrated in the early portion of responses, coining the term shallow preference signals. It demonstrates that training reward models and Direct Preference Optimization (DPO) models on truncated, early-token data can achieve performance on par with or better than training on full responses, across multiple datasets and supervision settings. To exploit this, the authors introduce a mixing strategy and two decoding policies (Length Control and KL Threshold Control) that improve the reward-KL tradeoff by emphasizing early signals. They validate the phenomenon on both synthetic and human-generated data, showing that the results are robust, and discuss implications for real-world alignment, including potential limitations of focusing on early tokens. Overall, the work highlights a practical path to more efficient LLM alignment while calling for deeper theoretical understanding and attention to whole-response quality in future methods.

Abstract

Aligning large language models (LLMs) with human preferences remains a key challenge in AI. Preference-based optimization methods, such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), rely on human-annotated datasets to improve alignment. In this work, we identify a crucial property of existing learning methods: the signal that distinguishes preferred responses is often concentrated in the early tokens. We refer to this as shallow preference signals. To explore this property, we systematically truncate preference datasets at various points and train both reward models and DPO models on the truncated data. Surprisingly, models trained on truncated datasets, retaining only the first half or fewer tokens, achieve comparable or even superior performance to those trained on full datasets. For example, a reward model trained on a 40% truncation of the Skywork-Reward-Preference-80K-v0.2 dataset outperforms one trained on the full dataset. This pattern is consistent across multiple datasets, suggesting the widespread presence of shallow preference signals. We further investigate the distribution of the reward signal through decoding strategies. We consider two simple decoding strategies motivated by the shallow reward signal observation, namely Length Control Decoding and KL Threshold Control Decoding, which leverage shallow preference signals to optimize the trade-off between alignment and computational efficiency. These strategies perform even better than the baseline, which further supports our hypothesis. The phenomenon of shallow preference signals highlights potential issues in LLM alignment: existing alignment methods often focus on aligning only the initial tokens of responses, rather than considering the full response. This could lead to discrepancies with real-world human preferences, resulting in suboptimal alignment performance.
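
The truncation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `PreferencePair` container, the function names, and the choice to truncate both responses by the same ratio (keeping at least one token) are assumptions made for illustration, and the token lists stand in for whatever tokenizer the experiments actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """Hypothetical container for one preference example."""
    prompt: str
    chosen: List[str]    # tokens of the preferred response
    rejected: List[str]  # tokens of the non-preferred response


def truncate_tokens(tokens: List[str], ratio: float) -> List[str]:
    """Keep the first `ratio` fraction of tokens (at least one if non-empty)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not tokens:
        return []
    keep = max(1, int(len(tokens) * ratio))
    return tokens[:keep]


def truncate_pair(pair: PreferencePair, ratio: float) -> PreferencePair:
    """Truncate both responses with the same ratio; the prompt is left intact."""
    return PreferencePair(
        prompt=pair.prompt,
        chosen=truncate_tokens(pair.chosen, ratio),
        rejected=truncate_tokens(pair.rejected, ratio),
    )


if __name__ == "__main__":
    example = PreferencePair(
        prompt="Explain overfitting.",
        chosen="Overfitting occurs when a model memorizes noise in the data".split(),
        rejected="It is a thing that happens sometimes".split(),
    )
    # Whitespace splitting is only a stand-in for a real tokenizer.
    print(truncate_pair(example, 0.5))
```

A reward model or DPO model would then be trained on the truncated pairs in place of the full ones, with the ratio swept over several values to compare against the full-response setting.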

Paper Structure

This paper contains 37 sections, 22 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An example illustrating the phenomenon of shallow preference signals. It demonstrates how the relative quality of two responses can be determined from the early portion of the response, or even from the first sentence. Training with only the initial part allows the model to capture most of the preference signals while conserving resources.
  • Figure 2: The x-axis represents the response truncation length and ratio, while the y-axis shows the accuracy of DPO implicit reward in predicting the relative quality of responses based on truncated datasets.
  • Figure 3: KL Divergence between the DPO model and the reference model at each token position. The plot shows that the divergence is higher for early tokens and decreases as generation progresses.
  • Figure 4: Reward and corresponding KL Divergence for the baseline and two different control strategies. The blue dots represent data from the baseline, while the red triangles and green squares represent the Length Control and KL Threshold Control strategies, respectively.
  • Figure 5: Accuracy of DPO implicit reward in predicting the relative quality of responses on the human-generated SHP dataset with truncated responses. The x-axis represents the truncation ratio and length, and the y-axis shows the accuracy of DPO implicit reward predictions compared to human annotations.
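
The two quantities plotted in these figures can be sketched in a few lines of Python. This is a hedged illustration, not the paper's evaluation code. It assumes the standard DPO implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)), a hypothetical beta of 0.1, and the direction KL(policy || reference) for the per-position divergence; all function names and the toy inputs are made up for the example.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary.

    Assumes q is strictly positive wherever p is (true for softmax outputs).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def per_position_kl(
    policy: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> List[float]:
    """KL between next-token distributions at each token position."""
    return [kl_divergence(p, q) for p, q in zip(policy, reference)]


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward from sequence log-probabilities (beta is assumed)."""
    return beta * (logp_policy - logp_ref)


def pairwise_accuracy(
    pairs: Sequence[Tuple[float, float, float, float]],
    beta: float = 0.1,
) -> float:
    """Fraction of pairs where the chosen response gets the higher reward.

    Each item is (policy logp chosen, reference logp chosen,
                  policy logp rejected, reference logp rejected).
    """
    correct = sum(
        implicit_reward(pc, rc, beta) > implicit_reward(pr, rr, beta)
        for pc, rc, pr, rr in pairs
    )
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy distributions: the two models disagree at position 0, agree at position 1.
    policy = [[1.0, 0.0], [0.5, 0.5]]
    reference = [[0.5, 0.5], [0.5, 0.5]]
    print(per_position_kl(policy, reference))
    print(pairwise_accuracy([(-1.0, -2.0, -3.0, -2.0), (-2.0, -1.0, -1.0, -2.0)]))
```

To reproduce curves like those in Figures 2 and 5, the sequence log-probabilities would be computed on responses truncated at each length or ratio before calling `pairwise_accuracy`.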