SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters

Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G Honavar

TL;DR

Hyperparameter tuning in preference alignment is a practical bottleneck for real-world deployment. SimPER introduces a hyperparameter-free offline objective that directly optimizes the inverse perplexity of the chosen and rejected responses, without requiring a reference model, and performs strongly on Open LLM Leaderboard benchmarks and instruction-following tasks. Gradient and divergence analyses support the approach: they show balanced updates on chosen and rejected responses and minimization of the total variation distance (TVD), which implies mode-seeking alignment. Empirical results and ablations demonstrate robustness across settings, with notable gains on AlpacaEval 2 and MT-Bench, and favorable perplexity-density and length-bias characteristics. The method offers a simple, memory-efficient path to effective alignment, and the authors provide public code.

Abstract

Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches, even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: https://github.com/tengxiao1/SimPER.
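The inverse-perplexity computation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (see the linked repository for that): it assumes the loss is the rejected response's inverse perplexity minus the chosen response's, and the function names and toy token log-probabilities are hypothetical.

```python
import math


def inverse_perplexity(token_logprobs):
    """Exponentiated average token log-likelihood, i.e. 1 / perplexity."""
    if not token_logprobs:
        raise ValueError("a response must contain at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def simper_loss(chosen_logprobs, rejected_logprobs):
    """Assumed form of the objective for one preference pair.

    Minimizing this value raises the inverse perplexity of the chosen
    response and lowers that of the rejected one. No beta, margin, or
    reference-model term appears.
    """
    return inverse_perplexity(rejected_logprobs) - inverse_perplexity(chosen_logprobs)


# Hypothetical per-token log-probabilities under the policy being trained.
chosen = [math.log(0.5), math.log(0.5)]                       # 2 tokens
rejected = [math.log(0.25), math.log(0.25), math.log(0.25)]   # 3 tokens

loss = simper_loss(chosen, rejected)
print(loss)  # 0.25 - 0.5, up to floating-point rounding
```

Because each term averages log-likelihood over the response length before exponentiating, both terms lie in (0, 1] regardless of how long the responses are, which is one way to read the length-normalized, bounded character of the objective.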

Paper Structure

This paper contains 19 sections, 2 theorems, 23 equations, 6 figures, 9 tables.

Key Result

Theorem 3.1

Minimizing the SFT objective with respect to $\theta$ approximately minimizes the Kullback-Leibler divergence (KLD) between $\pi_\theta$ and the distribution of the chosen responses in the preference dataset, while minimizing the SimPER objective approximately minimizes the total variation distance (TVD).
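A toy numerical example may help make the KLD/TVD distinction concrete. The sketch below is not part of the paper's proof; the three discrete distributions are invented for illustration. It compares a "covering" candidate that spreads mass evenly with a "seeking" candidate that concentrates on the dominant mode of a bimodal target.

```python
import math


def forward_kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tvd(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical bimodal target over five outcomes (modes at index 0 and 4).
data = [0.58, 0.01, 0.01, 0.01, 0.39]
# Candidate that covers everything thinly.
cover = [0.20, 0.20, 0.20, 0.20, 0.20]
# Candidate that commits to the dominant mode.
seek = [0.96, 0.01, 0.01, 0.01, 0.01]

kl_cover, kl_seek = forward_kl(data, cover), forward_kl(data, seek)
tvd_cover, tvd_seek = tvd(data, cover), tvd(data, seek)

# Forward KL ranks the covering candidate as closer to the data,
# while TVD ranks the mode-seeking candidate as closer.
print(kl_cover < kl_seek, tvd_seek < tvd_cover)
```

Forward KL penalizes the seeking candidate heavily for assigning little mass to the second mode, whereas TVD is bounded by 1 and only counts the mass that is misplaced, so on this example the two criteria disagree about which candidate is better.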

Figures (6)

  • Figure 1: Evaluation on the MT-Bench Score (1-10) of SimPO and our SimPER across different large language models reveals the high sensitivity and instability of SimPO with respect to its hyperparameter $\gamma$ across models. In contrast, our SimPER, which operates without any hyperparameters in the objective function, consistently and significantly outperforms SimPO across a wide range of models. Additional experimental evidence on other widely used benchmarks is provided in Section \ref{sec:exp}.
  • Figure 2: Illustration of the characteristics of KLD and TVD. While SFT exhibits mass-covering behavior by minimizing forward KL, SimPER exhibits mode-seeking behavior, similar to RLHF \citep{tajwar2024preference}, by minimizing TVD.
  • Figure 3: Training dynamics of SimPER and SimPO with different hyperparameters on Mistral-7B (results on Llama3-8B can be found in Section \ref{sec:abl}). Compared to SimPO across various hyperparameters, SimPER exhibits the smallest decline in chosen likelihoods while still achieving the largest increase in the likelihood margin between chosen and rejected responses.
  • Figure 4: The win rates, computed by GPT-4, in comparison to the chosen responses of test prompts in the Anthropic-HH dataset.
  • Figure 5: Training dynamics of SimPER and SimPO with different hyperparameters on Llama3-8B-Base. Compared to SimPO across various hyperparameters, SimPER exhibits the smallest decline in chosen likelihoods while still achieving the largest increase in the likelihood margin between chosen and rejected responses, and it performs better, as shown in Table \ref{tab:main_res}.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 3.1
  • Lemma A.1
  • proof
  • proof