Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

Hiroshi Takahashi, Tomoharu Iwata, Atsutoshi Kumagai, Sekitoshi Kanai, Masanori Yamada, Kosuke Nishida, Kazutoshi Shinoda

Abstract

Aligning language models with human preferences is essential for ensuring their safety and reliability. Although most existing approaches assume specific human preference models such as the Bradley-Terry model, this assumption may fail to accurately capture true human preferences, and consequently, these methods lack statistical consistency, i.e., the guarantee that language models converge to the true human preference as the number of samples increases. In contrast, direct density ratio optimization (DDRO) achieves statistical consistency without assuming any human preference models. DDRO models the density ratio between preferred and non-preferred data distributions using the language model, and then optimizes it via density ratio estimation. However, this density ratio is unstable and often diverges, leading to training instability of DDRO. In this paper, we propose a novel alignment method that is both stable and statistically consistent. Our approach is based on the relative density ratio between the preferred data distribution and a mixture of the preferred and non-preferred data distributions. Our approach is stable since this relative density ratio is bounded above and does not diverge. Moreover, it is statistically consistent and yields significantly tighter convergence guarantees than DDRO. We experimentally show its effectiveness with Qwen 2.5 and Llama 3.
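As a hedged numerical sketch (not code from the paper): the boundedness claim in the abstract can be checked directly, assuming the standard relative density ratio $r^{*}(y|x) = p^{+}(y|x) / \bigl(\alpha\, p^{+}(y|x) + (1-\alpha)\, p^{-}(y|x)\bigr)$, in the setting of Figure 1(a) ($p^{-}(y|x)=0.1$, $\alpha=0.5$). The plain density ratio $g^{*} = p^{+}/p^{-}$ grows without bound as $p^{+}$ grows, while $r^{*}$ stays below $1/\alpha$.

```python
# Illustrative check (assumed form of the relative density ratio, not the
# paper's code): compare the plain density ratio g = p+ / p- with the
# relative density ratio r = p+ / (alpha * p+ + (1 - alpha) * p-),
# using p-(y|x) = 0.1 and alpha = 0.5 as in Figure 1(a).

def density_ratio(p_plus, p_minus):
    """Plain density ratio g(y|x) = p+(y|x) / p-(y|x); unbounded."""
    return p_plus / p_minus

def relative_density_ratio(p_plus, p_minus, alpha):
    """Relative density ratio against the alpha-mixture; bounded by 1/alpha."""
    return p_plus / (alpha * p_plus + (1.0 - alpha) * p_minus)

p_minus, alpha = 0.1, 0.5
for p_plus in [0.01, 0.1, 1.0, 10.0, 1000.0]:
    g = density_ratio(p_plus, p_minus)
    r = relative_density_ratio(p_plus, p_minus, alpha)
    # g grows without bound, while r never reaches 1/alpha = 2.
    assert r < 1.0 / alpha
    print(f"p+ = {p_plus:>8}: g = {g:>10.2f}, r = {r:.4f}")
```

The bound follows because $r = p^{+}/(\alpha p^{+} + (1-\alpha)p^{-}) < p^{+}/(\alpha p^{+}) = 1/\alpha$ whenever $p^{-} > 0$, which is exactly why the relative density ratio cannot diverge during training.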

Paper Structure

This paper contains 24 sections, 5 theorems, 61 equations, 7 figures, 4 tables, and 1 algorithm.

Key Result

Theorem 3.1

Let us define $\hat{\theta}=\mathop{\mathrm{arg\,min}}\limits_{\theta}\hat{\mathcal{L}}_{\mathrm{RDRE}}(\theta)$, where $\hat{\mathcal{L}}_{\mathrm{RDRE}}(\theta)$ is the empirical approximation of $\mathcal{L}_{\mathrm{RDRE}}(\theta)$ using the preferred and non-preferred datasets $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$. [...] where $\mu$ satisfies $f^{\prime\prime}(t)\geq\mu$ for all $t$.
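As a hedged illustration (an assumption for concreteness, not taken from the excerpt): the strong-convexity constant $\mu$ in Theorem 3.1 is explicit for common divergence generators. For example, the Pearson ($\chi^{2}$) divergence generator gives $\mu = 1$:

```latex
% Illustrative example (assumed generator, not from the paper): the Pearson
% (chi-squared) divergence generator has a constant second derivative, so the
% condition f''(t) >= mu in Theorem 3.1 holds with mu = 1.
\[
  f(t) = \tfrac{1}{2}(t-1)^{2}, \qquad
  f^{\prime\prime}(t) = 1 \;\geq\; \mu \quad \text{with } \mu = 1 .
\]
```

A larger admissible $\mu$ corresponds to a stronger curvature lower bound on $f$, which in bounds of this type typically tightens the resulting convergence guarantee.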

Figures (7)

  • Figure 1: (a) Comparing density ratio $g^{*}(y|x)$ and relative density ratio $r^{*}(y|x)$, where the non-preferred data distribution is $p^{-}(y|x)=0.1$ and the hyperparameter is $\alpha=0.5$. Although $g^{*}(y|x)$ diverges as $p^{+}(y|x)\to\infty$, $r^{*}(y|x)$ is bounded above by $1/\alpha$. (b) Our loss functions for preferred (blue) and non-preferred (orange) samples with $\alpha=0.3$. The loss for preferred samples is minimized when $T_{\theta} \equiv \log p_{\theta}(y|x) - \log p_{\mathrm{ref}}(y|x)$ attains $\log (1/\alpha)$.
  • Figure 2: Relationship between AlpacaEval LC win rates and the hyperparameter $\alpha$ on UF-G. The semi-transparent area represents standard deviations.
  • Figure 3: Relationship between AlpacaEval LC win rates and the hyperparameter $\alpha$ on MIX-14K. The semi-transparent area represents standard deviations.
  • Figure 4: Training losses over steps for DDRO, KTO, and RDRO on Llama-8B with UF-G and $\alpha=0.39$. Note that the loss values are not directly comparable across methods since the objectives differ.
  • Figure 5: Gradient norms over steps for DDRO, KTO, and RDRO on Llama-8B with UF-G and $\alpha=0.39$.
  • ...and 2 more figures

Theorems & Definitions (10)

  • Theorem 3.1
  • proof : Proof sketch
  • Lemma B.1
  • proof
  • Lemma B.2
  • proof
  • Lemma B.3
  • proof
  • Lemma B.4
  • proof