ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood

Ruoyu Wang, Jiachen Sun, Shaowei Hua, Quan Fang

TL;DR

The paper proposes Aligned Supervised Fine-Tuning (ASFT), an approach that aligns LLMs with pair-wise datasets by optimizing the absolute likelihood of each response rather than using the Bradley-Terry model, and that eliminates the need for a reference model.

Abstract

Direct Preference Optimization (DPO) is a method for enhancing model performance by directly optimizing for the preferences or rankings of outcomes, instead of traditional loss functions. This approach has proven effective in aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across various tasks, DPO has been criticized for its sensitivity to the effectiveness of Supervised Fine-Tuning (SFT) and its limitations in enabling models to learn human-preferred responses, leading to less satisfactory performance. To address these limitations, we propose Aligned Supervised Fine-Tuning (ASFT), an effective approach that better aligns LLMs with pair-wise datasets by optimizing absolute likelihood for each response, rather than using the Bradley-Terry model, and eliminates the need for a reference model. Through theoretical gradient analysis, we demonstrate that ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data at a faster rate than it increases the probability of producing preferred data. Additionally, we compare ASFT to DPO and its latest variants, such as the single-step approach ORPO, using the latest instruction-tuned model Llama3, which has been fine-tuned on UltraFeedback and HH-RLHF. We evaluate performance on instruction-following benchmarks such as MT-Bench and on traditional text generation metrics such as BLEU-4 and ROUGE-L. Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.

Paper Structure

This paper contains 21 sections, 2 theorems, 25 equations, 3 figures, and 3 tables.

Key Result

Theorem A.1

The partial derivatives of $\mathcal{L}_{R}\left(x_1, x_2\right)=-\log \left(\frac{x_1^\beta}{x_1^\beta+x_2^\beta}\right)$ with respect to $x_1$ and $x_2$ are given by:
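
The statement is cut off at this point in the summary. The expressions below are not quoted from the paper. They come from differentiating the stated loss directly, assuming $x_1, x_2 > 0$ and $\beta > 0$. Rewriting the loss as a difference of logarithms makes each term easy to differentiate:

$$
\mathcal{L}_{R}\left(x_1, x_2\right) = -\beta \log x_1 + \log\left(x_1^\beta + x_2^\beta\right)
$$

$$
\frac{\partial \mathcal{L}_{R}}{\partial x_1} = -\frac{\beta}{x_1} + \frac{\beta x_1^{\beta-1}}{x_1^\beta + x_2^\beta} = -\frac{\beta x_2^\beta}{x_1\left(x_1^\beta + x_2^\beta\right)},
\qquad
\frac{\partial \mathcal{L}_{R}}{\partial x_2} = \frac{\beta x_2^{\beta-1}}{x_1^\beta + x_2^\beta}
$$

Dividing the two magnitudes gives $\left|\partial \mathcal{L}_{R} / \partial x_2\right| / \left|\partial \mathcal{L}_{R} / \partial x_1\right| = x_1 / x_2$. If $x_1$ and $x_2$ are read as the probabilities of the preferred and dispreferred responses (an interpretation assumed here), the gradient with respect to $x_2$ is the larger of the two whenever $x_1 > x_2$. This is consistent with the abstract's remark that the dispreferred probability falls faster than the preferred probability rises.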

Figures (3)

  • Figure 1: In panels (a) and (b), the plots on the left show the value of the loss function when preferred and dispreferred responses are generated with different probabilities. The plots on the right give a top-down view of the loss surface, with the gradient field drawn as red arrows at sampled points. The direction of each arrow shows the direction of gradient-based optimization, and its length shows the strength of the optimization.
  • Figure 2: Comparison measured on UltraFeedback. (a) Likelihood margin between ASFT and ORPO. (b) Total runtime and memory usage for ASFT and DPO.
  • Figure 3: A case study on MT-Bench. The Llama3-Instruct model trained with ASFT was used to generate high-quality responses to the input, and GPT-4 was used as the evaluation model to score them.
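
The gradient field described in the Figure 1 caption can be explored numerically. The sketch below is a minimal check, not the authors' code. It assumes the loss of interest is $\mathcal{L}_{R}$ from Theorem A.1, with `x1` and `x2` standing for the probabilities of the preferred and dispreferred responses; the figure itself may plot a different quantity. The function names and the default `beta` are illustrative choices.

```python
import math


def loss_r(x1: float, x2: float, beta: float = 1.0) -> float:
    """L_R(x1, x2) = -log(x1^beta / (x1^beta + x2^beta)), as stated in Theorem A.1."""
    return -math.log(x1**beta / (x1**beta + x2**beta))


def grad_analytic(x1: float, x2: float, beta: float = 1.0) -> tuple[float, float]:
    """Closed-form partial derivatives obtained by differentiating loss_r by hand."""
    denom = x1**beta + x2**beta
    d_x1 = -beta * x2**beta / (x1 * denom)
    d_x2 = beta * x2 ** (beta - 1) / denom
    return d_x1, d_x2


def grad_numeric(x1: float, x2: float, beta: float = 1.0, h: float = 1e-6) -> tuple[float, float]:
    """Central finite differences, used only to cross-check grad_analytic."""
    d_x1 = (loss_r(x1 + h, x2, beta) - loss_r(x1 - h, x2, beta)) / (2 * h)
    d_x2 = (loss_r(x1, x2 + h, beta) - loss_r(x1, x2 - h, beta)) / (2 * h)
    return d_x1, d_x2


if __name__ == "__main__":
    # A few sample points: preferred more likely, equal, and dispreferred more likely.
    for x1, x2 in [(0.6, 0.2), (0.3, 0.3), (0.1, 0.5)]:
        a1, a2 = grad_analytic(x1, x2)
        n1, n2 = grad_numeric(x1, x2)
        ratio = abs(a2) / abs(a1)
        print(f"x1={x1}, x2={x2}: analytic=({a1:.4f}, {a2:.4f}) "
              f"numeric=({n1:.4f}, {n2:.4f}) |d_x2|/|d_x1|={ratio:.4f}")
```

Under these assumptions, a gradient-descent step raises `x1` and lowers `x2`, and the printed ratio shows how much more strongly the loss acts on `x2` than on `x1` at each point.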

Theorems & Definitions (4)

  • Theorem A.1
  • proof
  • Theorem A.2
  • proof