Intermediate direct preference optimization

Atsushi Kojima

TL;DR

Standard Direct Preference Optimization (DPO) computes its loss only from final-layer logits. This paper introduces intermediate DPO, an auxiliary loss computed at selected intermediate transformer layers and averaged across them. The training objective is a weighted sum of the conventional DPO loss and the intermediate loss, while inference uses the final-layer logits as in standard DPO. On the ultrafeedback dataset with a 7B, 32-layer SFT model, computing the intermediate DPO loss at layer 22 (K=22) yields GPT-4-judged win rates of 67.5% against the SFT model and 52.5% against the conventional DPO model, with further gains when selecting multiple dispersed layers. The results indicate that layer position and selection strategy affect performance, which gives practical guidance for applying layer-wise auxiliary losses when aligning LLMs with human preferences.

Abstract

We propose the intermediate direct preference optimization (DPO) method, which calculates the DPO loss at selected intermediate layers as an auxiliary loss for fine-tuning large language models (LLMs). The conventional DPO method fine-tunes a supervised fine-tuning (SFT) model by calculating the DPO loss using logits from the final layer. In our intermediate DPO approach, DPO losses are calculated using the logits from K selected intermediate layers and averaged to obtain the intermediate DPO loss. For training the intermediate DPO model, the final loss is the weighted sum of the DPO and intermediate DPO losses. During inference, the intermediate DPO model decodes using the final-layer logits, as the conventional DPO model does. In experiments using the ultrafeedback dataset, the performance of the intermediate DPO model was evaluated using GPT-4. The intermediate DPO model trained using the intermediate DPO loss calculated at the 22nd layer of a 32-layer SFT model achieved win rates of 52.5% and 67.5% against the conventional DPO and SFT models, respectively, demonstrating the effectiveness of the proposed method. Furthermore, we report the relationships among the position of the selected intermediate layers, the number of layers, and performance.
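
The loss combination described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the author's implementation. It assumes that each layer (final or intermediate) has already been reduced to four sequence log-probabilities: policy and reference scores for the chosen and rejected responses. How intermediate-layer logits and their reference scores are obtained is not stated here, so that step is left out. The function names, the `beta` default, and the `dpo + weight * intermediate` form of the weighted sum are all hypothetical choices made for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def intermediate_dpo_loss(layer_logps, beta=0.1):
    """Average of the DPO losses computed at each selected intermediate layer.

    layer_logps: one (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
    tuple per selected layer.
    """
    losses = [dpo_loss(*logps, beta=beta) for logps in layer_logps]
    return sum(losses) / len(losses)


def total_loss(final_logps, layer_logps, weight=0.5, beta=0.1):
    """Weighted sum of the final-layer DPO loss and the intermediate DPO loss.

    Assumption: the weighting is `dpo + weight * intermediate`; the exact
    form used in the paper is not given in the abstract.
    """
    return dpo_loss(*final_logps, beta=beta) + weight * intermediate_dpo_loss(
        layer_logps, beta=beta
    )
```

With `weight=0` the objective reduces to conventional DPO, which mirrors the statement that only the training loss changes while decoding still uses the final-layer logits.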

Paper Structure

This paper contains 8 sections, 3 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Overview of intermediate DPO
  • Figure 2: Win rates of the intermediate DPO model against the SFT model and DPO model ($K=11, 22$)
  • Figure 3: Comparison of likelihood from intermediate layers