A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques

Megh Thakkar, Quentin Fournier, Matthew D Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, Sarath Chandar

TL;DR

The paper investigates the trade-offs of parameter-efficient preference alignment for large language models by systematically varying three core axes: alignment dataset quality/quantity, alignment method, and base model type. Through over 300 experiments using LoRA and QLoRA across Harmlessness and Helpfulness preferences with HH-RLHF and BeaverTails, it identifies that data quality generally boosts alignment, DPO yields higher fidelity for instruction-tuned bases, and mixtures of preferences often degrade performance. It also shows that pre-trained models benefit more from SFT while instruction-tuned models benefit from DPO, and that model merging can mitigate trade-offs between conflicting preferences. The findings culminate in practical guidelines for researchers to perform more effective parameter-efficient LLM alignment and suggest avenues for expanding PEFT approaches and preferences in future work.

Abstract

Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences. While pre-training remains out of reach for most researchers due to the compute required, fine-tuning has become affordable thanks to parameter-efficient methods such as LoRA and QLoRA. Alignment is known to be sensitive to the many factors involved, including the quantity and quality of data, the alignment method, and the adapter rank. However, there has not yet been an extensive study of their effect on downstream performance. To address this gap, we conduct an in-depth investigation of the impact of popular choices for three crucial axes: (i) the alignment dataset (HH-RLHF and BeaverTails), (ii) the alignment technique (SFT and DPO), and (iii) the model (LLaMA-1, Vicuna-v1.3, Mistral-7b, and Mistral-7b-Instruct). Our extensive setup spanning over 300 experiments reveals consistent trends and unexpected findings. We observe how more informative data helps with preference alignment, cases where supervised fine-tuning outperforms preference optimization, and how aligning to a distinct preference boosts performance on downstream tasks. Through our in-depth analyses, we put forward key guidelines to help researchers perform more effective parameter-efficient LLM alignment.
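The two alignment techniques compared in the abstract optimize different objectives. The following is a minimal sketch of both losses for a single example, assuming summed or per-token log-probabilities are already available; the function names, the `beta` default, and the toy numbers are illustrative assumptions, not the paper's training code.

```python
import math

def sft_loss(chosen_token_logps):
    """SFT: negative mean log-likelihood of the preferred response's tokens."""
    return -sum(chosen_token_logps) / len(chosen_token_logps)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log(sigmoid(margin)) on sequence-level log-probabilities.

    margin = beta * [(policy - reference) on the chosen response
                     - (policy - reference) on the rejected response]
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy values (hypothetical): the policy favors the chosen response more than
# the reference model does, so the DPO loss falls below log(2).
example = dpo_loss(policy_chosen=-10.0, policy_rejected=-12.0,
                   ref_chosen=-11.0, ref_rejected=-11.0, beta=1.0)
```

SFT only uses the preferred response, whereas DPO contrasts the preferred and rejected responses against a frozen reference model, which is one way to read the abstract's distinction between supervised fine-tuning and preference optimization.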

Paper Structure (51 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Performance comparison on helpful and harmless benchmarks when models are aligned using QLoRA over HH-RLHF (in red) and BeaverTails (in blue). We observe better performance when using a more informative, higher-quality preference alignment dataset, although non-instruction-tuned models often overfit when aligned using DPO (Section \ref{subsec:hh_vs_bv}).
  • Figure 2: Performance trends with respect to the number of samples of HH-RLHF and BeaverTails used for SFT alignment (Section \ref{subsec:hh_vs_bv}). Models aligned with a higher-quality dataset seem to learn faster or regress more slowly.
  • Figure 3: Relationship of the number of samples used for alignment using SFT and DPO with Mistral (Section \ref{subsec:num_samples}). The performance here is shown in % relative to the performance when using 1600 samples.
  • Figure 4: Comparing the downstream performance when aligning using SFT (in light blue) and DPO (in pink) with QLoRA. SFT generally outperforms DPO when used over pre-trained models, significantly so for instruction-following tasks. DPO is more faithful to explicit preferences such as harmlessness and performs significantly better for instruction-tuned models (Section \ref{sec:sft_vs_dpo}).
  • Figure 5: Comparing the effect of applying alignment methods on pre-trained models with instruction-tuned models using LLaMA-1 (Section \ref{subsec:pretrain_vs_instruction}). SFT helps more for pre-trained models, while DPO helps more for instruction-tuned models. However, when aligning to objective preferences like harmlessness, DPO leads to more faithful alignment across both pre-trained and instruction-tuned models.
  • ...and 1 more figures
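
The Figure 3 caption reports performance in % relative to the run with 1600 samples. A minimal sketch of one way to compute such a normalization is shown below, assuming "relative" means the ratio to the 1600-sample score expressed as a percentage; the helper name and the scores are hypothetical and not taken from the paper.

```python
def relative_to_baseline(scores, baseline_n=1600):
    """Express each score as a percentage of the score at `baseline_n` samples.

    `scores` maps a number of alignment samples to a benchmark score.
    """
    base = scores[baseline_n]
    if base == 0:
        raise ValueError("baseline score must be non-zero")
    return {n: 100.0 * s / base for n, s in scores.items()}

# Hypothetical scores, for illustration only.
toy_scores = {400: 2.0, 1600: 4.0, 6400: 5.0}
relative = relative_to_baseline(toy_scores)
```

Under this reading, the 1600-sample run always sits at 100%, and values above or below 100% show gains or regressions from using more or fewer samples.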