PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Shiva Kumar Pentyala; Zhichao Wang; Bin Bi; Kiran Ramnath; Xiang-Bo Mao; Regunathan Radhakrishnan; Sitaram Asur; Na Cheng

TL;DR

The paper addresses the alignment tax that arises when LLMs are fine-tuned sequentially (SFT followed by preference alignment). It introduces PAFT, a parallel training paradigm that independently optimizes the delta parameters $\delta_\mathrm{sft}$ and $\delta_\mathrm{xpo}$ from the same pre-trained weights $\theta_\mathrm{pre}$ and then combines them through a merging function $\theta_\mathrm{merge}=f(\theta_\mathrm{pre},\delta_\mathrm{xpo},\delta_\mathrm{sft})$, with sparsity induced on $\delta_\mathrm{sft}$ by $L_1$ regularization to reduce interference. The authors show that sparse delta parameters merge more effectively under methods such as TIES and Task Arithmetic, achieving state-of-the-art results on the Open LLM Leaderboard (e.g., PAFT-70B ranked top globally) and strong AlpacaEval performance, and that the approach generalizes across models and preference alignment techniques (DPO and ORPO). Overall, PAFT provides a scalable way to use both SFT and human preference data without the performance loss typical of sequential training.
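The merging step can be illustrated with a minimal sketch of Task Arithmetic, one of the merging methods named above. This is not the authors' implementation: the function name `task_arithmetic_merge`, the `scale` parameter, and the representation of parameters as dictionaries of float lists are assumptions made for illustration only.

```python
# Sketch of a Task Arithmetic style merge (illustrative, not the paper's code).
# Parameters are plain dicts mapping a tensor name to a flat list of floats.

def task_arithmetic_merge(theta_pre, delta_sft, delta_xpo, scale=1.0):
    """Return theta_pre + scale * (delta_sft + delta_xpo), tensor by tensor.

    `scale` is a hypothetical scaling coefficient; a tensor missing from a
    delta dict is treated as an all-zero delta.
    """
    merged = {}
    for name, base in theta_pre.items():
        d_sft = delta_sft.get(name, [0.0] * len(base))
        d_xpo = delta_xpo.get(name, [0.0] * len(base))
        merged[name] = [
            w + scale * (a + b) for w, a, b in zip(base, d_sft, d_xpo)
        ]
    return merged


# Toy example: the two deltas touch different coordinates, so they do not
# interfere when summed. Sparse deltas make this situation more likely.
theta_pre = {"layer.weight": [1.0, 2.0, 3.0]}
delta_sft = {"layer.weight": [0.5, 0.0, 0.0]}
delta_xpo = {"layer.weight": [0.0, 0.0, -1.0]}

theta_merge = task_arithmetic_merge(theta_pre, delta_sft, delta_xpo)
print(theta_merge)
```

When both deltas are dense, many coordinates receive competing updates from the two training runs; the toy values above show the non-overlapping case that sparsity is meant to encourage.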

Abstract

Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline incurs an alignment tax that degrades LLM performance. This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO and ORPO) with the same pre-trained model on their respective datasets. The model produced by SFT and the model produced by preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals an important finding: preference alignment such as DPO naturally results in a sparse model, while SFT leads to a naturally dense model that needs to be sparsified for effective model merging. This paper introduces an effective interference resolution method that reduces redundancy by sparsifying the delta parameters. The LLM resulting from the new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
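One plausible way to write the sparsified SFT objective implied by the $L_1$ regularization mentioned in the TL;DR is given below. It is a sketch under stated assumptions, not necessarily the paper's exact equation: the symbols $\mathcal{L}_\mathrm{sft}$ (the standard SFT loss) and $\lambda$ (the regularization weight) are introduced here for illustration.

$$\mathcal{L}_\mathrm{sft}^\mathrm{sparse} = \mathcal{L}_\mathrm{sft}(\theta_\mathrm{pre} + \delta_\mathrm{sft}) + \lambda \, \lVert \delta_\mathrm{sft} \rVert_1$$

The penalty term pushes individual entries of $\delta_\mathrm{sft}$ toward zero, so a larger $\lambda$ would be expected to yield a sparser delta at some cost to the SFT fit.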

Paper Structure (17 sections, 2 equations, 2 figures, 4 tables)

Introduction
Methodology
Problem Setting
Parallel Training
Sparse Merging
Experiments
Evaluation Settings
Parallel Training vs. Sequential Training
Sparse Merging vs. Dense Merging
Comparison with State-of-the-art LLMs
Related Work
SFT and Human Preference Alignment
Sparsity for LLMs
Model Merging
Conclusions
...and 2 more sections

Figures (2)

Figure 1: Comparison of training paradigms
Figure 2: Adapter sparsity for SFT and DPO. The sparsity levels are computed by first merging the LoRA matrices $\delta_A$ and $\delta_B$ through matrix multiplication ($\delta = \delta_B \times \delta_A$), and then computing the percentage of elements within $\delta$ that are less than a threshold of $1 \times 10^{-5}$, indicating the proportion of weights approaching zero. The reported sparsity is the average across all layers.
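
The sparsity measurement described in the Figure 2 caption can be sketched as follows. This is a hedged illustration in pure Python, not the authors' code: the helper names `matmul`, `layer_sparsity`, and `average_sparsity` are hypothetical, and comparing the absolute value of each element against the threshold is an assumption, since the caption only says "less than a threshold".

```python
# Sketch of the adapter-sparsity measurement (illustrative assumptions noted).

def matmul(b, a):
    """Multiply b (d x r) by a (r x k); both are lists of rows."""
    inner, cols = len(a), len(a[0])
    return [
        [sum(row[i] * a[i][j] for i in range(inner)) for j in range(cols)]
        for row in b
    ]


def layer_sparsity(delta_b, delta_a, threshold=1e-5):
    """Percentage of entries of delta = delta_b @ delta_a below threshold.

    Assumption: the magnitude |x| is compared against the threshold.
    """
    delta = matmul(delta_b, delta_a)
    entries = [x for row in delta for x in row]
    near_zero = sum(1 for x in entries if abs(x) < threshold)
    return 100.0 * near_zero / len(entries)


def average_sparsity(layers, threshold=1e-5):
    """Average the per-layer sparsity over a list of (delta_b, delta_a)."""
    values = [layer_sparsity(b, a, threshold) for b, a in layers]
    return sum(values) / len(values)


# Toy rank-1 adapters for two layers.
sparse_layer = ([[1.0], [0.0]], [[2.0, 0.0]])  # delta = [[2, 0], [0, 0]]
dense_layer = ([[1.0], [1.0]], [[1.0, 1.0]])   # delta = [[1, 1], [1, 1]]

print(layer_sparsity(*sparse_layer))
print(average_sparsity([sparse_layer, dense_layer]))
```

In the toy data, three of the four merged entries of the first layer are zero and none of the second layer's are, which is why the per-layer and averaged figures differ.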
