Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning

Mohit Raghavendra; Junmo Kang; Alan Ritter

TL;DR

The paper analyzes how to allocate a fixed data-annotation budget between Supervised Finetuning (SFT) and Preference Finetuning (PFT) in post-training of small- to medium-sized LLMs. It finds that SFT dominates performance in very low-data settings, while a hybrid SFT+PFT pipeline becomes advantageous with larger budgets, often favoring more preference data. A notable cold-start problem emerges when applying PFT directly to the base model, which can be mitigated by allocating a small portion of the budget to SFT first, yielding substantial gains on tasks requiring structured reasoning like GSM8k. The study provides actionable guidance on budget-aware data collection, showing how costs of SFT vs PFT influence the optimal mix and highlighting when SFT is essential for enabling effective PFT.

Abstract

Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT) using methods like Direct Preference Optimization (DPO). The two stages require annotated data that differ greatly in structure and cost. We study how to optimally allocate a fixed training data budget between the two stages, through extensive experiments spanning four diverse tasks, multiple model sizes and various data annotation costs. Our findings reveal that SFT alone on the base model dominates performance in low-data regimes ($<1,000$ annotated examples). With larger data budgets, we observe that a combination of SFT and PFT, often with increasing portions allocated towards preference data, yields optimal performance. However, completely eliminating SFT and running PFT directly on the base model yields suboptimal performance on tasks like mathematics, which we describe as the cold start problem. We observe that this is due to the distribution shift arising from using DPO directly on the base model to elicit step-by-step reasoning. This limitation can be effectively addressed by allocating even a small portion ($<10$%) of the budget to SFT first, resulting in performance improvements of $15$-$20$% on analytical benchmarks like GSM8k. These results provide actionable insights for researchers and practitioners optimizing model development under budget constraints, where high-quality data curation often represents a significant portion of the total costs of model development.
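
The budget split described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the authors' code: the function name `enumerate_allocations`, the `Allocation` record, and the per-example costs used in the usage note are all hypothetical, since this section does not state the actual annotation costs. It shows how a fixed budget, divided by an SFT percentage, translates into a number of SFT examples and a number of preference pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    sft_percent: int  # share of the budget spent on SFT annotations (0-100)
    n_sft: int        # SFT examples that share can pay for
    n_pft: int        # preference pairs the remaining budget can pay for


def enumerate_allocations(budget, sft_cost, pft_cost, sft_percents):
    """Return the example counts affordable under each SFT/PFT split.

    budget, sft_cost and pft_cost are integers in the same (hypothetical)
    cost unit; sft_percents is an iterable of integer percentages.
    """
    if sft_cost <= 0 or pft_cost <= 0:
        raise ValueError("per-example costs must be positive")
    allocations = []
    for pct in sft_percents:
        if not 0 <= pct <= 100:
            raise ValueError(f"SFT percentage out of range: {pct}")
        sft_budget = budget * pct // 100      # integer share given to SFT
        pft_budget = budget - sft_budget      # remainder goes to PFT
        allocations.append(
            Allocation(pct, sft_budget // sft_cost, pft_budget // pft_cost)
        )
    return allocations
```

For instance, assuming a budget of 10,000 units where an SFT example costs 2 units and a preference pair costs 1 unit, `enumerate_allocations(10_000, 2, 1, [0, 10, 100])` gives the pure-PFT split, a split with a small SFT share of the kind the abstract recommends for avoiding the cold start problem, and the pure-SFT split. Which split performs best is an empirical question that the paper's experiments address; this sketch only computes what each split can afford.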
