Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning

Mohit Raghavendra, Junmo Kang, Alan Ritter

TL;DR

The paper analyzes how to allocate a fixed data-annotation budget between Supervised Finetuning (SFT) and Preference Finetuning (PFT) in post-training of small- to medium-sized LLMs. It finds that SFT dominates performance in very low-data settings, while a hybrid SFT+PFT pipeline becomes advantageous with larger budgets, often favoring more preference data. A notable cold-start problem emerges when applying PFT directly to the base model, which can be mitigated by allocating a small portion of the budget to SFT first, yielding substantial gains on tasks requiring structured reasoning like GSM8k. The study provides actionable guidance on budget-aware data collection, showing how costs of SFT vs PFT influence the optimal mix and highlighting when SFT is essential for enabling effective PFT.

Abstract

Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT) using methods like Direct Preference Optimization (DPO). Both stages require annotated data that differ substantially in structure and cost. We study how to optimally allocate a fixed training data budget between the two stages, through extensive experiments spanning four diverse tasks, multiple model sizes and various data annotation costs. Our findings reveal that just SFT on the base model dominates performance in low-data regimes ($<1,000$ annotated examples). With larger data budgets, we observe that a combination of SFT and PFT, often with increasing portions allocated towards preference data, yields optimal performance. However, completely eliminating SFT and running PFT directly on the base model yields suboptimal performance, described as the cold start problem, on tasks like mathematics. We observe that this is due to the distribution shift arising from using DPO directly on the base model to elicit step-by-step reasoning. This limitation can be effectively addressed by allocating even a small portion ($<10\%$) of the budget to SFT first, resulting in performance improvements of $15$-$20\%$ on analytical benchmarks like GSM8k. These results provide actionable insights for researchers and practitioners optimizing model development under budget constraints, where high-quality data curation often represents a significant portion of the total costs of model development.
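
The allocation question in the abstract can be made concrete with a minimal sketch. It assumes the budget is counted in annotated examples and that SFT and preference examples cost the same per example. The function name `split_examples` and its parameters are hypothetical and are not taken from the paper's code.

```python
def split_examples(total_examples: int, sft_ratio: float) -> tuple[int, int]:
    """Split a fixed example budget into (SFT examples, PFT examples).

    Assumption: both kinds of example cost the same, so the budget is
    simply a count. sft_ratio is the fraction of the budget given to SFT;
    the remainder goes to preference data.
    """
    if total_examples < 0:
        raise ValueError("total_examples must be non-negative")
    if not 0.0 <= sft_ratio <= 1.0:
        raise ValueError("sft_ratio must lie in [0, 1]")
    n_sft = round(total_examples * sft_ratio)
    n_pft = total_examples - n_sft  # whatever is left goes to PFT
    return n_sft, n_pft


if __name__ == "__main__":
    # A small SFT share before PFT, versus the two extremes.
    for ratio in (1.0, 0.1, 0.01, 0.0):
        print(ratio, split_examples(10_000, ratio))
```

A ratio of 1.0 corresponds to SFT only and 0.0 to PFT applied directly to the base model, the setting the abstract associates with the cold start problem. Unequal annotation costs would require converting a monetary budget into example counts per stage, which this sketch does not attempt.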

Paper Structure

This paper contains 36 sections, 3 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: An illustration of the choices that introduce the data-allocation trade-off in LLM post-training. Given a fixed limited budget, one has to decide how much to allocate for annotating SFT data and how much for preference annotation (PFT data).
  • Figure 2: Effect of varying the SFT-PFT data mix on the performance of Llama3.1-8B (top) and Qwen2.5-7B (bottom) base models. The x-axis represents the number of training examples (data budget), and the y-axis represents performance, measured using different task-specific metrics. The ratios represent the fraction of the training data allocated for SFT; the rest is for Preference Finetuning. The orange line shows the performance when trained using only SFT data (1.0 ratio). The progressively darker red-shaded lines indicate decreasing proportions of SFT data in the training set, down to using only PFT data directly on the base model (0.0 ratio).
  • Figure 3: Comparison of SFT against the KTO method of PFT. We notice similar relative performance scaling patterns as SFT vs DPO, highlighting the general trends in preference data-based finetuning against SFT.
  • Figure 4: Scaling patterns of SFT and PFT (using DPO) directly on the Llama3 models: 8B (top), 3B (middle) and 1B (bottom). We observe that SFT shows a consistent improvement on the task across all model sizes. However, directly applying PFT shows improvements only in large-data regimes, and only in larger model sizes.
  • Figure 5: Performance of DPO and KTO models with decreasing SFT data ratio of $0.1$, $0.01$ (highlighted in red dotted lines) and $0$ (Pure PFT), for the same total data budget. We see that even a minimal amount of SFT can have outsized benefits in both cases, with the improvements being more drastic in analytical tasks like math compared to gradual improvements in stylistic tasks like instruction following.
  • ...and 6 more figures