Instruction Fine-Tuning: Does Prompt Loss Matter?

Mathew Huerta-Enochian, Seung Yong Ko

TL;DR

Models fine-tuned on short-completion data showed a statistically significant negative quadratic relationship between performance and prompt loss weight (PLW). This research serves as a warning to API providers about the importance of providing a PLW parameter for supervised instruction fine-tuning (SIFT).

Abstract

We present a novel study analyzing the effects of various prompt loss token weights (PLW) for supervised instruction fine-tuning (SIFT). While prompt-masking (PLW = 0) is common for SIFT, some fine-tuning APIs support fractional PLWs and suggest that using a small non-zero PLW can help stabilize learning when fine-tuning on short-completion data. However, there has never been a study confirming this claim, and OpenAI, a major cloud-based SIFT provider, recently removed this parameter from their fine-tuning API. We found that performance of models fine-tuned on short-completion data had a statistically significant negative quadratic relationship with PLW. Using small values (0.01 to 0.5) of PLW produced better results on multiple-choice and short-generation benchmarks (outperforming models fine-tuned on long-completion data), while large values (≈ 1.0) of PLW produced better results on long-generation benchmarks. We explained this effect and verified its importance through additional experiments. This research serves as a warning to API providers about the importance of providing a PLW parameter for SIFT.
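
To make the role of PLW concrete, the following is a minimal sketch of a token-weighted loss in which prompt tokens are scaled by PLW and completion tokens keep a weight of 1. The function name `plw_weighted_loss`, its arguments, and the choice to normalize by the sum of weights are illustrative assumptions, not the paper's stated formulation or any provider's API.

```python
import math


def plw_weighted_loss(token_log_probs, is_prompt, plw):
    """Weighted mean negative log-likelihood over one prompt+completion sequence.

    token_log_probs: log-probability the model assigns to each target token.
    is_prompt: True for prompt tokens, False for completion tokens.
    plw: prompt loss weight; 0 masks the prompt, 1 treats all tokens equally.

    Assumption: the loss is normalized by the sum of the token weights.
    """
    if len(token_log_probs) != len(is_prompt):
        raise ValueError("token_log_probs and is_prompt must have equal length")
    # Prompt tokens get weight plw; completion tokens always get weight 1.
    weights = [plw if prompt else 1.0 for prompt in is_prompt]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no token carries a non-zero weight")
    weighted_nll = -sum(w * lp for w, lp in zip(weights, token_log_probs))
    return weighted_nll / total_weight


# Hypothetical sequence: four prompt tokens followed by two completion tokens.
log_probs = [math.log(0.5)] * 4 + [math.log(0.25)] * 2
mask = [True] * 4 + [False] * 2
for plw in (0.0, 0.1, 1.0):
    print(plw, plw_weighted_loss(log_probs, mask, plw))
```

With PLW = 0 only the two completion tokens contribute, which corresponds to prompt-masking. With a short completion and a long prompt, even a small PLW gives the prompt tokens a noticeable share of the total weight.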

Paper Structure (36 sections, 2 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Performance by transformed PLW. (a) A simple performance aggregate score (the unweighted mean of benchmark scores). (b), (c), (d) Relative aggregate performance scores, where the scores for each task and group are min-max scaled to show common trends regardless of scale. Note that only the AlpacaDataShort models' aggregate scores show a relationship with transformed PLW. Best viewed in color.
  • Figure 2: Analysis of causal mechanism. Boxplots use the 0.25, 0.5, and 0.75 quantiles with whiskers at the 0.09 and 0.91 quantiles. Best viewed in color. (a) Training Loss Stability: Relative Standard Deviation (RSD) of five-step training loss windows shows increased instability for small (non-zero) PLWs. (b) Weight Distance: Distance between learned weights and PTLM weights is smaller for small (non-zero) PLWs. (c) Train Data Memorization: Completion SacreBLEU scores on training data prompts as an indicator of overfitting. (d) AE Generation Length: Generation lengths on the Alpaca Eval test set for varying PLW values.
  • Figure 3: Relative aggregate scores showing the effects of PLW for SIFT on alternative datasets. (a) UltraFeedbackCleaned and DatabricksDolly models. (b) UltraFeedbackShort and DatabricksDollyShort models.
  • Figure 4: Examples of modifying prompt-completion ratios using prompt inversion, best viewed in color. To prompt-invert instances, we re-frame the prompt-completion task as an original-prompt-prediction task. I.e., we teach the model to predict the original instruction given an example completion and optional input. In the first example above, prompt inversion changes the instance's word-based completion-prompt ratio $R_g$ from $34/(7+0)=4.857$ to $7/(9+34)=0.163$.
  • Figure 5: Group I benchmark performance. Note the negative quadratic relationship with transformed PLW.
  • ...and 5 more figures
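
The quantities named in the captions above can be sketched in a few lines of Python. This is an illustrative reading of the captions, not the authors' code. The function names are hypothetical, and so are the choices of non-overlapping windows, population standard deviation, and mapping a constant score list to zeros.

```python
import statistics


def completion_prompt_ratio(completion_words, prompt_part_words):
    """Word-based completion-prompt ratio R_g (Figure 4).

    prompt_part_words lists the word counts of the parts that make up the prompt.
    """
    prompt_words = sum(prompt_part_words)
    if prompt_words <= 0:
        raise ValueError("prompt must contain at least one word")
    return completion_words / prompt_words


def min_max_scale(scores):
    """Rescale scores to [0, 1] so trends are comparable across tasks (Figure 1)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: a constant series carries no trend, so map it to zeros.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def windowed_rsd(losses, window=5):
    """Relative standard deviation (std / mean) per loss window (Figure 2a).

    Assumptions: windows do not overlap, a trailing partial window is dropped,
    the population standard deviation is used, and losses are positive.
    """
    rsds = []
    for start in range(0, len(losses) - window + 1, window):
        chunk = losses[start:start + window]
        rsds.append(statistics.pstdev(chunk) / statistics.mean(chunk))
    return rsds


# The two ratios quoted in the Figure 4 caption.
print(round(completion_prompt_ratio(34, [7, 0]), 3))   # 4.857
print(round(completion_prompt_ratio(7, [9, 34]), 3))   # 0.163
```

Taking word counts rather than raw text keeps the sketch independent of any particular tokenization or word-splitting rule, which the captions do not specify.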