PAFT: Prompt-Agnostic Fine-Tuning

Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu

TL;DR

PAFT addresses the problem of prompt-induced brittleness in fine-tuning large language models by introducing a two-phase framework that constructs a diverse set of synthetic prompts and trains with dynamic prompt variation. Through candidate prompt construction using an LLM ensemble and a dual prompting strategy, followed by a dynamic fine-tuning process, PAFT learns task semantics that generalize across unseen prompts. Empirical results show PAFT yields higher generalization to unseen prompts (up to 7% improvement) and stronger downstream performance across QA, reasoning, and tool use tasks, alongside up to 3.2× faster inference due to reduced prompt sensitivity. The approach is supported by theoretical insights linking prompt diversity to improved cross-domain generalization via domain adaptation bounds and MMD-based discrepancy controls, underscoring PAFT’s practical impact for robust, prompt-agnostic LLM deployment.

Abstract

Fine-tuning large language models (LLMs) often causes overfitting to specific prompt wording, where minor phrasing variations drastically reduce performance. To address this, we propose Prompt-Agnostic Fine-Tuning (PAFT), a method that enhances robustness through dynamic prompt variation during training. PAFT first generates diverse synthetic prompts, then continuously samples from this set to construct training instances, forcing models to learn fundamental task principles rather than surface-level patterns. Across systematic evaluations using both supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT), PAFT demonstrates substantially improved prompt robustness, achieving 7% higher generalization accuracy on unseen prompts than standard methods. In addition to enhanced robustness, PAFT consistently yields superior overall performance on established benchmarks for question answering, mathematical reasoning, and tool use. Notably, models trained with PAFT attain 3.2× faster inference speeds due to reduced prompt sensitivity. Ablation studies further validate the effectiveness of PAFT, while theoretical analysis reveals that PAFT can effectively enhance the cross-domain generalization ability of LLMs.
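
The two steps described in the abstract (build a pool of candidate prompts, then keep resampling from it while constructing training instances) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function name `paft_style_batches`, the `{question}` template placeholder, and the `steps_per_prompt` parameter are all assumptions made for illustration, and the candidate prompts are hand-written stand-ins for LLM-generated ones.

```python
import random
from typing import Iterator, List, Tuple


def paft_style_batches(
    examples: List[Tuple[str, str]],
    candidate_prompts: List[str],
    epochs: int,
    steps_per_prompt: int = 1,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (model_input, target) pairs, resampling the prompt during training.

    Hypothetical helper: each template must contain a `{question}` placeholder,
    and a new template is drawn every `steps_per_prompt` examples.
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        order = examples[:]          # copy so the caller's list is not mutated
        rng.shuffle(order)
        for i, (question, answer) in enumerate(order):
            if i % steps_per_prompt == 0:
                prompt = rng.choice(candidate_prompts)
            yield prompt.format(question=question), answer


if __name__ == "__main__":
    prompts = [
        "Question: {question}\nAnswer:",
        "Please answer the following.\n{question}",
        "{question}\nRespond concisely:",
    ]
    data = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
    for model_input, target in paft_style_batches(data, prompts, epochs=2):
        print(repr(model_input), "->", target)
```

In a real setup the yielded pairs would feed an SFT or RLFT training loop, and a separate set of prompts would be held out so that robustness is measured on wording the model never saw.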

Paper Structure

This paper contains 27 sections, 1 theorem, 4 equations, 9 figures, 11 tables, 1 algorithm.

Key Result

Proposition 1

The discrepancy term $\text{Disc}(\mathcal{P}, \mathcal{Q})$ can be bounded from above by the Maximum Mean Discrepancy (MMD) (Gao et al., 2021) between the two distributions; the bound and the definition of MMD are stated as displayed equations in the paper.
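
For readers unfamiliar with MMD, the sketch below computes the standard biased sample estimate of the squared kernel MMD, $\widehat{\text{MMD}}^2 = \overline{k(x, x')} + \overline{k(y, y')} - 2\,\overline{k(x, y)}$, with a Gaussian (RBF) kernel. It illustrates the general quantity only: the kernel, the bandwidth, and the idea of feeding in prompt feature vectors are assumptions, and the exact form of the bound in Proposition 1 is not reproduced here.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], bandwidth: float = 1.0) -> float:
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(
    xs: List[Sequence[float]],
    ys: List[Sequence[float]],
    bandwidth: float = 1.0,
) -> float:
    """Biased (V-statistic) estimate of MMD^2 between two samples."""

    def mean_kernel(a: List[Sequence[float]], b: List[Sequence[float]]) -> float:
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors standing in for prompt embeddings.
    train_prompts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
    near_prompts = [(0.1, 0.1), (0.0, 0.2), (0.2, 0.1)]
    far_prompts = [(3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
    print("near:", mmd_squared(train_prompts, near_prompts))
    print("far: ", mmd_squared(train_prompts, far_prompts))
```

A small value means the two samples look alike under the chosen kernel, which matches the intuition that a diverse training prompt set keeps the discrepancy to unseen prompts small.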

Figures (9)

  • Figure 1: This figure shows how minor prompt changes drastically impact model accuracy. For instance, a one-word alteration to a prompt for the same user question reduced dataset accuracy from 86.27% to 66.93%. This highlights severe performance swings in models lacking prompt robustness.
  • Figure 2: An overview of PAFT: This figure contrasts SFT with PAFT. While SFT relies on fixed datasets and predefined prompts, which limits robustness and cross-prompt generalization, PAFT employs dynamic prompt selection during training, significantly enhancing prompt robustness and generalization capabilities. By leveraging commercial LLMs to generate diverse candidate prompts, PAFT delivers a more scalable and generalizable solution for large language model adaptation.
  • Figure 3: This figure presents experimental results across four datasets comparing base and SFT model performance on 450 diverse prompts (both human-written and LLM-generated). Probability distribution plots reveal that despite SFT's overall accuracy improvements, substantial performance variability persists: certain prompts yield markedly lower accuracy, with high standard deviations indicating significant prompt-dependent fluctuations. These findings underscore the crucial impact of prompt wording and demonstrate the need for prompt-agnostic fine-tuning approaches.
  • Figure 4: As a visual counterpart to the prompt-impact figure (fig:prompt_impact), we present performance distributions of base models, SFT models, and PAFT across multiple reasoning and reading comprehension tasks. The probability distribution plots illustrate performance on unseen test prompts (both human-written and LLM-generated) not used during PAFT training. The results show that PAFT consistently achieves higher accuracy and lower variance across all tasks, confirming its effectiveness in enhancing prompt robustness.
  • Figure 5: The performance of TopAccuracy, User-specified, BATprompt, ZOPO, and PAFT models is compared on multiple reasoning and reading comprehension tasks. Results are reported as accuracy distributions. The tests are conducted on a test set of 50 unseen prompts, different from the ones used in training. PAFT outperforms the other baselines, achieving higher accuracy and lower variance in all tasks.
  • ...and 4 more figures
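
The figure captions above compare methods by the distribution of accuracy over many prompts (higher mean, lower variance). A minimal sketch of how such per-prompt statistics could be computed follows. The function name `prompt_robustness_summary` and the input layout (a mapping from prompt text to per-example correctness flags) are assumptions for illustration, not the paper's evaluation code.

```python
from statistics import mean, pstdev
from typing import Dict, List


def prompt_robustness_summary(correct_by_prompt: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize how accuracy varies across prompts.

    `correct_by_prompt` maps each test prompt to a list of booleans, one per
    evaluated example (True if the model answered correctly under that prompt).
    """
    accuracies = [sum(flags) / len(flags) for flags in correct_by_prompt.values()]
    return {
        "mean_accuracy": mean(accuracies),
        "std_accuracy": pstdev(accuracies),      # population std across prompts
        "worst_prompt_accuracy": min(accuracies),
    }


if __name__ == "__main__":
    # Toy data: the same four questions evaluated under two prompt wordings.
    results = {
        "Answer the question:": [True, True, True, False],
        "Pls answer this question": [True, False, False, False],
    }
    print(prompt_robustness_summary(results))
```

A prompt-robust model should show a high mean together with a small standard deviation and a worst-prompt accuracy close to the mean.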

Theorems & Definitions (2)

  • Proposition 1: MMD as an Upper Bound on Discrepancy
  • Proof: Proof Sketch