Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT

Jiacheng Wang, Ping Jian, Zhen Yang, Zirong Chen, Keren Liao, Zhongbin Guo

TL;DR

This work addresses catastrophic forgetting during downstream supervised fine-tuning (SFT) under distribution shift with a data-centric solution: an RL-based rewriting agent that rewrites supervision targets to better align with the backbone’s QA-style generation distribution while preserving diversity. The rewriting policy is parameter-efficient (LoRA on a frozen base model) and is trained with Group Relative Policy Optimization (GRPO) to jointly optimize task consistency, distributional alignment, and diversity under a hard feasibility gate. A Generate-Verify-Fallback pipeline constructs rewrites that pass feasibility checks; these rewrites are then used for standard SFT, yielding downstream gains comparable to vanilla SFT with substantially less forgetting on general-domain benchmarks. Across multiple backbones and math-focused datasets, the approach shows a favorable gain-forgetting trade-off, suggesting that RL-guided data rewriting can stabilize fine-tuning under distribution shift and improve robustness on downstream tasks.
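The Generate-Verify-Fallback idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the function names, the retry budget `max_attempts`, the choice to fall back to the original target when no candidate passes, and the toy verifier that compares the last number in each text.

```python
import re
from typing import Callable, Optional, Tuple


def final_number(text: str) -> Optional[str]:
    """Toy answer extractor (hypothetical): the last number in the text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def answers_match(question: str, original: str, candidate: str) -> bool:
    """Toy feasibility check: the rewrite must keep the original final answer."""
    reference = final_number(original)
    return reference is not None and final_number(candidate) == reference


def generate_verify_fallback(
    question: str,
    original_target: str,
    generate: Callable[[str, str], str],
    verify: Callable[[str, str, str], bool],
    max_attempts: int = 3,
) -> Tuple[str, bool]:
    """Return (target, was_rewritten).

    Generate: ask the rewriting policy for a candidate target.
    Verify:   keep the candidate only if it passes the feasibility check.
    Fallback: if no candidate passes, keep the original target (assumed).
    """
    for _ in range(max_attempts):
        candidate = generate(question, original_target)
        if verify(question, original_target, candidate):
            return candidate, True
    return original_target, False
```

In this sketch `generate` stands in for the trained rewriting policy and `verify` for whatever feasibility check is used; the returned flag makes it easy to separate successfully rewritten examples from fallbacks when building the SFT set.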

Abstract

Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model's prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been proposed as a data-centric approach that rewrites downstream training data prior to SFT. However, existing methods typically sample rewrites from a prompt-induced conditional distribution, so the resulting targets are not necessarily aligned with the model's natural QA-style generation distribution. Moreover, reliance on fixed templates can lead to diversity collapse. To address these issues, we cast data rewriting as a policy learning problem and learn a rewriting policy that better matches the backbone's QA-style generation distribution while preserving diversity. Since distributional alignment, diversity, and task consistency are automatically evaluable but difficult to optimize end-to-end with differentiable objectives, we leverage reinforcement learning to optimize the rewrite distribution under reward feedback and propose an RL-based data-rewriting agent. The agent jointly optimizes QA-style distributional alignment and diversity under a hard task-consistency gate, thereby constructing a higher-quality rewritten dataset for downstream SFT. Extensive experiments show that our method achieves downstream gains comparable to standard SFT while reducing forgetting on non-downstream benchmarks by 12.34% on average. Our code is available at https://anonymous.4open.science/r/Patch-the-Prompt-Gap-4112
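To make "optimizing alignment and diversity under a hard task-consistency gate" concrete, the following Python sketch shows one way a gated scalar reward could be combined with group-normalized advantages in the style of GRPO. The abstract does not specify the reward form, so the weighted sum, the weights `w_align` and `w_div`, the failure reward `fail_reward`, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def gated_reward(
    passes_gate: bool,
    alignment: float,
    diversity: float,
    w_align: float = 1.0,
    w_div: float = 1.0,
    fail_reward: float = 0.0,
) -> float:
    """Hard gate (assumed form): a candidate that fails the task-consistency
    check gets a fixed reward, regardless of its other scores."""
    if not passes_gate:
        return fail_reward
    return w_align * alignment + w_div * diversity


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of K candidates for the same input:
    subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of K = 3 candidate rewrites for a single training example.
rewards = [
    gated_reward(True, alignment=0.8, diversity=0.2),
    gated_reward(False, alignment=0.9, diversity=0.9),  # fails the gate
    gated_reward(True, alignment=0.4, diversity=0.1),
]
advantages = group_relative_advantages(rewards)
```

Under this gate, the second candidate receives the lowest advantage in its group even though its alignment and diversity scores are the highest, which is the intended effect of making task consistency a hard constraint rather than one more term in the sum.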

Paper Structure (67 sections, 19 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: The framework of the rewriting agent.
  • Figure 2: Downstream SFT training loss over training steps for three backbones and four methods.
  • Figure 3: Downstream SFT training loss on the success-only subset for Llama-3.2-3B-Instruct.
  • Figure 4: Effect of candidate group size $K$ in GRPO training of the rewriting agent on Llama-3.2-3B-Instruct.