RL-Guided Data Selection for Language Model Finetuning

Animesh Jha, Harshit Gupta, Ananjan Nandi

TL;DR

This paper addresses how to select a small, high-quality data subset for finetuning large language models under a fixed data budget. It reformulates data selection as a tractable Markov Decision Process over semantic clusters and trains RL agents (DQN, PPO) to assemble subsets sequentially, using a proxy-model validation signal as the reward and augmenting training with exploration and reward-model-based rollouts. Across four diverse tasks, finetuning on only $5\%$ of the data often matches or exceeds full-dataset finetuning, with gains of up to $10.8$ accuracy points (notably on MetaHate) and wall-clock time reduced by up to $2\times$. The findings indicate that RL-guided data selection can filter noisy or redundant data while retaining strong downstream performance, and they highlight the importance of reward design and clustering strategy in this setting.

Abstract

Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem is generally intractable, and existing approximate approaches are pretraining-oriented and transfer poorly to the fine-tuning setting. We reformulate this problem as a tractable Markov Decision Process (MDP) and train agents using various Reinforcement Learning (RL) methods to learn optimal data selection policies, guided by an efficient, proxy-model-based reward signal. Across four datasets, training on a $5\%$ subset selected by our approach matches or outperforms fine-tuning on the full dataset by up to $10.8$ accuracy points, while cutting wall-clock training time by up to $2 \times$, highlighting the promise of RL-guided data selection.
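The cluster-level MDP described above can be illustrated with a minimal sketch. This is not the authors' implementation. The class `ClusterSelectionEnv`, the function `toy_proxy_reward`, the per-cluster quality scores, and the choice to give the reward only once the budget is exhausted are all assumptions made for illustration. The state is a binary mask over clusters, an action selects one unselected cluster, and the reward function stands in for a proxy model's validation accuracy.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # binary mask: 1 means the cluster is in the subset


class ClusterSelectionEnv:
    """Toy episodic environment: pick `budget` clusters out of `num_clusters`."""

    def __init__(self, num_clusters: int, budget: int,
                 proxy_reward: Callable[[State], float]):
        if not 0 < budget <= num_clusters:
            raise ValueError("budget must be between 1 and num_clusters")
        self.num_clusters = num_clusters
        self.budget = budget
        # Hypothetical stand-in for a proxy-model validation score.
        self.proxy_reward = proxy_reward
        self.mask: List[int] = [0] * num_clusters

    def reset(self) -> State:
        self.mask = [0] * self.num_clusters
        return tuple(self.mask)

    def valid_actions(self) -> List[int]:
        # Only clusters that have not been selected yet may be chosen.
        return [i for i, bit in enumerate(self.mask) if bit == 0]

    def step(self, action: int) -> Tuple[State, float, bool]:
        if self.mask[action] == 1:
            raise ValueError("cluster already selected")
        self.mask[action] = 1
        state = tuple(self.mask)
        done = sum(self.mask) == self.budget
        # Assumption: the reward is terminal, computed once the budget is used up.
        reward = self.proxy_reward(state) if done else 0.0
        return state, reward, done


def toy_proxy_reward(mask: State) -> float:
    # Made-up per-cluster quality scores; a real reward would train and
    # evaluate a proxy model on the selected clusters.
    quality = [0.9, 0.2, 0.7, 0.1, 0.5]
    chosen = [q for q, bit in zip(quality, mask) if bit]
    return sum(chosen) / len(chosen) if chosen else 0.0


def random_policy(state: State, valid: List[int], rng: random.Random) -> int:
    # Placeholder for a learned policy such as a DQN or PPO agent.
    return rng.choice(valid)


def run_episode(env: ClusterSelectionEnv, policy, rng: random.Random):
    state, reward, done = env.reset(), 0.0, False
    while not done:
        action = policy(state, env.valid_actions(), rng)
        state, reward, done = env.step(action)
    return state, reward


if __name__ == "__main__":
    env = ClusterSelectionEnv(num_clusters=5, budget=2,
                              proxy_reward=toy_proxy_reward)
    final_mask, final_reward = run_episode(env, random_policy, random.Random(0))
    print(final_mask, final_reward)
```

In this sketch, a trained agent would replace `random_policy`, and the episode length equals the budget, so the search space is the set of cluster subsets of that size rather than subsets of individual examples.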

Paper Structure

This paper contains 41 sections, 6 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: UMAP projections of explored state (binary mask) encodings, colored by their subsampled validation set accuracy.
  • Figure 2: Downstream performance vs. number of clusters for ANLI with Random-Search and stratified k-means.
  • Figure 3: Histogram of label ratios across clusters using k-means in the GooglePlay dataset.
  • Figure 4: Downstream performance vs. training time for the Random and Full baselines, along with two DQN-based approaches.