BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, Jingren Zhou
TL;DR
BOTS is a unified Bayesian framework for online task selection in LLM reinforcement finetuning (RFT). It addresses the data inefficiency of uniform task sampling by maintaining a posterior over each task's difficulty and fusing explicit evidence from rollouts on selected tasks with implicit evidence inferred for unselected tasks. The posterior is updated with a generalized Beta-Bernoulli rule that has forgetting and evidence-weighting parameters, an ultra-light interpolation plug-in estimates the difficulty of unevaluated tasks without extra rollouts, and Thompson sampling balances exploration and exploitation. Experiments across math, code, and logic domains on 1.5B and 7B models show consistent data-efficiency gains, fewer training steps to reach baseline performance, and robustness across domains, with negligible computational overhead. These results position BOTS as a practical, scalable method for dynamic task selection in RFT.
Abstract
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
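To make the selection loop described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes one Beta posterior per task over its success rate, a forgetting step that shrinks the posterior toward the prior, explicit pass/fail counts for evaluated tasks, and down-weighted pseudo-counts for implicit evidence. The class name `TaskPosterior`, the parameters `forget` and `implicit_weight`, the target success rate of 0.5, and the exact form of the update are all illustrative assumptions.

```python
import random


class TaskPosterior:
    """Hypothetical per-task Beta posterior over success rate."""

    def __init__(self, n_tasks, prior=1.0):
        self.prior = prior
        self.alpha = [prior] * n_tasks  # pseudo-count of successes
        self.beta = [prior] * n_tasks   # pseudo-count of failures

    def mean(self, i):
        """Posterior mean success rate of task i."""
        return self.alpha[i] / (self.alpha[i] + self.beta[i])

    def select(self, k, rng, target=0.5):
        """Thompson sampling: draw one success rate per task from its
        posterior, then pick the k tasks whose draw is closest to `target`
        (assumed here to mark tasks that are neither trivial nor unsolvable)."""
        draws = [rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(draws)), key=lambda i: abs(draws[i] - target))
        return ranked[:k]

    def update(self, explicit, implicit, forget=0.1, implicit_weight=0.2):
        """explicit / implicit map task index -> (successes, failures).
        Implicit counts are predictions for tasks that were not rolled out."""
        for i in range(len(self.alpha)):
            # Forgetting: decay old evidence toward the prior, since the
            # model being trained keeps changing.
            self.alpha[i] = (1 - forget) * self.alpha[i] + forget * self.prior
            self.beta[i] = (1 - forget) * self.beta[i] + forget * self.prior
            if i in explicit:
                s, f = explicit[i]
                self.alpha[i] += s
                self.beta[i] += f
            elif i in implicit:
                # Implicit evidence counts for less than a real rollout.
                s, f = implicit[i]
                self.alpha[i] += implicit_weight * s
                self.beta[i] += implicit_weight * f
```

A training loop under these assumptions would call `select` to choose a batch, run rollouts on it, and pass the observed counts as `explicit` together with predicted counts for the remaining tasks as `implicit`. Larger values of `forget` make the estimates track the current model more closely at the cost of noisier posteriors.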
