BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, Jingren Zhou

TL;DR

BOTS is a unified Bayesian framework for online task selection in LLM reinforcement finetuning. It addresses the data inefficiency of uniform task sampling by maintaining a posterior over task difficulty and fusing explicit rollout evidence with inferred implicit evidence. The approach uses a generalized Beta-Bernoulli update with forgetting and evidence-weighting parameters, plus an ultra-light interpolation plug-in that estimates difficulty for unevaluated tasks, with Thompson sampling balancing exploration and exploitation. Experiments across math, code, and logic domains on 1.5B and 7B models show consistent data-efficiency gains, fewer training steps to reach baseline performance, and robustness across domains, with negligible computational overhead. These results position BOTS as a practical, scalable method for dynamic task selection in RFT.

Abstract

Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
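The selection step described above can be illustrated with a minimal sketch. It assumes one Beta posterior per task and picks the tasks whose Thompson draws fall closest to a target difficulty. The function names `select_tasks` and `update_explicit` are hypothetical, and the update shown is the plain conjugate rule for evaluated tasks only; it omits the forgetting, evidence weighting, and implicit evidence that BOTS adds.

```python
import numpy as np


def select_tasks(alpha, beta, batch_size, target=0.5, rng=None):
    """Thompson-sampling selection (illustrative, not the authors' code).

    Draws one success probability per task from its Beta(alpha, beta)
    posterior and returns the indices of the `batch_size` tasks whose
    draws are closest to the target difficulty.
    """
    rng = np.random.default_rng() if rng is None else rng
    sampled_p = rng.beta(alpha, beta)        # one posterior draw per task
    distance = np.abs(sampled_p - target)    # closeness to target difficulty
    return np.argsort(distance)[:batch_size]


def update_explicit(alpha, beta, task_ids, successes, failures):
    """Plain conjugate Beta-Bernoulli update for the evaluated tasks.

    Assumption: no forgetting and no implicit evidence. `task_ids` must
    contain unique indices, which holds for the output of `select_tasks`.
    """
    alpha, beta = alpha.copy(), beta.copy()
    alpha[task_ids] += successes
    beta[task_ids] += failures
    return alpha, beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_p = np.linspace(0.0, 1.0, 200)      # hypothetical task pool
    alpha, beta = np.ones(200), np.ones(200)  # uniform Beta(1, 1) priors
    for _ in range(50):
        picked = select_tasks(alpha, beta, batch_size=8, rng=rng)
        wins = rng.binomial(16, true_p[picked])  # 16 rollouts per task
        alpha, beta = update_explicit(alpha, beta, picked, wins, 16 - wins)
    print("posterior means of last batch:", (alpha / (alpha + beta))[picked])
```

Because each draw comes from the posterior rather than from its mean, tasks with few observations still have a chance of being sampled near the target, which is how this scheme trades exploration against exploitation.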

Paper Structure

This paper contains 65 sections, 4 theorems, 27 equations, 20 figures, 13 tables.

Key Result

Proposition 1

Let $p\in(0,1)$ be the Bernoulli success probability at time $t$. Suppose the current belief is $\pi_t(p)=\mathrm{Beta}(p\mid \alpha_t,\beta_t)$, and let $\pi_0(p)=\mathrm{Beta}(p\mid \alpha_0,\beta_0)$ be a base prior. Given counts $(s_t,f_t)$ and pseudo counts $(\tilde{s}_t,\tilde{f}_t)$ with $s_t$ [...] with $\lambda\in(0,1)$ and $\rho\in[0,1]$. Then $\pi_{t+1}$ is exactly $\mathrm{Beta}(\alpha_{t+1},$ [...]
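For orientation, the textbook conjugate update that underlies this kind of result can be written out. This is the classical special case using only the observed counts $(s_t,f_t)$; it is not the generalized rule with $\lambda$ and $\rho$ stated in the proposition.

$$
\pi(p\mid s_t,f_t)\;\propto\;\underbrace{p^{s_t}(1-p)^{f_t}}_{\text{Bernoulli likelihood}}\cdot\underbrace{p^{\alpha_t-1}(1-p)^{\beta_t-1}}_{\mathrm{Beta}(\alpha_t,\beta_t)\text{ prior}}
\;=\;p^{\alpha_t+s_t-1}(1-p)^{\beta_t+f_t-1},
$$

so the posterior is $\mathrm{Beta}(\alpha_t+s_t,\ \beta_t+f_t)$. As a worked example, a uniform prior $\mathrm{Beta}(1,1)$ followed by 16 rollouts with 4 successes and 12 failures gives $\mathrm{Beta}(5,13)$, whose mean is $5/18\approx 0.278$.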

Figures (20)

  • Figure 1: Overview of the BOTS framework. BOTS operates in a continuous loop of task selection, model training, and posterior updating. (1) Selection: Thompson sampling from the posterior beliefs selects a batch of tasks whose estimated success probabilities are near a target difficulty (e.g., $p^*=0.5$). (2) Training & Evidence Collection: The LLM is finetuned, yielding direct success/failure counts (explicit evidence) for the selected batch. For unselected tasks, predicted counts (implicit evidence) are produced by a plug-in; in Section \ref{sec:implicit_evidence}, we introduce an ultra-lightweight interpolation-based variant with negligible overhead. (3) Posterior Updating: Explicit and implicit evidence are fused using our generalized Bayesian update rule (Section \ref{sec:diff_estimation}).
  • Figure 2: Qwen2.5-1.5B-Instruct on Math. Ratio of sampled training tasks (measured over 16 rollouts) with passing rates: strictly between 0 and 1, strictly greater than 0, and strictly less than 1, along with aggregated performance (MATH500 and AIME24), plotted against training steps.
  • Figure 3: Qwen2.5-1.5B-Instruct on Math. Ratio of sampled training tasks (measured over 16 rollouts) with passing rates: strictly between 0 and 1 (L1), strictly greater than 0 (L2), and strictly less than 1 (L3), along with MATH500 Accuracy (avg@1), plotted against training steps.
  • Figure 4: Wall-clock time breakdown across training phases for Qwen2.5-1.5B-Instruct (Left) and Qwen2.5-7B (Right) on GURU-Math, trained on 8 A100 GPUs. Runtime is averaged over the first 100 training steps. The cost of task selection—including posterior sampling, index sorting, and distribution parameter updates—is negligible compared to overall training.
  • Figure 5: Performance of the implicit evidence estimator during training of Qwen2.5-1.5B-Instruct (Left) and Qwen2.5-7B (Right), measured by Pearson Correlation and ROC AUC.
  • ...and 15 more figures
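The three ratios tracked in Figures 2 and 3 can be computed from per-task rollout outcomes. The sketch below assumes each sampled task is summarized by its number of successful rollouts out of 16; the function name `pass_rate_ratios` and the dictionary keys are hypothetical.

```python
def pass_rate_ratios(successes, num_rollouts=16):
    """Fraction of sampled tasks in each pass-rate category.

    `successes` holds, per task, the number of successful rollouts out of
    `num_rollouts`. A task's passing rate is successes / num_rollouts.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("need at least one task")
    # Passing rate strictly between 0 and 1: some rollouts pass, some fail.
    mixed = sum(0 < s < num_rollouts for s in successes)
    # Passing rate strictly greater than 0: at least one rollout passes.
    above_zero = sum(s > 0 for s in successes)
    # Passing rate strictly less than 1: at least one rollout fails.
    below_one = sum(s < num_rollouts for s in successes)
    return {
        "between_0_and_1": mixed / n,
        "greater_than_0": above_zero / n,
        "less_than_1": below_one / n,
    }


if __name__ == "__main__":
    # Four hypothetical tasks: never solved, always solved, and two mixed.
    print(pass_rate_ratios([0, 16, 8, 4]))
```

Tasks with a passing rate of exactly 0 or 1 are the unsolvable and trivial cases that the abstract describes as wasted computation, so the first ratio is the one a task selector would aim to raise.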

Theorems & Definitions (6)

  • Proposition 1
  • Proposition 2
  • Proposition 2
  • proof
  • Proposition 2
  • proof