Table of Contents
Fetching ...

Context Bootstrapped Reinforcement Learning

Saaket Agashe, Jayanth Srinivasa, Gaowen Liu, Ramana Kompella, Xin Eric Wang

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.

Context Bootstrapped Reinforcement Learning

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
Paper Structure (41 sections, 1 equation, 5 figures, 14 tables, 1 algorithm)

This paper contains 41 sections, 1 equation, 5 figures, 14 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the Context Bootstrapped Reinforcement Learning (CBRL) framework. (a) The injection probability annealing schedule. At each timestep $t$, training batches consist of samples without few shots and samples with few shots, with the proportion governed by $p_i$, which decreases linearly from $p_{\mathrm{start}}$ to $p_{\mathrm{end}}$. (b) Construction of the two sample types. Samples without exemplars (top) present only the task query followed by an empty assistant turn. Samples with exemplars (bottom) prepend solved few-shot demonstrations as prior user--assistant exchanges before the target query.
  • Figure 2: Equivalent implementations of a fixed-point search algorithm in Python (left) and Q (right). Both solutions use binary search to find an index $i$ such that $\texttt{arr}[i] = i$. While the Python version follows conventional syntax familiar to pretrained language models, the Q implementation employs a terse, array-oriented style with right-to-left evaluation, minimal punctuation, and implicit returns.
  • Figure 3: Mean accuracy ($\pm$ standard error across three runs) of Qwen2.5-3B-Instruct trained with baseline RLOO and CBRL RLOO across five Reasoning Gym environments. Bold indicates the higher mean.
  • Figure 4: Training reward curves for CBRL and baseline across three settings: Q Programming with GRPO (left), Word Sorting with GRPO (center), and Word Sorting with RLOO (right). Shaded regions indicate $p_i > 0.25$ (high injection). CBRL achieves higher early reward by guiding the model toward successful rollouts, bootstrapping the learning process. The advantage persists after injection stops, demonstrating that CBRL addresses exploration inefficiency without creating long-term dependence on demonstrations.
  • Figure 5: Qualitative comparison of model outputs on the task: "Sort the words violates, yes, already, completing, pages, duty, his, EXPRESS, duly in ascending ASCII order." Baseline (left) offers a superficial explanation and produces an incorrect ordering. CBRL (center) explicitly reasons about ASCII values (e.g., 'E' = 69, 'a' = 97), systematically compares characters, and arrives at the correct answer. The few-shot example (right) shows the step-by-step ASCII reasoning pattern used during CBRL training.