Semantic Soft Bootstrapping: Long Context Reasoning in LLMs without Reinforcement Learning
Purbesh Mitra, Sennur Ulukus
TL;DR
The paper tackles the inefficiencies of reinforcement-learning-based long-context reasoning in LLMs by introducing Semantic Soft Bootstrapping (SSB), an RL-free self-distillation framework. SSB constructs a teacher–student paradigm within a single model: a hinted teacher synthesizes robust explanations from correct and common incorrect rollouts, while the hint-free student learns to imitate the teacher's answer-token distributions via logit-level KL distillation. The approach yields substantial improvements over GRPO on MATH500 and AIME2024 benchmarks (≈10% gains) using a small, curated dataset and parameter-efficient fine-tuning, while exhibiting stable training dynamics and no need for increasingly long chain-of-thoughts. These results suggest a scalable, compute-efficient alternative to RLVR that can be extended to larger models and broader domains with in-context hint-based supervision at the logit level.
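To make the logit-level KL distillation objective concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the loss is the mean KL divergence from the teacher's to the student's next-token distribution over the answer-token positions. The direction of the KL term, the `temperature` parameter, and all function names here are illustrative assumptions that the summary above does not specify.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert a list of logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL(teacher || student) over answer-token positions.

    Both arguments are lists of per-position logit vectors of equal length.
    Assumption: the teacher logits come from the hinted prompt and the
    student logits from the bare question, aligned on the same answer tokens.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

In a real training loop the same quantity would be computed on tensors with the teacher logits held fixed, so that gradients flow only through the student's forward pass.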
Abstract
Long-context reasoning via chain-of-thought (CoT) inference has been shown to enhance the cognitive capabilities of large language models (LLMs). Such models are usually trained with reinforcement learning with verifiable rewards (RLVR) on reasoning-based problems, such as math and programming. However, RLVR is limited by several bottlenecks, such as the lack of dense rewards and inadequate sample efficiency. As a result, it requires significant compute resources in the post-training phase. To overcome these limitations, in this work, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem, and several rollouts are generated. From these, the correct response and the most common incorrect response are filtered and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. The generation process also produces a sequence of logits, which the student model tries to match in the training phase from the bare question alone. In our experiments, we fine-tune Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning. We then test its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show accuracy improvements of 10.6% and 10%, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at https://github.com/purbeshmitra/semantic-soft-bootstrapping, and the model and curated dataset are available at https://huggingface.co/purbeshmitra/semantic-soft-bootstrapping.
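The rollout-filtering step described in the abstract can be sketched as follows. This is a hedged illustration rather than the released pipeline: the function name `select_pair`, the `(response_text, final_answer)` tuple layout, exact string matching against the reference answer, and the choice to skip problems whose rollouts are all correct or all incorrect are assumptions made for this example.

```python
from collections import Counter


def select_pair(rollouts, reference_answer):
    """Pick one correct response and one response with the most common wrong answer.

    rollouts: list of (response_text, final_answer) tuples sampled for one problem.
    Returns (correct_text, incorrect_text), or None when the rollouts do not
    contain both a correct and an incorrect response (assumed to be skipped).
    """
    correct = [text for text, answer in rollouts if answer == reference_answer]
    wrong = [(text, answer) for text, answer in rollouts if answer != reference_answer]
    if not correct or not wrong:
        return None

    # Find the wrong final answer that occurs most often across the rollouts.
    most_common_wrong, _ = Counter(answer for _, answer in wrong).most_common(1)[0]

    # Use the first rollout that ended in that wrong answer as the negative example.
    incorrect_text = next(text for text, answer in wrong if answer == most_common_wrong)
    return correct[0], incorrect_text


rollouts = [
    ("2 + 2 = 4", "4"),
    ("2 + 2 = 5 by carrying", "5"),
    ("2 + 2 = 5 again", "5"),
    ("2 + 2 = 6", "6"),
]
pair = select_pair(rollouts, reference_answer="4")
```

The selected pair would then be placed in the teacher's context as hints, while the student sees only the bare question.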
