Training Large Reasoning Models Efficiently via Progressive Thought Encoding

Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao

TL;DR

This paper introduces Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches and makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.

Abstract

Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into fixed-size vector representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoRA-based fine-tuning and +29.9% over LRMs without fine-tuning on average, with up to +23.4 accuracy improvement on AIME2024/2025 under the same tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.
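
To make the fixed-size cache idea concrete, the toy sketch below keeps a sliding window of recent token vectors and folds each evicted vector into a single fixed-size summary, so memory stays constant as generation grows. It is only an illustration under stated assumptions: the class `BoundedCache`, its fields, and the running-mean update are hypothetical stand-ins chosen for simplicity, not the encoder the paper trains.

```python
from collections import deque


class BoundedCache:
    """Toy fixed-size cache (hypothetical, not the paper's implementation).

    Keeps the last `window` token vectors and folds every evicted vector
    into one summary vector of length `dim`.
    """

    def __init__(self, window: int, dim: int):
        self.window = window
        self.recent = deque()          # sliding window of recent token vectors
        self.summary = [0.0] * dim     # fixed-size state for evicted tokens
        self.evicted = 0               # number of tokens folded into the summary

    def append(self, vec):
        self.recent.append(list(vec))
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            self.evicted += 1
            n = self.evicted
            # Assumption: a running mean stands in for a learned encoder.
            self.summary = [s + (o - s) / n for s, o in zip(self.summary, old)]

    def memory_slots(self) -> int:
        # Window entries plus one summary vector, regardless of sequence length.
        return len(self.recent) + 1


cache = BoundedCache(window=4, dim=2)
for t in range(10):
    cache.append([float(t), 1.0])
```

After ten appended tokens the cache still holds only four recent vectors plus the summary. A plain sliding window would simply discard the six evicted vectors, which is the loss of long-context information the abstract attributes to such strategies.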

Paper Structure (15 sections, 7 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Overview of our method. During the rollout process, the model continuously learns the dropped tokens to achieve a balance between generation efficiency and long-term memory.
  • Figure 2: The computation of context state $S$.
  • Figure 3: Evaluation of Qwen2.5-7B-Instruct and DeepSeek-R1-Distill-Llama-8B models trained with different methods on four benchmarks. The maximum generation length is fixed at 3072 tokens, and the KV cache window length varies from 768 to 3072. Each value is the mean pass@1 score over five independent runs.
  • Figure 3: Training efficiency comparison across different maximum generation lengths during rollout.
  • Figure 4: Ablation study on (a) global token usage and (b) token dropping strategies.
  • ...and 3 more figures