Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models

Hung Le, Dai Do, Dung Nguyen, Svetha Venkatesh

TL;DR

Memory-R+ is a memory-augmented reinforcement learning framework that improves reasoning in tiny LLMs (one billion parameters or fewer). It maintains separate episodic memories for successful and failed reasoning and computes intrinsic rewards through kNN-based readouts, deriving exploitation and exploration signals without extensive external supervision. Trained with GRPO, the method improves reasoning accuracy and robustness on GSM8K and AI-MO while mitigating the reward and length collapse that plague other approaches. This lowers the barrier to RL-based reasoning in low-resource settings and demonstrates meaningful gains for models far smaller than those typically used for such techniques.

Abstract

Recent advances in fine-tuning large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks, particularly when paired with chain-of-thought (CoT) prompting. However, these successes have been largely demonstrated on large-scale models with billions of parameters, where a strong pretraining foundation ensures effective initial exploration. In contrast, RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively, often leading to suboptimal reasoning patterns. This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge, improving tiny LLMs in CoT reasoning tasks. Inspired by human memory-driven learning, our method exploits successful reasoning patterns stored in memory while allowing for controlled exploration to generate novel responses. Intrinsic rewards are computed efficiently using a kNN-based episodic memory, allowing the model to discover new reasoning strategies while quickly adapting to effective past solutions. Fine-tuning experiments on the GSM8K and AI-MO datasets demonstrate that our approach significantly enhances smaller LLMs' sample efficiency and generalization capability, making RL-based reasoning improvements more accessible in low-resource settings.

Paper Structure

This paper contains 38 sections, 22 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Memory-R$^+$ Architecture. Left: The LLM receives a query $q$ from training dataset $D$, and generates multiple responses. For each response $a$, in addition to outcome reward $R$ from an Answer Verifier, Memory-R$^+$ introduces intrinsic reward $R_{\text{mem}}$ based on episodic memory. Right: The query $q$ is used to query the failure memory $\mathcal{M}_f$ and success memory $\mathcal{M}_s$ using kNN (red arrows), resulting in corresponding retrieved responses. The intrinsic reward $R_{\text{mem}}$ is computed by comparing the current response $a$ to retrieved ones, encouraging novelty against failed responses (e.g., $a_{1,1}$, $a_{3,1}$, $a_{3,2}$) and rewarding similarity to successful ones (e.g., $a_{5,1}$, $a_{5,2}$, $a_{6,1}$, $a_{6,2}$).
  • Figure 2: Performance of fine-tuning Qwen2.5-0.5B-Instruct on AI-MO data. The test accuracy is evaluated at multiple checkpoints during training (mean$\pm$std. over 3 runs).
  • Figure 3: Reward Mode Collapse in Falcon3-1B-Instruct.
  • Figure 4: Response Length Collapse in Qwen2.5-0.5B-Instruct.
  • Figure 5: More Training Collapses in Qwen2.5-0.5B-Instruct during fine-tuning on the GSM8K (a) and AI-MO (b) datasets. The results have been smoothed for clarity.
  • ...and 1 more figure
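
The memory readout described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hashed bag-of-words `embed` function stands in for whatever text encoder the paper uses, each memory entry stores a single response per query (the figure shows several), and summing the exploitation and exploration terms is an assumed way to combine them. The names `EpisodicMemory` and `memory_reward` are hypothetical.

```python
import zlib
import numpy as np

DIM = 256  # assumed embedding size for the toy encoder


def embed(text):
    """Toy hashed bag-of-words embedding (assumption: stand-in for a real encoder)."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


class EpisodicMemory:
    """Stores (query, response) pairs; retrieval is kNN over query embeddings."""

    def __init__(self):
        self.keys = []    # query embeddings
        self.values = []  # response embeddings

    def add(self, query, response):
        self.keys.append(embed(query))
        self.values.append(embed(response))

    def retrieve(self, query, k):
        """Return the response embeddings of the k stored queries nearest to `query`."""
        if not self.keys:
            return np.empty((0, DIM))
        sims = np.stack(self.keys) @ embed(query)  # cosine similarity (unit vectors)
        top = np.argsort(-sims)[:k]
        return np.stack([self.values[i] for i in top])


def memory_reward(query, response, success_mem, failure_mem, k=2):
    """Assumed intrinsic reward: similarity to successes plus novelty against failures."""
    a = embed(response)
    succ = success_mem.retrieve(query, k)
    fail = failure_mem.retrieve(query, k)
    # Exploitation: reward closeness to the most similar retrieved successful response.
    exploit = float((succ @ a).max()) if len(succ) else 0.0
    # Exploration: reward distance from the most similar retrieved failed response.
    explore = float(1.0 - (fail @ a).max()) if len(fail) else 0.0
    return exploit + explore


# Usage: populate both memories, then score two candidate responses to a new query.
success_mem, failure_mem = EpisodicMemory(), EpisodicMemory()
success_mem.add("What is 2 + 3?", "2 + 3 = 5 so the answer is 5")
failure_mem.add("What is 2 + 3?", "the answer is 23")
print(memory_reward("What is 2 + 4?", "2 + 4 = 6 so the answer is 6", success_mem, failure_mem))
print(memory_reward("What is 2 + 4?", "the answer is 23", success_mem, failure_mem))
```

In this sketch, a response that repeats a stored failure earns no exploration bonus, while one that resembles a stored success earns a higher exploitation term. In training, such a score would be added to the verifier's outcome reward $R$ before the GRPO update; the weighting between the two is left open here.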