Learning to Reason in 13 Parameters

John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, Saeed Mahloujifar

TL;DR

This work tackles the problem of learning to reason with extremely small parameter updates. It introduces TinyLoRA, a parameter-efficient extension of LoRA that scales updates down to a single parameter per module, enabled by parameter sharing and a fixed random projection. Using reinforcement learning with GRPO on large backbones such as Qwen2.5-7B-Instruct, the authors report substantial gains on GSM8K and MATH with only a handful of trainable parameters, e.g., $13$ parameters ($26$ bytes in bf16) reaching 91% GSM8K accuracy. The findings suggest that RL provides a denser, reward-driven signal than SFT, allowing near-full-finetuning performance at a fraction of the parameter budget. Scaling trends indicate that larger models need even smaller updates, although the results are limited to the math domain and may depend on dataset characteristics.
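The sharing and fixed-random-projection idea can be illustrated with a minimal sketch. It is not the paper's exact parameterization: the linear form (a weighted sum of frozen random matrices), the toy matrix sizes, and the module names are all assumptions made here for illustration. What it shows is that a single 13-element vector, shared by every module, can drive a full-sized weight update in each one while the projections themselves are never trained.

```python
import random

def frozen_basis(n_params, d_out, d_in, seed):
    """Build n_params fixed random d_out x d_in matrices from a seed.

    These are never trained; they can be regenerated from the seed,
    so only the small vector theta needs to be stored.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
        for _ in range(n_params)
    ]

def delta_weight(theta, basis):
    """Weight update as a theta-weighted sum of the frozen basis matrices."""
    d_out, d_in = len(basis[0]), len(basis[0][0])
    return [
        [sum(t * P[i][k] for t, P in zip(theta, basis)) for k in range(d_in)]
        for i in range(d_out)
    ]

# The only trainable values, shared by all modules (assumed sharing scheme).
theta = [0.0] * 13

# Hypothetical module names; each module gets its own frozen basis.
module_seeds = {"layer0.q_proj": 0, "layer0.v_proj": 1}
bases = {
    name: frozen_basis(len(theta), 4, 6, seed)  # toy 4 x 6 weight matrices
    for name, seed in module_seeds.items()
}

# With theta at zero, every update is zero and the base model is unchanged.
updates = {name: delta_weight(theta, basis) for name, basis in bases.items()}
```

In this sketch the trainable state is `len(theta)` values no matter how many modules are adapted or how large their weight matrices are; an optimizer would update only `theta`.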

Abstract

Recent research has shown that language models can learn to reason, often via reinforcement learning. Some work even trains low-rank parameterizations for reasoning, but conventional LoRA cannot scale below the model dimension. We question whether even rank=1 LoRA is necessary for learning to reason and propose TinyLoRA, a method for scaling low-rank adapters to sizes as small as one parameter. Within our new parameterization, we are able to train the 8B parameter size of Qwen2.5 to 91% accuracy on GSM8K with only 13 trained parameters in bf16 (26 total bytes). We find this trend holds in general: we are able to recover 90% of performance improvements while training $1000\times$ fewer parameters across a suite of more difficult learning-to-reason benchmarks such as AIME, AMC, and MATH500. Notably, we are only able to achieve such strong performance with RL: models trained using SFT require $100$-$1000\times$ larger updates to reach the same performance.
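The size comparison in the abstract comes down to simple arithmetic, sketched below. A rank-$r$ LoRA adapter on a $d_{out} \times d_{in}$ matrix trains $r(d_{out} + d_{in})$ values, so even rank 1 costs on the order of the model dimension per adapted module. The dimension `d = 4096` is a hypothetical placeholder, not a figure from the paper; the 13-parameter count and the 2 bytes per bf16 value are the only numbers taken from the text.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp32": 4}

def lora_params(d_out, d_in, rank):
    """Trainable values in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_out + d_in)

def update_bytes(n_params, dtype):
    """Storage needed for the trained update in the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype]

d = 4096  # hypothetical model dimension, for illustration only
rank1_per_module = lora_params(d, d, rank=1)
tiny_total = 13  # total trained parameters reported in the abstract

print(rank1_per_module)                        # 8192
print(update_bytes(rank1_per_module, "bf16"))  # 16384
print(update_bytes(tiny_total, "bf16"))        # 26
```

Under these assumptions, a single rank-1 adapter on one square projection already needs several hundred times more trainable values than the entire 13-parameter update.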

Paper Structure (35 sections, 5 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Using Qwen2.5-7B-Instruct as a base model, our TinyLoRA achieves performance within 5% of full finetuning on GSM8K with only 13 parameters. Dashed lines indicate untrained and full-FT baselines.
  • Figure 2: Using Qwen2.5-7B-Instruct as a base model, SFT works best with larger update sizes of at least $1M$ parameters.
  • Figure 3: Minimum update size needed to reach a given fraction of peak performance vs. backbone model size. Larger models require smaller updates to reach, e.g., 95% of peak performance.
  • Figure 4: Performance ablation using Qwen2.5-3B-Instruct under extremely small update size (<1KB). Surprisingly, storing parameters in fp32 is most performant bit-for-bit.
  • Figure 5: TinyLoRA performance during training on MATH.
  • ...and 4 more figures